
[DevoxxFR2014] Git-Deliver: Streamlining Deployment Beyond Java Ecosystems

Lecturer

Arnaud Bétrémieux is a passionate developer with 18 years of experience, including 8 professionally, specializing in open-source technologies, GNU/Linux, and languages like Java, PHP, and Lisp. He works at Key Consulting, providing development, hosting, consulting, and expertise services. Sylvain Veyrié, with nearly a decade in Java platforms, serves as Director of Delivery at Transparency Rights Management, focusing on big data, and has held roles in development, project management, and training at Key Consulting.

Abstract

This article investigates git-deliver, a deployment tool leveraging Git’s integrity guarantees for simple, traceable, and atomic deployments across diverse languages. It dissects the tool’s mechanics, from remote setup to rollback features, and discusses customization via scripts and presets, emphasizing its role in replacing ad-hoc scripts in dynamic language projects.

Core Principles and Setup

Git-deliver emerges as a Bash script extending Git with a “deliver” subcommand, aiming for simplicity, reliability, efficiency, and universality in deployments. Targeting non-Java environments like Node.js, PHP, or Rails, it addresses the pitfalls of custom scripts that introduce risks in traceability and atomicity.

A deployment target equates to a Git remote over SSH. For instance, creating remotes for test and production environments involves commands like git remote add test deliver@test.example.fr:/appli and git remote add prod deliver@example.fr:/appli. Deliveries invoke git deliver <remote> <version>, where version can be a branch, commit SHA, or tag.

On the target server, git-deliver initializes a bare Git repository alongside a “delivered” directory containing clones for each deployment. Each clone includes Git metadata and a working copy checked out to the specified version. Symbolic links, particularly “current,” point to the latest clone, ensuring a fixed path for applications and atomic switches: the link updates instantaneously, avoiding partial states.

Directory names incorporate timestamps and abbreviated SHAs, facilitating quick identification of deployed versions. This structure preserves history, enabling audits and rollbacks.
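The atomic symlink flip described above can be sketched in Java with `java.nio.file` (a hypothetical illustration assuming a POSIX filesystem; git-deliver itself is a Bash script and does this with `ln` and `mv`):

```java
import java.io.IOException;
import java.nio.file.*;

class AtomicSymlinkSwitch {
    // Flip "current" to point at newTarget with no window where it is missing:
    // create a temporary symlink, then rename it over "current" in one atomic step.
    static void switchCurrent(Path base, Path newTarget) throws IOException {
        Path tmp = base.resolve("current.tmp");
        Files.deleteIfExists(tmp);
        Files.createSymbolicLink(tmp, newTarget);
        Files.move(tmp, base.resolve("current"),
                StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }

    // Simulate a delivery: create a clone directory under "delivered", then switch.
    static Path deploy(Path base, String versionDir) throws IOException {
        Path clone = Files.createDirectories(base.resolve("delivered").resolve(versionDir));
        switchCurrent(base, clone);
        return base.resolve("current");
    }
}
```

Because the rename is atomic, a process resolving the “current” path always sees either the old clone or the new one, never a half-updated state.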

Information Retrieval and Rollback Mechanisms

To monitor deployments, git-deliver offers a “status” option. Without arguments, it surveys all remotes, reporting the current commit SHA, tag if applicable, deployment timestamp, and deployer. It also verifies integrity, alerting to uncommitted changes that might indicate manual tampering.

Specifying a remote yields a detailed history of all deliveries, including directory identifiers. Additionally, git-deliver auto-tags each deployment in the local repository, annotating with execution logs and optional messages. Pushing these tags to a central repository shares deployment history team-wide.

Rollback supports recovery: git deliver rollback <remote> reverts to the previous version by updating the “current” symlink to the prior clone. For specific versions, provide the directory name. This leverages preserved clones, ensuring exact restoration even if files were altered post-deployment.

Customization and Extensibility

Deployments divide into stages (e.g., init-remote for first-time setup, post-symlink for post-switch actions), allowing user-provided scripts executed at each. For normal deliveries, scripts might install dependencies or migrate databases; for rollbacks, they handle reversals like database adjustments.

To foster reusability, git-deliver introduces “presets”—collections of stage scripts for frameworks like Rails or Flask. Dependencies between presets (e.g., Rails depending on Ruby) enable modular composition. The “init” command copies preset scripts into a .deliver directory at the project root, customizable and versionable via Git.

This extensibility accommodates varied workflows, such as compiling sources on-server for compiled languages, though git-deliver primarily suits interpreted ones.

Broader Impact on Deployment Practices

By harnessing Git’s push mechanics and integrity checks, git-deliver minimizes errors from manual interventions, ensuring deployments are reproducible and auditable. Its atomic nature prevents service disruptions, crucial for production environments.

While not yet supporting distributed deployments natively, scripts can orchestrate multi-server coordination. Future enhancements might incorporate remote groups for parallel pushes.

In production at Key Consulting, git-deliver demonstrates maturity beyond prototyping, offering a lightweight alternative to complex tools, promoting standardized practices across projects.


[DevoxxBE2013] Architecting Android Applications with Dagger

Jake Wharton, an Android engineering luminary at Square, champions Dagger, a compile-time dependency injector revolutionizing Java and Android modularity. Creator of Retrofit and Butter Knife, Jake elucidates Dagger’s divergence from reflection-heavy alternatives like Guice, emphasizing its speed and testability. His session overviews injection principles, Android-specific scoping, and advanced utilities like Lazy and Assisted Injection, arming developers with patterns for clean, verifiable code.

Dagger, Jake stresses, decouples class behaviors from dependencies, fostering reusable, injectable components. Through live examples, he builds a Twitter client, showcasing modules for API wrappers and HTTP clients, ensuring seamless integration.

Dependency Injection Fundamentals

Jake defines injection as externalizing object wiring, promoting loose coupling. He contrasts manual factories with Dagger’s annotation-driven graphs, where @Inject fields auto-resolve dependencies.

This pattern, Jake demonstrates, simplifies testing—mock modules swap implementations effortlessly, isolating units.
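The pattern can be shown without Dagger itself: the sketch below hand-writes the wiring that Dagger generates at compile time from `@Inject`, `@Module`, and component annotations (all class names here are illustrative, not from the talk):

```java
// A dependency expressed as an interface, so implementations can be swapped.
interface TwitterApi { String timeline(); }

class HttpTwitterApi implements TwitterApi {
    public String timeline() { return "live tweets"; }
}

class TweetViewer {
    private final TwitterApi api;               // injected, never constructed here
    TweetViewer(TwitterApi api) { this.api = api; }
    String render() { return "showing: " + api.timeline(); }
}

// Stand-in for a Dagger-generated object graph: it knows how to wire the pieces.
class ObjectGraph {
    TwitterApi twitterApi() { return new HttpTwitterApi(); }
    TweetViewer tweetViewer() { return new TweetViewer(twitterApi()); }
}
```

In a test, the production graph is bypassed entirely: `new TweetViewer(() -> "stub")` injects a fake `TwitterApi`, isolating the unit exactly as the talk describes.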

Dagger in Android Contexts

Android’s lifecycle demands scoping, Jake explains: @Singleton for app-wide instances, activity-bound for UI components. He constructs an app graph, injecting Twitter services into activities.

Fragments and services, he notes, inherit parent scopes, minimizing boilerplate while preserving encapsulation.

Advanced Features and Utilities

Dagger’s extras shine: Lazy<T> defers creation, and assisted injection blends factories with injection for parameterized objects. Jake demos provider methods in modules, binding interfaces dynamically.

JSR-330 compliance, augmented by @Module, ensures portability, though Jake clarifies Dagger’s compile-time limits preclude Guice’s AOP dynamism.

Testing and Production Tips

Unit tests leverage Mockito for mocks, Jake illustrates, verifying injections without runtime costs. Production graphs, he advises, tier via subcomponents, optimizing memory.

Dagger’s reflection-free speed, Jake concludes, suits resource-constrained Android, with Square’s hiring call underscoring real-world impact.


(long tweet) How to display / modify Oracle XDB port?

Oracle XDB listens on port 8080 by default… which is quite an issue for Java web developers, because it collides with the default port used by servlet engines such as Tomcat and Jetty.

To display Oracle XDB port, run:
[sql]select DBMS_XDB.GETHTTPPORT from dual;[/sql]

To change it (for instance, to set it to port 9090), run:
[sql]exec DBMS_XDB.SETHTTPPORT(9090);[/sql]

(long tweet) Could not find backup for factory javax.faces.context.FacesContextFactory.

Case

On deploying a JSF 2.2 / Primefaces 5 application on Jetty 9, I got the following error:
[java]java.lang.IllegalStateException: Could not find backup for factory javax.faces.context.FacesContextFactory.[/java]

The issue seems linked to Jetty, since I could not reproduce the issue on Tomcat 8.

Quickfix

In the web.xml, add the following block:
[xml] <listener>
<listener-class>com.sun.faces.config.ConfigureListener</listener-class>
</listener>[/xml]

[DevoxxFR2014] PIT: Assessing Test Effectiveness Through Mutation Testing

Lecturer

Alexandre Victoor is a Java developer with nearly 15 years of experience, currently serving as an architect at Société Générale. His expertise spans software development, testing practices, and integration of tools for code quality assurance.

Abstract

This article examines the limitations of traditional code coverage metrics and introduces PIT as a mutation testing tool to evaluate the true effectiveness of unit tests. It analyzes how PIT injects faults into code to verify if tests detect them, discusses integration with build tools and SonarQube, and explores performance considerations, providing a deeper understanding of enhancing test suites in software engineering.

Challenges in Traditional Testing Metrics

In software development, particularly when practicing Test-Driven Development (TDD), the emphasis is often on writing tests before implementing functionality. This approach, originally termed “test first,” underscores the critical role of tests as a specification that could theoretically allow recreation of production code if lost. However, assessing the quality of these tests remains challenging.

Common metrics like line coverage and branch coverage indicate which parts of the code are executed during testing but fail to reveal if tests adequately detect defects. For instance, consider a simple function calculating a client price by applying a margin to a market price. Achieving 100% line coverage with a test for a zero-margin scenario does not guarantee detection of errors, such as changing an addition to a subtraction, as the test might still pass.
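The margin example can be made concrete in Java. The sketch below (hypothetical code, not from the talk) shows a test with full line coverage that nonetheless cannot kill the mutant that flips the addition to a subtraction:

```java
class Pricer {
    static double clientPrice(double marketPrice, double margin) {
        return marketPrice + margin;   // PIT mutant: marketPrice - margin
    }
}

class PricerTests {
    // Weak: with margin 0, addition and subtraction give the same result,
    // so this test reaches 100% line coverage yet detects nothing.
    static boolean weakTest()   { return Pricer.clientPrice(100.0, 0.0) == 100.0; }

    // Strong: a non-zero margin distinguishes "+" from "-" and kills the mutant.
    static boolean strongTest() { return Pricer.clientPrice(100.0, 5.0) == 105.0; }

    // Simulate what PIT does: apply the mutation by hand and rerun the weak check.
    static boolean mutantSurvivesWeakTest() {
        double mutated = 100.0 - 0.0;  // mutated clientPrice(100.0, 0.0)
        return mutated == 100.0;       // weak test still passes: mutant survives
    }
}
```

Line coverage is identical for both tests; only mutation testing reveals that the first offers no protection.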

Complicating matters further, when introducing conditional logic or external dependencies mocked with frameworks like Mockito, 100% branch coverage can be attained without robust error detection. Default mock behaviors might always return zero, masking issues in conditional expressions. Thus, coverage metrics primarily highlight untested code but do not affirm the protective value of existing tests.

This gap necessitates advanced techniques to validate test efficacy, ensuring that modifications or bugs trigger failures. Mutation testing emerges as a solution, systematically introducing faults—termed mutants—into the code and observing if the test suite identifies them.

Implementing Mutation Testing with PIT

PIT, an open-source Java tool, operationalizes mutation testing by generating mutants and rerunning tests against each. If a test fails, the mutant is “killed,” indicating effective detection; if tests pass, the mutant “survives,” signaling a weakness in the test suite.

Integration into continuous integration pipelines is straightforward. After standard compilation and testing, PIT analyzes specified packages for code under test and corresponding test classes. It focuses on unit tests due to their speed and lack of side effects, avoiding interactions with databases or file systems that could complicate results.

PIT’s report details line-by-line coverage and mutation survival, highlighting areas where code executes but faults go undetected. Configuration options address common pitfalls: excluding logging statements to prevent false positives, as frameworks like Log4j or SLF4J calls do not impact functional outcomes; timeouts for mutants creating infinite loops; and parallel execution on multi-core machines to mitigate performance overhead from repeated test runs.

Optimizations include leveraging line coverage to run only relevant tests per mutant and incremental analysis to focus on changed code since the last run. These features make PIT viable for nightly builds, though not yet for every commit in fast-paced environments.

A SonarQube plugin extends PIT’s utility by creating violations for lines covered but not protected against mutants and introducing a “mutation coverage” metric. This represents the percentage of mutants killed; for example, 70% mutation coverage implies a 70% chance of detecting introduced anomalies.

Practical Implications and Recommendations

Adopting PIT requires team maturity in testing practices; starting with mutation testing without established TDD might be premature. For teams with solid unit tests, PIT reveals subtle deficiencies, encouraging refinements that bolster code reliability.

In real projects, well-TDD’ed code often shows high mutation coverage, aligning with 70-80% line coverage thresholds as acceptable benchmarks. Performance tuning, such as multi-threading and incremental modes, addresses scalability concerns.

Ultimately, PIT transforms testing from a coverage-focused exercise to one emphasizing defect detection, fostering more resilient software. Its ease of use—via command line, Ant, Gradle, or Maven—democratizes advanced quality assurance, urging developers to integrate it for comprehensive test validation.


[DevoxxBE2013] Lambda: A Peek Under the Hood

Brian Goetz, Java Language Architect at Oracle, offers an illuminating dissection of lambda expressions in Java SE 8, transcending syntactic sugar to reveal the sophisticated machinery powering this evolution. Renowned for Java Concurrency in Practice and leadership in JSR 335, Brian demystifies lambdas’ implementation atop invokedynamic from Java SE 7. His session, eschewing introductory fare, probes the VM’s strategies for efficiency, contrasting naive inner-class approaches with optimized bootstrapping and serialization.

Lambdas, Brian asserts, unlock expressive potential for applications and libraries, but their true prowess lies in performance rivaling or surpassing inner classes—without the bloat. Through benchmarks and code dives, he showcases flexibility and future-proofing, underscoring the iterative path to a robust design.

From Syntax to Bytecode: The Bootstrap Process

Brian traces lambdas’ lifecycle: source code desugars to invokedynamic callsites, embedding a “recipe” for instantiation. The bootstrap method, invoked once per callsite, crafts a classfile dynamically, caching for reuse.

This declarative embedding, Brian illustrates, avoids inner classes’ per-instance overhead, yielding leaner bytecode and faster captures—non-capturing lambdas hit 1.5x inner-class speeds in early benchmarks.
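The allocation difference is observable on the reference implementation. In HotSpot/OpenJDK (this is implementation behavior, not a language guarantee), a non-capturing lambda's callsite is bound to a single cached instance, while a capturing lambda generally allocates per evaluation:

```java
import java.util.function.Supplier;

class LambdaAllocation {
    static Supplier<String> nonCapturing() {
        return () -> "constant";   // no captured state: the metafactory can cache one instance
    }
    static Supplier<String> capturing(String value) {
        return () -> value;        // captures 'value': typically a fresh instance per call
    }
}
```

Calling `nonCapturing()` repeatedly returns the same object on HotSpot, which is exactly the per-callsite caching the bootstrap method enables; an anonymous inner class, by contrast, would allocate every time.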

Optimization Strategies and Capture Semantics

Capturing lambdas, Brian explains, leverage local variable slots via synthetic fields, minimizing allocations. He contrasts “eager” (immediate class creation) with “lazy” (deferred) strategies, favoring the latter for reduced startup.

Invokedynamic’s dynamic binding enables profile-guided refinements, promising ongoing gains. Brian’s throughput metrics affirm lambdas’ edge, even in capturing scenarios.

Serialization and Bridge Methods

Serializing lambdas invokes writeReplace to a serialized form, preserving semantics without runtime overhead. Brian demos bridge methods for functional interfaces, ensuring compatibility.

Default methods, he notes, extend interfaces safely, avoiding binary breakage—crucial for library evolution.

Lessons from Language Evolution

Brian reflects on Lambda’s odyssey: discarded ideas like inner-class syntactic variants paved the way for invokedynamic’s elegance. This resilience, he posits, exemplifies evolving languages amid obvious-but-flawed intuitions.

Project Lambda’s resources—OpenJDK docs, JCP reviews—invite deeper exploration, with binary builds for experimentation.


[DevoxxFR2014] Cassandra: Entering a New Era in Distributed Databases

Lecturer

Jonathan Ellis is the project chair of Apache Cassandra and co-founder of DataStax (formerly Riptano), a company providing professional support for Cassandra. With over five years of experience working on Cassandra, starting from its origins at Facebook, Jonathan has been instrumental in evolving it from a specialized system into a general-purpose distributed database. His expertise lies in high-performance, scalable data systems, and he frequently speaks on topics related to NoSQL databases and big data technologies.

Abstract

This article explores the evolution and key features of Apache Cassandra as presented in a comprehensive overview of its design, applications, and recent advancements. It delves into Cassandra’s architecture for handling time-series data, multi-data center deployments, and distributed counters, while highlighting its integration with Hadoop and the introduction of lightweight transactions and CQL. The analysis underscores Cassandra’s strengths in performance, availability, and scalability, providing insights into its practical implications for modern applications and future developments.

Introduction to Apache Cassandra

Apache Cassandra, initially developed at Facebook in 2008, has rapidly evolved into a versatile distributed database system. Originally designed to handle the inbox messaging needs of a social media platform, Cassandra has transcended its origins to become a general-purpose solution applicable across various industries. This transformation is evident in its adoption by companies like eBay, Adobe, and Constant Contact, where it manages high-velocity data with demands for performance, availability, and scalability.

The core appeal of Cassandra lies in its ability to manage vast amounts of data across multiple nodes without a single point of failure. Unlike traditional relational databases, Cassandra employs a peer-to-peer architecture, ensuring that every node in the cluster is identical and capable of handling read and write operations. This design philosophy stems from the need to support applications that require constant uptime and the ability to scale horizontally by adding more commodity hardware.

In practical terms, Cassandra excels in scenarios involving time-series data, which includes sequences of data points indexed in time order. Examples range from Internet of Things (IoT) sensor readings to user activity logs in applications and financial transaction records. These data types benefit from Cassandra’s efficient storage and retrieval mechanisms, which prioritize chronological ordering and rapid ingestion rates.

Architectural Design and Data Distribution

At the heart of Cassandra’s architecture is its data distribution model, which uses consistent hashing to partition data across nodes. Each row in Cassandra is identified by a primary key, which is hashed using the Murmur3 algorithm to produce a 128-bit token. This token determines which node is responsible for storing the row, mapping keys onto a virtual ring where nodes are assigned token ranges.

To enhance fault tolerance, Cassandra supports replication across multiple nodes. In a simple setup, replicas are placed by walking the ring clockwise, but production environments often employ rack-aware strategies to avoid placing multiple replicas on the same rack, mitigating risks from power or network failures. The introduction of virtual nodes (vnodes) in later versions allows each physical node to manage multiple token ranges, typically 256 per node, which balances load more evenly and simplifies cluster management.

Adding nodes to a cluster, known as bootstrapping, involves the new node randomly selecting tokens from existing nodes, followed by data streaming to transfer relevant partitions. This process occurs without service interruption, as existing nodes continue serving requests. Such mechanisms ensure linear scalability, where doubling the number of nodes roughly doubles the cluster’s capacity.
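The ring mechanics above can be sketched with a sorted map: a key lands on the first node at or clockwise after its token, and replicas are found by continuing around the ring. This is a minimal illustration only; `String.hashCode` stands in for Murmur3, `int` tokens stand in for Cassandra's 128-bit tokens, and vnodes and rack-awareness are omitted:

```java
import java.util.*;

class TokenRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node, int token) { ring.put(token, node); }

    List<String> replicasFor(String key, int replicationFactor) {
        // Cannot place more distinct replicas than there are distinct nodes.
        int rf = Math.min(replicationFactor, new HashSet<>(ring.values()).size());
        List<String> replicas = new ArrayList<>();
        // Walk clockwise starting from the key's token, wrapping at the ring's end.
        Iterator<String> it = ring.tailMap(key.hashCode()).values().iterator();
        while (replicas.size() < rf) {
            if (!it.hasNext()) it = ring.values().iterator();  // wrap around
            String node = it.next();
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }
}
```

Adding a node simply inserts another token into the map; only the keys in the range it takes over need to move, which is the property that makes bootstrapping non-disruptive.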

For multi-data center deployments, Cassandra optimizes cross-data center communication by sending updates to a single replica in the remote center, which then locally replicates the data. This approach minimizes bandwidth usage across expensive wide-area networks, making it suitable for hybrid environments combining on-premises data centers with cloud providers like AWS or Google Cloud.

Handling Distributed Counters and Integration with Analytics

One of Cassandra’s innovative features is its support for distributed counters, addressing the challenge of maintaining accurate counts in a replicated system. Traditional increment operations can lead to lost updates if concurrent clients overwrite each other’s changes. Cassandra resolves this by partitioning the counter value across replicas, where each replica maintains its own sub-counter. The total value is computed by summing these partitions during reads.

This design ensures eventual consistency while allowing high-throughput updates. For instance, if a counter starts at 3 and two replicas each increment by 2, the partitions update independently, and gossip protocols propagate the changes, resulting in a final value of 7 across all replicas.
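The partitioned-counter idea reduces to a simple invariant: each replica increments only its own shard, and a read sums all shards. A minimal in-memory sketch (illustrative, not Cassandra's actual implementation, which also handles timestamps and gossip):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PartitionedCounter {
    // One sub-counter per replica; increments never overwrite another replica's value.
    private final Map<String, Long> shards = new ConcurrentHashMap<>();

    void increment(String replica, long delta) {
        shards.merge(replica, delta, Long::sum);   // lost-update-free per replica
    }

    long value() {
        // A read sums the partitions, mirroring how Cassandra computes the total.
        return shards.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

Replaying the article's example: a counter at 3 with two replicas each adding 2 yields 7, because the sub-counters merge by addition rather than overwriting one another.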

Cassandra’s integration with Hadoop further extends its utility for analytical workloads. Beyond simple input formats for MapReduce jobs, Cassandra can partition a cluster into segments for operational workloads and others for analytics, automatically handling replication between them. This setup is ideal for recommendation systems, such as suggesting related products based on purchase history, where Hadoop computes correlations and replicates results back to the operational nodes.

Advancements in Transactions and Query Language

Prior to version 2.0, Cassandra lacked traditional transactions, relying on external lock managers like ZooKeeper for atomic operations. This approach introduced complexities, such as handling client failures during lock acquisition. To address this, Cassandra introduced lightweight transactions in version 2.0, enabling conditional inserts and updates using the Paxos consensus algorithm.

Paxos ensures fault-tolerant agreement among replicas, requiring four round trips per transaction, which increases latency. Thus, lightweight transactions are recommended sparingly, only when atomicity is critical, such as ensuring unique user account creation. The syntax integrates seamlessly with Cassandra Query Language (CQL), resembling SQL but omitting joins to maintain single-node query efficiency.

CQL, introduced in version 2.0, enhances developer productivity by providing a familiar interface for schema definition and querying. It supports collections (sets, lists, maps) for denormalization, avoiding the need for joins. Version 2.1 adds user-defined types and collection indexing, allowing nested structures and queries like selecting songs containing the tag “blues.”

Implications for Application Development

Cassandra’s design choices have profound implications for building resilient applications. Its emphasis on availability and partition tolerance aligns with the CAP theorem, prioritizing these over strict consistency in distributed settings. This makes it suitable for global applications where downtime is unacceptable.

For developers, features like triggers and virtual nodes reduce operational overhead, while CQL lowers the learning curve compared to thrift-based APIs. However, challenges remain, such as managing eventual consistency and avoiding overuse of transactions to preserve performance.

In production, companies like eBay leverage Cassandra for time-series data and multi-data center setups, citing its efficiency in bandwidth-constrained environments. Adobe uses it for audience management in the cloud, processing vast datasets with high availability.

Future Directions and Conclusion

Looking ahead, Cassandra continues to evolve, with version 2.1 introducing enhancements like new keywords for collection queries and improved indexing. The beta releases indicate stability, paving the way for broader adoption.

In conclusion, Cassandra represents a paradigm shift in database technology, offering scalable, high-performance solutions for modern data challenges. Its architecture, from consistent hashing to lightweight transactions, provides a robust foundation for applications demanding reliability across distributed environments. As organizations increasingly handle big data, Cassandra’s blend of simplicity and power positions it as a cornerstone for future innovations.


[DevoxxFR2013] Lily: Big Data for Dummies – A Comprehensive Journey into Democratizing Apache Hadoop and HBase for Enterprise Java Developers

Lecturers

Steven Noels stands as one of the most visionary figures in the evolution of open-source Java ecosystems, having co-founded Outerthought in the early 2000s with a mission to push the boundaries of content management, RESTful architecture, and scalable data systems. His flagship creation, Daisy CMS, became a cornerstone for large-scale, multilingual content platforms used by governments and global enterprises, demonstrating that Java could power mission-critical, document-centric applications at internet scale. But Noels’ ambition extended far beyond traditional CMS. Recognizing the seismic shift toward big data in the late 2000s, he pivoted Outerthought—and later NGDATA—toward building tools that would make the Apache Hadoop ecosystem accessible to the average enterprise Java developer. Lily, launched in 2010, was the culmination of this vision: a platform that wrapped the raw power of HBase and Solr into a cohesive, Java-friendly abstraction layer, eliminating the need for MapReduce expertise or deep systems programming.

Bruno Guedes, an enterprise Java architect at SFEIR with over a decade of experience in distributed systems and search infrastructure, brought the practitioner’s perspective to the stage. Having worked with Lily from its earliest alpha versions, Guedes had deployed it in production environments handling millions of records, integrating it with legacy Java EE applications, Spring-based services, and real-time analytics pipelines. His hands-on experience—debugging schema migrations, tuning SolrCloud clusters, and optimizing HBase compactions—gave him unique insight into both the promise and the pitfalls of big data adoption in conservative enterprise settings. Together, Noels and Guedes formed a perfect synergy: the visionary architect and the battle-tested engineer, delivering a presentation that was equal parts inspiration and practical engineering.

Abstract

This article represents an exhaustively elaborated, deeply extended, and comprehensively restructured expansion of Steven Noels and Bruno Guedes’ seminal 2012 DevoxxFR presentation, “Lily, Big Data for Dummies”, transformed into a definitive treatise on the democratization of big data technologies for the Java enterprise. Delivered in a bilingual format that reflected the global nature of the Apache community, the original talk introduced Lily as a groundbreaking platform that unified Apache HBase’s scalable, distributed storage with Apache Solr’s full-text search and analytics capabilities, all through a clean, type-safe Java API. The core promise was radical in its simplicity: enterprise Java developers could build petabyte-scale, real-time searchable data systems without writing a single line of MapReduce, without mastering Zookeeper quorum mechanics, and without abandoning the comforts of POJOs, annotations, and IDE autocompletion.

This expanded analysis delves far beyond the original demo to explore the philosophical foundations of Lily’s design, the architectural trade-offs in integrating HBase and Solr, the real-world production patterns that emerged from early adopters, and the lessons learned from scaling Lily to billions of records. It includes detailed code walkthroughs, performance benchmarks, schema evolution strategies, and failure mode analyses.

EDIT:
Updated for the 2025 landscape, this piece maps Lily’s legacy concepts to modern equivalents—Apache HBase 2.5, SolrCloud 9, OpenSearch, Delta Lake, Trino, and Spring Data Hadoop—while preserving the original vision of big data for the rest of us. Through rich narratives, architectural diagrams, and forward-looking speculation, this work serves not just as a historical archive, but as a practical guide for any Java team contemplating the leap into distributed, searchable big data systems.

The Big Data Barrier in 2012: Why Hadoop Was Hard for Java Developers

To fully grasp Lily’s significance, one must first understand the state of big data in 2012. The Apache Hadoop ecosystem—launched in 2006—was already a proven force in internet-scale companies like Yahoo, Facebook, and Twitter. HDFS provided fault-tolerant, distributed storage. MapReduce offered a programming model for batch processing. HBase, modeled after Google’s Bigtable, delivered random, real-time read/write access to massive datasets. And Solr, forked from Lucene, powered full-text search at scale.

Yet for the average enterprise Java developer, this stack was inaccessible. Writing a MapReduce job required:
– Learning a functional programming model in Java that felt alien to OO practitioners.
– Mastering job configuration, input/output formats, and partitioners.
– Debugging distributed failures across dozens of nodes.
– Waiting minutes to hours for job completion.

HBase, while promising real-time access, demanded:
– Manual row key design to avoid hotspots.
– Deep knowledge of compaction, splitting, and region server tuning.
– Integration with Zookeeper for coordination.

Solr, though more familiar, required:
– Separate schema.xml and solrconfig.xml files.
– Manual index replication and sharding.
– Complex commit and optimization strategies.

The result? Big data remained the domain of specialized data engineers, not the Java developers who built the business logic. Lily was designed to change that.

Lily’s Core Philosophy: Big Data as a First-Class Java Citizen

At its heart, Lily was built on a simple but powerful idea: big data should feel like any other Java persistence layer. Just as Spring Data made MongoDB, Cassandra, or Redis accessible via repositories and annotations, Lily aimed to make HBase and Solr feel like JPA with superpowers.

The Three Pillars of Lily

Steven Noels articulated Lily’s architecture in three interconnected layers:

  1. The Storage Layer (HBase)
    Lily used HBase as its primary persistence engine, storing all data as versioned, column-family-based key-value pairs. But unlike raw HBase, Lily abstracted away row key design, column family management, and versioning policies. Developers worked with POJOs, and Lily handled the mapping.

  2. The Indexing Layer (Solr)
    Every mutation in HBase triggered an asynchronous indexing event to Solr. Lily maintained tight consistency between the two systems, ensuring that search results reflected the latest data within milliseconds. This was achieved through a message queue (Kafka or RabbitMQ) and idempotent indexing.

  3. The Java API Layer
    The crown jewel was Lily’s type-safe, annotation-driven API. Developers defined their data model using plain Java classes:

@LilyRecord
public class Customer {
    @LilyId
    private String id;

    @LilyField(family = "profile")
    private String name;

    @LilyField(family = "profile")
    private int age;

    @LilyField(family = "activity", indexed = true)
    private List<String> recentSearches;

    @LilyFullText
    private String bio;
}

The @LilyRecord annotation told Lily to persist this object in HBase. @LilyField specified column families and indexing behavior. @LilyFullText triggered Solr indexing. No XML. No schema files. Just Java.

The Lily Repository: Spring Data, But for Big Data

Lily’s LilyRepository interface was modeled after Spring Data’s CrudRepository, but with big data superpowers:

public interface CustomerRepository extends LilyRepository<Customer, String> {
    List<Customer> findByName(String name);

    @Query("age:[* TO 30]")
    List<Customer> findYoungCustomers();

    @Query("bio:java AND recentSearches:hadoop")
    List<Customer> findJavaHadoopEnthusiasts();
}

Behind the scenes, Lily:
– Translated method names to HBase scans.
– Converted @Query annotations to Solr queries.
– Executed searches across sharded SolrCloud clusters.
– Returned fully hydrated POJOs.
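Lily's actual translation logic was internal, but the first step (deriving a filter from a method name like findByName) can be sketched with a toy in-memory translator. All names here are illustrative, not Lily's API:

```java
import java.util.*;

public class MethodNameTranslator {
    // Derives the record field from a "findByX" method name
    // ("findByName" -> "name"), as a derived-query translator would.
    static String fieldFor(String methodName) {
        if (!methodName.startsWith("findBy")) {
            throw new IllegalArgumentException("not a derived query: " + methodName);
        }
        String field = methodName.substring("findBy".length());
        return Character.toLowerCase(field.charAt(0)) + field.substring(1);
    }

    // Applies the derived filter to in-memory records (field -> value maps),
    // standing in for the HBase scan the real framework would issue.
    static List<Map<String, Object>> execute(String methodName, Object value,
                                             List<Map<String, Object>> records) {
        String field = fieldFor(methodName);
        List<Map<String, Object>> matches = new ArrayList<>();
        for (Map<String, Object> record : records) {
            if (value.equals(record.get(field))) {
                matches.add(record);
            }
        }
        return matches;
    }
}
```

The real framework would compile the derived field into a scan or Solr query rather than filtering in memory, but the name-to-field convention is the same.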

Bruno Guedes demonstrated this in a live demo:

CustomerRepository repo = lily.getRepository(CustomerRepository.class);
repo.save(new Customer("1", "Alice", 28, Arrays.asList("java", "hadoop"), "Java dev at NGDATA"));
List<Customer> results = repo.findJavaHadoopEnthusiasts();

The entire operation—save, index, search—took under 50ms on a 3-node cluster.

Under the Hood: How Lily Orchestrated HBase and Solr

Lily’s magic was in its orchestration layer. When a save() was called:
1. The POJO was serialized to HBase Put operations.
2. The mutation was written to HBase with a version timestamp.
3. A change event was published to a message queue.
4. A Solr indexer consumed the event and updated the search index.
5. Near-real-time consistency was guaranteed via HBase’s WAL and Solr’s soft commits.
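The five steps above can be reduced to a minimal in-memory sketch, with maps standing in for HBase and Solr and a queue for the message bus (illustrative names, not Lily's API):

```java
import java.util.*;
import java.util.concurrent.*;

public class DualWriteSketch {
    final Map<String, String> store = new ConcurrentHashMap<>();   // stands in for HBase
    final Map<String, String> index = new ConcurrentHashMap<>();   // stands in for Solr
    final Queue<String> events = new ConcurrentLinkedQueue<>();    // stands in for the message queue

    // Steps 1-3: write the mutation to the primary store, then publish a change event.
    void save(String id, String doc) {
        store.put(id, doc);
        events.add(id);
    }

    // Steps 4-5: the indexer consumes events and re-reads the current document,
    // so replaying the same event twice is harmless (idempotent indexing).
    void drainIndexer() {
        String id;
        while ((id = events.poll()) != null) {
            index.put(id, store.get(id));
        }
    }
}
```

Because the indexer re-reads the current document instead of applying an event payload, duplicate or replayed events leave the index unchanged, which is what makes the indexing idempotent.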

For reads:
– findById → HBase Get.
– findByName → HBase scan with secondary index.
– @Query → Solr query with HBase post-filtering.

This dual-write, eventual consistency model was a deliberate trade-off for performance and scalability.

Schema Evolution and Versioning: The Enterprise Reality

One of Lily’s most enterprise-friendly features was schema evolution. In HBase, adding a column family required manual admin intervention. In Lily, it was automatic:

// Version 1
@LilyField(family = "profile")
private String email;

// Version 2
@LilyField(family = "profile")
private String phone; // New field, no migration needed

Lily stored multiple versions of the same record, allowing old code to read new data and vice versa. This was critical for rolling deployments in large organizations.

Production Patterns and Anti-Patterns

Bruno Guedes shared war stories from production:
Hotspot avoidance: Never use auto-incrementing IDs. Use hashed or UUID-based keys.
Index explosion: @LilyFullText on large fields → Solr bloat. Use @LilyField(indexed = true) for structured search.
Compaction storms: Schedule major compactions during low traffic.
Zookeeper tuning: Increase tick time for large clusters.
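The hashed-key advice can be sketched as a small salting helper; the exact prefix scheme below is an assumption for illustration, not something the talk specified:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeys {
    // Prefixes a sequential id with one hash byte so that consecutive writes
    // spread across regions instead of all landing on the same "hot" one.
    static String saltedKey(String id) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            return String.format("%02x-%s", hash[0], id);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // MD5 is always available in the JDK
        }
    }
}
```

The salt is deterministic (derived from the id itself), so reads can recompute the full row key without a lookup table.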

The Lily Ecosystem in 2012

Lily shipped with:
Lily CLI for schema inspection and cluster management.
Lily Maven Plugin for deploying schemas.
Lily SolrCloud Integration with automatic sharding.
Lily Kafka Connect for streaming data ingestion.

Lily’s Legacy After 2018: Where the Ideas Live On

EDIT:
Although Lily itself was archived in 2018, its core concepts continue to thrive in modern tools.

The original HBase POJO mapping is now embodied in Spring Data Hadoop.

Lily’s Solr integration has evolved into SolrJ + OpenSearch.

The repository pattern that Lily pioneered is carried forward by Spring Data R2DBC.

Schema evolution, once a key Lily feature, is now handled by Apache Atlas.

Finally, Lily’s near-real-time search capability lives on through the Elasticsearch Percolator.

Conclusion: Big Data Doesn’t Have to Be Hard

Steven Noels closed with a powerful message:

“Big data is not about MapReduce. It’s not about Zookeeper. It’s about solving business problems at scale. Lily proved that Java developers can do that—without becoming data engineers.”

EDIT:
In 2025, as lakehouse architectures, real-time analytics, and AI-driven search dominate, Lily’s vision of big data as a first-class Java citizen remains more relevant than ever.

Links

PostHeaderIcon [DevoxxFR2013] MongoDB and Mustache: Toward the Death of the Cache? A Comprehensive Case Study in High-Traffic, Real-Time Web Architecture

Lecturers

Mathieu Pouymerol and Pierre Baillet were the technical backbone of Fotopedia, a photo-sharing platform that, at its peak, served over five million monthly visitors using a Ruby on Rails application that had been in production for six years. Mathieu, armed with degrees from École Centrale Paris and a background in building custom data stores for dictionary publishers, brought a deep understanding of database design, indexing, and performance optimization. Pierre, also from Centrale and with experience at Cambridge, had spent nearly a decade managing infrastructure, tuning Tomcat, configuring memcached, and implementing geoDNS systems. Together, they faced the ultimate challenge: keeping a legacy Rails monolith responsive under massive, unpredictable traffic while maintaining content freshness and developer velocity.

Abstract

This article presents an exhaustively detailed expansion of Mathieu Pouymerol and Pierre Baillet’s 2013 DevoxxFR presentation, “MongoDB et Mustache, vers la mort du cache ?”, reimagined as a definitive case study in high-traffic web architecture and the evolution of caching strategies. The Fotopedia team inherited a Rails application plagued by slow ORM queries, complex cache invalidation logic, and frequent stale data. Their initial response—edge-side includes (ESI), fragment caching, and multi-layered memcached—bought time but introduced fragility and operational overhead. The breakthrough came from a radical rethinking: use MongoDB as a real-time document store and Mustache as a logic-less templating engine to assemble pages dynamically, eliminating cache for the most volatile content.

This analysis walks through every layer of their architecture: from database schema design to template composition, from CDN integration to failure mode handling. It includes performance metrics, post-mortem analyses, and lessons learned from production incidents. Updated for 2025, it maps their approach to modern tools: MongoDB 7.0 with Atlas, server-side rendering with HTMX, edge computing via Cloudflare Workers, and Spring Boot with Mustache, offering a complete playbook for building cache-minimized, real-time web applications at scale.

The Legacy Burden: A Rails Monolith Under Siege

Fotopedia’s core application was built on Ruby on Rails 2.3, a framework that, while productive for startups, began to show its age under heavy load. The database layer relied on MySQL with aggressive sharding and replication, but ActiveRecord queries were slow, and joins across shards were impractical. The presentation layer rendered 15–20 partials per page, each with its own caching logic. The result was a cache dependency graph so complex that a single user action—liking a photo—could invalidate dozens of cache keys across multiple servers.

The team’s initial strategy was defense in depth:
Varnish at the edge with ESI for including dynamic fragments.
Memcached for fragment and row-level caching.
Custom invalidation daemons to purge stale cache entries.

But this created a house of cards. A missed invalidation led to stale comments. A cache stampede during a traffic spike brought the database to its knees. As Pierre put it, “We were not caching to improve performance. We were caching to survive.”

The Paradigm Shift: Real-Time Data with MongoDB

The turning point came when the team migrated dynamic, user-generated content—photos, comments, tags, likes—to MongoDB. Unlike MySQL, MongoDB stored data as flexible JSON-like documents, allowing embedded arrays and atomic updates:

{
  "_id": "photo_123",
  "title": "Sunset",
  "user_id": "user_456",
  "tags": ["paris", "sunset"],
  "likes": 1234,
  "comments": [
    { "user": "Alice", "text": "Gorgeous!", "timestamp": "2013-04-01T12:00:00Z" }
  ]
}

This schema eliminated joins and enabled single-document reads for most pages. Updates used atomic operators:

db.photos.updateOne(
  { _id: "photo_123" },
  { $inc: { likes: 1 }, $push: { comments: { user: "Bob", text: "Nice!" } } }
);

Indexes on user_id, tags, and timestamp ensured sub-millisecond query performance.

Mustache: The Logic-Less Templating Revolution

The second pillar was Mustache, a templating engine that enforced separation of concerns by allowing no logic in templates—only iteration and conditionals:

{{#photo}}
  <h1>{{title}}</h1>
  <img src="{{url}}" alt="{{title}}" />
  <p>By {{user.name}} • {{likes}} likes</p>
  <ul class="comments">
    {{#comments}}
      <li><strong>{{user}}</strong>: {{text}}</li>
    {{/comments}}
  </ul>
{{/photo}}

Because templates contained no business logic, they could be cached indefinitely in Varnish. Only the data changed—and that came fresh from MongoDB on every request.

data = mongo.photos.find(_id: params[:id]).first
html = Mustache.render(template, data)
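The same separation can be sketched in Java with a toy placeholder renderer (a deliberate simplification of what Mustache actually supports, using a hypothetical TinyTemplate class):

```java
import java.util.Map;

public class TinyTemplate {
    // Replaces {{name}} placeholders with values from the data map. Since the
    // template holds no logic, it can be cached forever while data stays fresh.
    static String render(String template, Map<String, String> data) {
        String out = template;
        for (Map.Entry<String, String> entry : data.entrySet()) {
            out = out.replace("{{" + entry.getKey() + "}}", entry.getValue());
        }
        return out;
    }
}
```

The template string is a constant, so the per-request cost is only the data fetch and the substitution pass, which is exactly the trade Fotopedia made.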

The Hybrid Architecture: Cache Where It Makes Sense

The final system was a hybrid of caching and real-time rendering:
Static assets (CSS, JS, images) → CDN with long TTL.
Static page fragments (headers, footers, sidebars) → Varnish ESI with 1-hour TTL.
Dynamic content (photo, comments, likes) → MongoDB + Mustache, no cache.

This reduced cache invalidation surface by 90% and average response time from 800ms to 180ms.

2025: The Evolution of Cache-Minimized Architecture

EDIT:
The principles pioneered by Fotopedia are now mainstream:
Server-side rendering with HTMX for dynamic updates.
Edge computing with Cloudflare Workers to assemble pages.
MongoDB Atlas with change streams for real-time UIs.
Spring Boot + Mustache for Java backends.

Links

PostHeaderIcon [DevoxxBE2013] CQRS for Great Good

Oliver Wolf, principal consultant and executive board member at INNOQ, challenges conventional architectures with CQRS (Command-Query Responsibility Segregation). A SOA and Java expert, Oliver traces CQRS’s evolution from CQS, demonstrating incremental adoption—from read-write separation to event sourcing. His session, enriched with examples, equips developers to rethink data flows, optimizing for asymmetric workloads in banking and beyond.

CQRS decouples commands (writes) from queries (reads), enabling tailored models. Oliver illustrates phased implementation, culminating in event-sourced systems for auditability and scalability.

From CQS to CQRS: Foundational Concepts

Oliver recalls CQS—Bertrand Meyer’s principle segregating mutators from inspectors. CQRS extends this, allowing distinct read/write models. He demos a simple e-commerce app, splitting a unified model into command (order placement) and query (inventory views).

This separation, Oliver explains, resolves impedance mismatches, enhancing performance.
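A minimal sketch of that split, with hypothetical class names: the command side records order placements and pushes updates into a denormalized read model optimized for queries.

```java
import java.util.*;

public class OrderCqrsSketch {
    // Write side: handles commands and propagates changes to the read side.
    static class CommandModel {
        final List<String> placedOrders = new ArrayList<>();
        final QueryModel readSide;
        CommandModel(QueryModel readSide) { this.readSide = readSide; }
        void placeOrder(String item) {
            placedOrders.add(item);            // handle the command
            readSide.onOrderPlaced(item);      // sync the read model
        }
    }

    // Read side: a denormalized view shaped for inventory-style queries.
    static class QueryModel {
        final Map<String, Integer> orderedCounts = new HashMap<>();
        void onOrderPlaced(String item) { orderedCounts.merge(item, 1, Integer::sum); }
        int countFor(String item) { return orderedCounts.getOrDefault(item, 0); }
    }
}
```

In a real system the two models would live in separate stores and sync asynchronously; here the direct call stands in for that propagation.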

Incremental Adoption Strategies

A phased rollout minimizes risk: Oliver advises starting with asymmetric persistence, using separate stores for reads and writes. He showcases materialized views, synced via background jobs.

Advanced steps introduce event sourcing: commands emit events, replayed for state reconstruction, ensuring immutability.

Event Sourcing and Distribution

Event sourcing captures changes as immutable logs, Oliver illustrates, rebuilding state on demand. Distribution follows: client/server variants, with web frontends querying dedicated services.
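A minimal event-sourcing sketch of that idea, using a toy account whose balance is never stored, only replayed (illustrative names, not from the talk):

```java
import java.util.*;

public class EventSourcedAccount {
    // Immutable event log: each command appends a delta; nothing is updated in place.
    private final List<Integer> log = new ArrayList<>();

    public void deposit(int amount)  { log.add(amount); }
    public void withdraw(int amount) { log.add(-amount); }

    // Current state is rebuilt on demand by replaying every event from the start.
    public int balance() {
        int state = 0;
        for (int event : log) state += event;
        return state;
    }
}
```

Because the log is append-only, the full history doubles as an audit trail, which is the auditability benefit Oliver highlights.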

In banking, Oliver notes, CQRS optimizes configurable systems, balancing risk with extensibility.

Guidelines for Application

Oliver urges starting small: identify read-heavy operations, segregate gradually. Avoid over-engineering; CQRS suits complex domains, not simple CRUD.

Community examples, he shares, validate phased approaches, with INNOQ projects exploring hybrid models.

Links: