Posts Tagged ‘BigData’
[DevoxxFR2025] Spark 4 and Iceberg: The New Standard for All Your Data Projects
The world of big data is constantly evolving, with new technologies emerging to address the challenges of managing and processing ever-increasing volumes of data. Apache Spark has long been a dominant force in big data processing, and its evolution continues with Spark 4. Complementing this is Apache Iceberg, a modern table format that is rapidly becoming the standard for managing data lakes. Pierre Andrieux from Capgemini and Houssem Chihoub from Databricks joined forces to demonstrate how the combination of Spark 4 and Iceberg is set to revolutionize data projects, offering improved performance, enhanced data management capabilities, and a more robust foundation for data lakes.
Spark 4: Boosting Performance and Data Lake Support
Pierre and Houssem highlighted the major new features and enhancements in Apache Spark 4. A key area of improvement is performance, with a new query engine and automatic query optimization designed to accelerate data processing workloads. Spark 4 also brings enhanced native support for data lakes, simplifying interactions with data stored in formats like Parquet and ORC on distributed file systems. This tighter integration improves efficiency and reduces the need for external connectors or complex configurations. The presentation included performance comparisons illustrating the gains Spark 4 achieves, particularly when working with large datasets in a data lake environment.
Apache Iceberg Demystified: A Next-Generation Table Format
Apache Iceberg addresses the limitations of traditional table formats used in data lakes. Houssem demystified Iceberg, explaining that it provides a layer of abstraction on top of data files, bringing database-like capabilities to data lakes. Key features of Iceberg include:
– Time Travel: The ability to query historical snapshots of a table, enabling reproducible reports and simplified data rollbacks.
– Schema Evolution: Support for safely evolving table schemas over time (e.g., adding, dropping, or renaming columns) without requiring costly data rewrites.
– Dynamic Partitioning: Iceberg automatically manages data partitioning, optimizing query performance based on query patterns without manual intervention.
– Atomic Commits: Ensures that changes to a table are atomic, providing reliability and consistency even in distributed environments.
These features solve many of the pain points associated with managing data lakes, such as schema management complexities, difficulty in handling updates and deletions, and lack of transactionality.
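The snapshot mechanism behind time travel and atomic commits can be illustrated with a small toy model. This is not Iceberg's actual metadata format (which tracks manifest files, not in-memory rows); it only sketches the idea that every commit publishes a new immutable snapshot, so older table states remain readable:

```python
# Toy model of Iceberg-style snapshots (illustrative only, not the real
# Iceberg metadata layout): each commit appends an immutable snapshot,
# so any historical table state can be read back by snapshot id.
class ToyTable:
    def __init__(self):
        self.snapshots = []  # list of (snapshot_id, rows)

    def commit(self, rows):
        """Atomically publish a new immutable snapshot."""
        snapshot_id = len(self.snapshots)
        self.snapshots.append((snapshot_id, list(rows)))
        return snapshot_id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or 'time travel' to an older one."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snapshot_id = self.snapshots[-1][0]
        return self.snapshots[snapshot_id][1]

table = ToyTable()
v0 = table.commit([{"id": 1, "amount": 10}])
v1 = table.commit([{"id": 1, "amount": 10}, {"id": 2, "amount": 25}])
print(table.read())    # latest snapshot: two rows
print(table.read(v0))  # time travel: the original single-row state
```

Because readers only ever see a fully committed snapshot, concurrent writers never expose a half-written table state, which is the essence of Iceberg's atomicity guarantee.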
The Power of Combination: Spark 4 and Iceberg
The true power lies in combining the processing capabilities of Spark 4 with the data management features of Iceberg. Pierre and Houssem demonstrated through concrete use cases and practical demonstrations how this combination enables building modern data pipelines. They showed how Spark 4 can efficiently read from and write to Iceberg tables, leveraging Iceberg’s features like time travel for historical analysis or schema evolution for seamlessly integrating data with changing structures. The integration allows data engineers and data scientists to work with data lakes with greater ease, reliability, and performance, making this combination a compelling new standard for data projects. The talk covered best practices for implementing data pipelines with Spark 4 and Iceberg and discussed potential pitfalls to avoid, providing attendees with the knowledge to leverage these technologies effectively in their own data initiatives.
Links:
- Pierre Andrieux: https://www.linkedin.com/in/pierre-andrieux-33135810b/
- Houssem Chihoub: https://www.linkedin.com/in/houssemchihoub/
- Capgemini: https://www.capgemini.com/
- Databricks: https://www.databricks.com/
- Apache Spark: https://spark.apache.org/
- Apache Iceberg: https://iceberg.apache.org/
- Devoxx France LinkedIn: https://www.linkedin.com/company/devoxx-france/
- Devoxx France Bluesky: https://bsky.app/profile/devoxx.fr
- Devoxx France Website: https://www.devoxx.fr/
[DevoxxPL2022] Accelerating Big Data: Modern Trends Enable Product Analytics • Boris Trofimov
Boris Trofimov, a big data expert from Sigma Software, delivered an insightful presentation at Devoxx Poland 2022, exploring modern trends in big data that enhance product analytics. With experience building high-load systems like the AOL data platform for Verizon Media, Boris provided a comprehensive overview of how data platforms are evolving. His talk covered architectural innovations, data governance, and the shift toward serverless and ELT (Extract, Load, Transform) paradigms, offering actionable insights for developers navigating the complexities of big data.
The Evolving Role of Data Platforms
Boris began by demystifying big data, often misconstrued as a magical solution for business success. He clarified that big data resides within data platforms, which handle ingestion, processing, and analytics. These platforms typically include data sources, ETL (Extract, Transform, Load) pipelines, data lakes, and data warehouses. Boris highlighted the growing visibility of big data beyond its traditional boundaries, with data engineers playing increasingly critical roles. He noted the rise of cross-functional teams, inspired by Martin Fowler’s ideas, where subdomains drive team composition, fostering collaboration between data and backend engineers.
The convergence of big data and backend practices was a key theme. Boris pointed to technologies like Apache Kafka and Spark, which are now shared across both domains, enabling mutual learning. He emphasized that modern data platforms must balance complexity with efficiency, requiring specialized expertise to avoid pitfalls like project failures due to inadequate practices.
Architectural Innovations: From Lambda to Delta
Boris delved into big data architectures, starting with the Lambda architecture, which separates data processing into speed (real-time) and batch layers for high availability. While effective, Lambda’s complexity increases development and maintenance costs. As an alternative, he introduced the Kappa architecture, which simplifies processing by using a single streaming layer, reducing latency but potentially sacrificing availability. Boris then highlighted the emerging Delta architecture, which leverages data lakehouses—hybrid systems combining data lakes and warehouses. Technologies like Snowflake and Databricks support Delta, minimizing data hops and enabling both batch and streaming workloads with a single storage layer.
The Delta architecture’s rise reflects the growing popularity of data lakehouses, which Boris praised for their ability to handle raw, processed, and aggregated data efficiently. By reducing technological complexity, Delta enables faster development and lower maintenance, making it a compelling choice for modern data platforms.
Data Mesh and Governance
Boris introduced data mesh as a response to monolithic data architectures, drawing parallels with domain-driven design. Data mesh advocates for breaking down data platforms into bounded contexts, each owned by a dedicated team responsible for its pipelines and decisions. This approach avoids the pitfalls of monolithic pipelines, such as chaotic dependencies and scalability issues. Boris outlined four “temptations” to avoid: building monolithic pipelines, combining all pipelines into one application, creating chaotic pipeline networks, and mixing domains in data tables. Data mesh, he argued, promotes modularity and ownership, treating data as a product.
Data governance, or “data excellence,” was another critical focus. Boris stressed the importance of practices like data monitoring, quality validation, and retention policies. He advocated for a proactive approach, where engineers address these concerns early to ensure platform reliability and cost-efficiency. By treating data governance as a checklist, teams can mitigate risks and enhance platform maturity.
Serverless and ELT: Simplifying Big Data
Boris highlighted the shift toward serverless technologies and ELT paradigms. Serverless solutions, available across transformation, storage, and analytics tiers, reduce infrastructure management burdens, allowing faster time-to-market. He cited AWS and other cloud providers as enablers, noting that while not always cost-effective, serverless minimizes maintenance efforts. Similarly, ELT—where transformation occurs after loading data into a warehouse—leverages modern databases like Snowflake and BigQuery. Unlike traditional ETL, ELT reduces latency and complexity by using database capabilities for transformations, making it ideal for early-stage projects.
Boris also noted the resurgence of SQL as a domain-specific language across big data tiers, from transformation to governance. By building frameworks that express business logic in SQL, developers can accelerate feature delivery, despite SQL’s perceived limitations. He emphasized that well-designed SQL queries can be powerful, provided engineers avoid poorly structured code.
Productizing Big Data and Business Intelligence
The final trend Boris explored was the productization of big data solutions. He likened this to Intel’s microprocessor revolution, where standardized components accelerated hardware development. Companies like Absorber offer “data platform as a service,” enabling rapid construction of data pipelines through drag-and-drop interfaces. While limited for complex use cases, such solutions cater to organizations seeking quick deployment. Boris also discussed the rise of serverless business intelligence (BI) tools, which support ELT and allow cross-cloud data queries. These tools, like Mode and Tableau, enable self-service analytics, reducing the need for custom platforms in early stages.
[DevoxxPL2022] Before It’s Too Late: Finding Real-Time Holes in Data • Chayim Kirshen
Chayim Kirshen, a veteran of the startup ecosystem and client manager at Redis, captivated audiences at Devoxx Poland 2022 with a dynamic exploration of real-time data pipeline challenges. Drawing from his experience with high-stakes environments, including a 2010 stock exchange meltdown, Chayim outlined strategies to ensure data integrity and performance in large-scale systems. His talk provided actionable insights for developers, emphasizing the importance of storing raw data, parsing in real time, and leveraging technologies like Redis to address data inconsistencies.
The Perils of Unclean Data
Chayim began with a stark reality: data is rarely clean. Recounting a 2010 incident where hackers compromised a major stock exchange’s API, he highlighted the cascading effects of unreliable data on real-time markets. Data pipelines face issues like inconsistent formats (CSV, JSON, XML), shifting sources (e.g., changing API endpoints), and unreliable services, with modern systems often tolerating over a thousand minutes of downtime annually. These challenges disrupt real-time processing, which is critical for applications like stock exchanges or ad-bidding networks requiring sub-100ms responses. Chayim advocated treating data as programmable code, enabling developers to address issues systematically rather than reactively.
Building Robust Data Pipelines
To tackle these issues, Chayim proposed a structured approach to data pipeline design. Storing raw data indefinitely—whether in S3, Redis, or other storage—ensures a fallback for reprocessing. Parsing data in real time, using defined schemas, allows immediate usability while preserving raw inputs. Bulk changes, such as SQL bulk inserts or Redis pipelines, reduce network overhead, critical for high-throughput systems. Chayim emphasized scheduling regular backfills to re-import historical data, ensuring consistency despite source changes. For example, a stock exchange’s ticker symbol updates (e.g., Fitbit to Google) require ongoing reprocessing to maintain accuracy. Horizontal scaling, using disposable nodes, enhances availability and resilience, avoiding single points of failure.
Real-Time Enrichment and Redis Integration
Data enrichment, such as calculating stock bid-ask spreads or market cap changes, should occur post-ingestion to avoid slowing the pipeline. Chayim showcased Redis, particularly its Gears and JSON modules, for real-time data processing. Redis acts as a buffer, storing raw JSON and replicating it to traditional databases like PostgreSQL or MySQL. Using Redis Gears, developers can execute functions within the database, minimizing network costs and enabling rapid enrichment. For instance, calculating a stock’s daily percentage change can run directly in Redis, streamlining analytics. Chayim highlighted Python-based tools like Celery for scheduling backfills and enrichments, allowing asynchronous processing and failure retries without disrupting the main pipeline.
Scaling and Future-Proofing
Chayim stressed horizontal scaling to distribute workloads geographically, placing data closer to users for low-latency access, as seen in ad networks. By using Redis for real-time writes and offloading to workers via Celery, developers can manage millions of daily entries, such as stock ticks, without performance bottlenecks. Scheduled backfills address data gaps, like API schema changes (e.g., integer to string conversions), by reprocessing raw data. This approach, combined with infrastructure-as-code tools like Terraform, ensures scalability and adaptability, allowing organizations to focus on business logic rather than data management overhead.
[ScalaDaysNewYork2016] Large-Scale Graph Analysis with Scala and Akka
Ben Fonarov, a Big Data specialist at Capital One, presented a compelling case study at Scala Days New York 2016 on building a large-scale graph analysis engine using Scala, Akka, and HBase. Ben detailed the architecture and implementation of Athena, a distributed time-series graph system designed to deliver integrated, real-time data to enterprise users, addressing the challenges of data overload in a banking environment.
Addressing Enterprise Data Needs
Ben Fonarov opened by outlining the motivation behind Athena: the need to provide integrated, real-time data to users at Capital One. Unlike traditional table-based thinking, Athena represents data as a graph, modeling entities like accounts and transactions to align with business concepts. Ben highlighted the challenges of data overload, with multiple data warehouses and ETL processes generating vast datasets. Athena’s visual interface allows users to define graph schemas, ensuring data is accessible in a format that matches their mental models.
Architectural Considerations
Ben described two architectural approaches to building Athena. The naive implementation used a single actor to process queries, which was insufficient for production-scale loads. The robust solution leveraged an Akka cluster, distributing query processing across nodes for scalability. A query parser translated user requests into graph traversals, while actors managed tasks and streamed results to users. This design ensured low latency and scalability, handling up to 200 billion nodes efficiently.
Streaming and Optimization
A key feature of Athena, Ben explained, was its ability to stream results in real time, avoiding the batch processing limitations of frameworks like TinkerPop’s Gremlin. By using Akka’s actor-based concurrency, Athena processes queries incrementally, delivering results as they are computed. Ben discussed optimizations, such as limiting the number of nodes per actor to prevent bottlenecks, and plans to integrate graph algorithms like PageRank to enhance analytical capabilities.
Future Directions and Community Engagement
Ben concluded by sharing future plans for Athena, including adopting a Gremlin-like DSL for graph traversals and integrating with tools like Spark and H2O. He emphasized the importance of community feedback, inviting developers to join Capital One’s data team to contribute to Athena’s evolution. Running on AWS EC2, Athena represents a scalable solution for enterprise graph analysis, poised to transform how banks handle complex data relationships.
[ScalaDaysNewYork2016] Spark 2.0: Evolving Big Data Processing with Structured APIs
Apache Spark, a cornerstone in big data processing, has significantly shaped the landscape of distributed computing with its functional programming paradigm rooted in Scala. In a keynote address at Scala Days New York 2016, Matei Zaharia, the creator of Spark, elucidated the evolution of Spark’s APIs, culminating in the transformative release of Spark 2.0. This presentation highlighted how Spark has progressed from its initial vision of a unified engine to a more sophisticated platform with structured APIs like DataFrames and Datasets, enabling enhanced performance and usability for developers worldwide.
The Genesis of Spark’s Vision
Spark was conceived with two primary ambitions: to create a unified engine capable of handling diverse big data workloads and to offer a concise, language-integrated API that mirrors working with local data collections. Matei explained that unlike the earlier MapReduce model, which was groundbreaking yet limited, Spark extended its capabilities to support iterative computations, streaming, and interactive data exploration. This unification was critical, as prior to Spark, developers often juggled multiple specialized systems, each with its own complexities, making integration cumbersome. By leveraging Scala’s functional constructs, Spark introduced Resilient Distributed Datasets (RDDs), allowing developers to perform operations like map, filter, and join with ease, abstracting the complexities of distributed computing.
The success of this vision is evident in Spark’s widespread adoption. With over a thousand organizations deploying it, including on clusters as large as 8,000 nodes, Spark has become the most active open-source big data project. Its libraries for SQL, streaming, machine learning, and graph processing have been embraced, with 75% of surveyed organizations using multiple components, demonstrating the power of its unified approach.
Challenges with the Functional API
Despite its strengths, the original RDD-based API presented challenges, particularly in optimization and efficiency. Matei highlighted that the functional API, while intuitive, conceals the semantics of computations, making it difficult for the engine to optimize operations automatically. For instance, operations like groupByKey can lead to inefficient memory usage, as they materialize large intermediate datasets unnecessarily. This issue is exemplified in a word count example where groupByKey creates a sequence of values before summing them, consuming excessive memory when a simpler reduceByKey could suffice.
Moreover, the reliance on Java objects for data storage introduces significant memory overhead. Matei illustrated this with a user class example, where headers, pointers, and padding consume roughly two-thirds of the allocated memory, a critical concern for in-memory computing frameworks like Spark. These challenges underscored the need for a more structured approach to data processing.
Introducing Structured APIs: DataFrames and Datasets
To address these limitations, Spark introduced DataFrames and Datasets, structured APIs built atop the Spark SQL engine. These APIs impose a defined schema on data, enabling the engine to understand and optimize computations more effectively. DataFrames, dynamically typed, resemble tables in a relational database, supporting operations like filtering and aggregation through a domain-specific language (DSL). Datasets, statically typed, extend this concept by aligning closely with Scala’s type system, allowing developers to work with case classes for type safety.
Matei demonstrated how DataFrames enable declarative programming, where operations are expressed as logical plans that Spark optimizes before execution. For example, filtering users by state generates an abstract syntax tree, allowing Spark to optimize the query plan rather than executing operations eagerly. This declarative nature, inspired by data science tools like Pandas, distinguishes Spark’s DataFrames from similar APIs in R and Python, enhancing performance through lazy evaluation and optimization.
Optimizing Performance with Project Tungsten
A significant focus of Spark 2.0 is Project Tungsten, which addresses the shifting bottlenecks in big data systems. Matei noted that while I/O was the primary constraint in 2010, advancements in storage (SSDs) and networking (10-40 gigabit) have shifted the focus to CPU efficiency. Tungsten employs three strategies: runtime code generation, cache locality exploitation, and off-heap memory management. By encoding data in a compact binary format, Spark reduces memory overhead compared to Java objects. Code generation, facilitated by the Catalyst optimizer, produces specialized bytecode that operates directly on binary data, improving CPU performance. These optimizations ensure Spark can leverage modern hardware trends, delivering significant performance gains.
Structured Streaming: A Unified Approach to Real-Time Processing
Spark 2.0 introduces structured streaming, a high-level API that extends the benefits of DataFrames and Datasets to streaming computations. Matei emphasized that real-world streaming applications often involve batch and interactive workloads, such as updating a database for a web application or applying a machine learning model. Structured streaming treats streams as infinite DataFrames, allowing developers to use familiar APIs to define computations. The engine then incrementally executes these plans, maintaining state and handling late data efficiently. For instance, a batch job grouping data by user ID can be adapted to streaming by changing the input source, with Spark automatically updating results as new data arrives.
This approach simplifies the development of continuous applications, enabling seamless integration of streaming, batch, and interactive processing within a single API, a capability that sets Spark apart from other streaming engines.
Future Directions and Community Engagement
Looking ahead, Matei outlined Spark’s commitment to evolving its APIs while maintaining compatibility. The structured APIs will serve as the foundation for new libraries, facilitating interoperability across languages like Python and R. Additionally, Spark’s data source API allows applications to seamlessly switch between storage systems like Hive, Cassandra, or JSON, enhancing flexibility. Matei also encouraged community participation, noting that Databricks offers a free Community Edition with tutorials to help developers explore Spark’s capabilities.
[DevoxxFR2014] Apache Spark: A Unified Engine for Large-Scale Data Processing
Lecturer
Patrick Wendell serves as a co-founder of Databricks and stands as a core contributor to Apache Spark. He previously worked as an engineer at Cloudera. Patrick possesses extensive experience in distributed systems and big data frameworks. He earned a degree from Princeton University. Patrick has played a pivotal role in transforming Spark from a research initiative at UC Berkeley’s AMPLab into a leading open-source platform for data analytics and machine learning.
Abstract
This article thoroughly examines Apache Spark’s architecture as a unified engine that handles batch processing, interactive queries, streaming data, and machine learning workloads. The discussion delves into the core abstractions of Resilient Distributed Datasets (RDDs), DataFrames, and Datasets. It explores key components such as Spark SQL, MLlib, and GraphX. Through detailed practical examples, the analysis highlights Spark’s in-memory computation model, its fault tolerance mechanisms, and its seamless integration with Hadoop ecosystems. The article underscores Spark’s profound impact on building scalable and efficient data workflows in modern enterprises.
The Genesis of Spark and the RDD Abstraction
Apache Spark originated to overcome the shortcomings of Hadoop MapReduce, especially its heavy dependence on disk-based storage for intermediate results. This disk-centric approach severely hampered performance in iterative algorithms and interactive data exploration. Spark introduces Resilient Distributed Datasets (RDDs), which are immutable, partitioned collections of objects that support in-memory computations across distributed clusters.
RDDs possess five defining characteristics. First, they maintain a list of partitions that distribute data across nodes. Second, they provide a function to compute each partition based on its parent data. Third, they track dependencies on parent RDDs to enable lineage-based recovery. Fourth, they optionally include partitioners for key-value RDDs to control data placement. Fifth, they specify preferred locations to optimize data locality and reduce network shuffling.
This lineage-based fault tolerance mechanism eliminates the need for data replication. When a partition becomes lost due to node failure, Spark reconstructs it by replaying the sequence of transformations recorded in the dependency graph. For instance, consider loading a log file and counting error occurrences:
val logFile = sc.textFile("hdfs://logs/access.log")
val errors = logFile.filter(line => line.contains("error")).count()
Here, the filter transformation builds a logical plan lazily, while the count action triggers the actual computation. This lazy evaluation strategy allows Spark to optimize the entire execution plan, minimizing unnecessary data movement and improving resource utilization.
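Lineage-based recovery can be illustrated with a toy model (not Spark's internals): each derived dataset records its parent and the transformation that produced it, so a lost partition is rebuilt by replaying that lineage rather than restoring a replica.

```python
# Toy sketch of lineage-based recovery: each "RDD" records its parent and
# the transformation used to derive it, so a lost partition can be rebuilt
# by replaying the lineage instead of restoring a replica.
class ToyRDD:
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions      # list of lists of records
        self.parent = parent
        self.transform = transform        # function applied per partition

    def map_partitions(self, fn):
        return ToyRDD([fn(p) for p in self.partitions],
                      parent=self, transform=fn)

    def recover(self, i):
        """Recompute partition i from the parent via recorded lineage."""
        self.partitions[i] = self.transform(self.parent.partitions[i])

base = ToyRDD([["error: a", "ok"], ["error: b"]])
errors = base.map_partitions(lambda p: [l for l in p if "error" in l])

errors.partitions[0] = None  # simulate losing a partition on node failure
errors.recover(0)            # replay lineage to rebuild just that partition
print(errors.partitions)     # [['error: a'], ['error: b']]
```

Only the lost partition is recomputed, which is why lineage tracking is cheaper than replicating every intermediate dataset.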
Evolution to Structured Data: DataFrames and Datasets
Spark 1.3 introduced DataFrames, which represent tabular data with named columns and leverage the Catalyst optimizer for query planning. DataFrames build upon RDDs but add schema information and enable relational-style operations through Spark SQL. Developers can execute ANSI-compliant SQL queries directly:
SELECT user, COUNT(*) AS visits
FROM logs
GROUP BY user
ORDER BY visits DESC
The Catalyst optimizer applies sophisticated rule-based and cost-based optimizations, such as pushing filters down to the data source, pruning unnecessary columns, and reordering joins for efficiency. Spark 1.6 further advanced the abstraction layer with Datasets, which combine the type safety of RDDs with the optimization capabilities of DataFrames:
case class LogEntry(user: String, timestamp: Long, action: String)
import spark.implicits._
val ds: Dataset[LogEntry] = logRdd.toDS()
ds.groupBy("user").count().show()
This unified API allows developers to work with structured and unstructured data using a single programming model. It significantly reduces the cognitive overhead of switching between different paradigms for batch processing and real-time analytics.
The Component Ecosystem: Specialized Libraries
Spark’s modular design incorporates several high-level libraries that address specific workloads while sharing the same underlying engine.
Spark SQL serves as a distributed SQL engine. It executes HiveQL and ANSI SQL on DataFrames. The library integrates seamlessly with the Hive metastore and supports JDBC/ODBC connections for business intelligence tools.
MLlib delivers a scalable machine learning library. It implements algorithms such as logistic regression, decision trees, k-means clustering, and collaborative filtering. The ML Pipeline API standardizes feature extraction, transformation, and model evaluation:
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingData)
GraphX extends the RDD abstraction to graph-parallel computation. It provides primitives for PageRank, connected components, and triangle counting using a Pregel-like API.
Spark Streaming enables real-time data processing through micro-batching. It treats incoming data streams as a continuous series of small RDD batches:
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKeyAndWindow(_ + _, Minutes(5))
This approach supports stateful stream processing with exactly-once semantics and integrates with Kafka, Flume, and Twitter.
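The windowed count above can be modeled with a small Python sketch (a toy model, not Spark Streaming): each incoming batch of lines becomes a per-batch word count, and a fixed-length deque keeps only the most recent batches, so aggregating over it mimics `reduceByKeyAndWindow`.

```python
from collections import Counter, deque

# Micro-batching sketch: each batch of lines becomes a small word-count,
# and a sliding window keeps the last N batch results, mimicking
# reduceByKeyAndWindow (toy model, not the Spark Streaming API).
WINDOW_BATCHES = 3
window = deque(maxlen=WINDOW_BATCHES)  # oldest batch falls out automatically

def process_batch(lines):
    batch_counts = Counter(w for line in lines for w in line.split())
    window.append(batch_counts)
    return sum(window, Counter())  # aggregate over the current window

process_batch(["spark streaming"])
process_batch(["spark kafka"])
totals = process_batch(["flume spark"])
print(totals["spark"])  # 3: 'spark' appeared once in each windowed batch
```

When a fourth batch arrives, the deque silently drops the oldest batch's counts, which is exactly the bookkeeping a time-based window performs over micro-batch RDDs.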
Performance Optimizations and Operational Excellence
Spark achieves up to 100x performance gains over MapReduce for iterative workloads due to its in-memory processing model. Key optimizations include:
- Project Tungsten: This initiative introduces whole-stage code generation and off-heap memory management to minimize garbage collection overhead.
- Adaptive Query Execution: Spark dynamically re-optimizes queries at runtime based on collected statistics.
- Memory Management: The unified memory manager dynamically allocates space between execution and storage.
Spark operates on YARN, Mesos, Kubernetes, or its standalone cluster manager. The driver-executor architecture centralizes scheduling while distributing computation, ensuring efficient resource utilization.
Real-World Implications and Enterprise Adoption
Spark’s unified engine eliminates the need for separate systems for ETL, SQL analytics, streaming, and machine learning. This consolidation reduces operational complexity and training costs. Data teams can use a single language—Scala, Python, Java, or R—across the entire data lifecycle.
Enterprises leverage Spark for real-time fraud detection, personalized recommendations, and predictive maintenance. Its fault-tolerant design and active community ensure reliability in mission-critical environments. As data volumes grow exponentially, Spark’s ability to scale linearly on commodity hardware positions it as a cornerstone of modern data architectures.
[DevoxxFR2014] Architecture and Utilization of Big Data at PagesJaunes
Lecturer
Jean-François Paccini serves as the Chief Technology Officer (CTO) at PagesJaunes Groupe, overseeing technological strategies for local information services. His leadership has driven the integration of big data technologies to enhance data processing and user experience in digital products.
Abstract
This article analyzes the strategic adoption of big data technologies at PagesJaunes, from initial convictions to practical implementations. It examines the architecture for audience data collection, innovative applications like GeoLive for real-time visualization, and machine learning for search relevance, while projecting future directions and implications for business intelligence.
Strategic Convictions and Initial Architecture
PagesJaunes, part of a group including Mappy and other local service entities, has transitioned to predominantly digital revenue, generating 70% of its 2014 turnover online. This shift produces abundant data from user interactions—over 140 million monthly searches, 3 million reviews, and nearly 1 million mobile visits—offering insights into user behavior adaptable in real-time.
The conviction driving big data adoption is the untapped value in this data “gold mine,” combined with accessible technologies like Hadoop. Rather than responding to specific business demands, the initiative stemmed from technological foresight: proving potential through modest investments in open-source tools and commodity hardware.
The initial opportunity arose from refactoring the audience data collection chain, which traditionally handled web server logs, application metrics, and mobile data via batch scripts and a columnar database. Challenges included delays (data often available only at D+2 to D+4) and error-recovery issues. The new architecture employs Flume collectors feeding a Hadoop cluster of about 50 nodes, storing 10 terabytes and processing 75 gigabytes daily, at a far lower cost than the legacy systems.
Innovative Applications: GeoLive and Beyond
To demonstrate value, the team developed GeoLive during an internal innovation contest, visualizing real-time searches on a French map. Each flashing point represents a query, delayed by about five minutes, illustrating media ubiquity across territories. Categories like “psychologist” or “dermatologist” highlight local concerns.
GeoLive created a “wow effect,” winning the contest and gaining executive enthusiasm. Industrialized for the company lobby and sales tools, it tangibly showcases search volume and coverage, shifting perceptions from abstract metrics to visual impact.
Building on this, big data extended to core operations via machine learning for search relevance. Users often phrase product or service queries ambiguously: the French query “riz à Marseille” (rice in Marseille) once surfaced funeral services (“rites funéraires”) instead of food retailers, likely a near-match on the French words. Traditional analysis covered only the top 10,000 queries manually; Hadoop enables exhaustive session examination, identifying weak queries through user reformulations.
Tools like Hive and custom developments, aided by a data scientist, model query fragility. This loop informs indexers to refine rules, detecting missing professionals and enhancing results continuously.
Future Projections and Organizational Impact
Looking forward, PagesJaunes aims to industrialize A/B testing for algorithm variants, real-time user segmentation, and fraud detection (e.g., scraping bots). Data journalism will leverage regional trends for insights.
Predictions include 90% of data intelligence projects adopting these technologies within 18 months, with Hadoop potentially replacing the corporate data warehouse for audience analytics. This evolution demands data scientist roles for sophisticated modeling, avoiding naive correlations.
The journey underscores big data’s role in fostering innovation, as seen in the “Make It” contest energizing cross-functional teams. Such events reveal creative potential, leading to production implementations and cultural shifts toward agility.
Implications for Digital Transformation
Big data at PagesJaunes exemplifies how convictions in data value and technology accessibility can drive transformation. From modest clusters to mission-critical applications, it enhances user experience and operational efficiency. Challenges like tool maturity for non-technical analysts persist, but evolving ecosystems promise broader accessibility.
Ultimately, this approach positions PagesJaunes to personalize experiences, introduce affinity services, and maintain competitiveness in local search, illustrating big data’s strategic imperative in digital economies.
[DevoxxFR2014] Cassandra: Entering a New Era in Distributed Databases
Lecturer
Jonathan Ellis is the project chair of Apache Cassandra and co-founder of DataStax (formerly Riptano), a company providing professional support for Cassandra. With over five years of experience working on Cassandra, starting from its origins at Facebook, Jonathan has been instrumental in evolving it from a specialized system into a general-purpose distributed database. His expertise lies in high-performance, scalable data systems, and he frequently speaks on topics related to NoSQL databases and big data technologies.
Abstract
This article explores the evolution and key features of Apache Cassandra as presented in a comprehensive overview of its design, applications, and recent advancements. It delves into Cassandra’s architecture for handling time-series data, multi-data center deployments, and distributed counters, while highlighting its integration with Hadoop and the introduction of lightweight transactions and CQL. The analysis underscores Cassandra’s strengths in performance, availability, and scalability, providing insights into its practical implications for modern applications and future developments.
Introduction to Apache Cassandra
Apache Cassandra, initially developed at Facebook in 2008, has rapidly evolved into a versatile distributed database system. Originally designed to handle the inbox messaging needs of a social media platform, Cassandra has transcended its origins to become a general-purpose solution applicable across various industries. This transformation is evident in its adoption by companies like eBay, Adobe, and Constant Contact, where it manages high-velocity data with demands for performance, availability, and scalability.
The core appeal of Cassandra lies in its ability to manage vast amounts of data across multiple nodes without a single point of failure. Unlike traditional relational databases, Cassandra employs a peer-to-peer architecture, ensuring that every node in the cluster is identical and capable of handling read and write operations. This design philosophy stems from the need to support applications that require constant uptime and the ability to scale horizontally by adding more commodity hardware.
In practical terms, Cassandra excels in scenarios involving time-series data, which includes sequences of data points indexed in time order. Examples range from Internet of Things (IoT) sensor readings to user activity logs in applications and financial transaction records. These data types benefit from Cassandra’s efficient storage and retrieval mechanisms, which prioritize chronological ordering and rapid ingestion rates.
Architectural Design and Data Distribution
At the heart of Cassandra’s architecture is its data distribution model, which uses consistent hashing to partition data across nodes. Each row is identified by a primary key, which is hashed with the Murmur3 algorithm to produce a token (a 64-bit value under the default Murmur3Partitioner). This token determines which node is responsible for storing the data, mapping keys onto a virtual ring where nodes are assigned token ranges.
To enhance fault tolerance, Cassandra supports replication across multiple nodes. In a simple setup, replicas are placed by walking the ring clockwise, but production environments often employ rack-aware strategies to avoid placing multiple replicas on the same rack, mitigating risks from power or network failures. The introduction of virtual nodes (vnodes) in later versions allows each physical node to manage multiple token ranges, typically 256 per node, which balances load more evenly and simplifies cluster management.
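The ring mechanics described above can be sketched in a few lines of Python. This is an illustrative toy, not Cassandra’s implementation: the hash function, vnode count, and class names are all stand-ins (Cassandra uses Murmur3; MD5 appears here only because it ships with the standard library).

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Illustrative stand-in for Murmur3: derive a 64-bit token from a key."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class TokenRing:
    """Toy consistent-hash ring with virtual nodes (vnodes)."""
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several token positions on the ring.
        self.ring = sorted(
            (token(f"{node}#{v}"), node) for node in nodes for v in range(vnodes)
        )
        self.tokens = [t for t, _ in self.ring]

    def owner(self, key: str) -> str:
        """Walk the ring clockwise to the first vnode at or past the key's token."""
        i = bisect.bisect_right(self.tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

ring = TokenRing(["node-a", "node-b", "node-c"])
placements = {k: ring.owner(k) for k in ("user:1", "user:2", "user:3")}
# Adding a node to the ring reassigns only a fraction of the key space.
```

The payoff of vnodes is visible when a node joins: with many small token ranges per node, the new node takes a little load from everyone rather than splitting one neighbor's range in half.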
Adding nodes to a cluster, known as bootstrapping, involves the new node randomly selecting tokens from existing nodes, followed by data streaming to transfer relevant partitions. This process occurs without service interruption, as existing nodes continue serving requests. Such mechanisms ensure linear scalability, where doubling the number of nodes roughly doubles the cluster’s capacity.
For multi-data center deployments, Cassandra optimizes cross-data center communication by sending updates to a single replica in the remote center, which then locally replicates the data. This approach minimizes bandwidth usage across expensive wide-area networks, making it suitable for hybrid environments combining on-premises data centers with cloud providers like AWS or Google Cloud.
Handling Distributed Counters and Integration with Analytics
One of Cassandra’s innovative features is its support for distributed counters, addressing the challenge of maintaining accurate counts in a replicated system. Traditional increment operations can lead to lost updates if concurrent clients overwrite each other’s changes. Cassandra resolves this by partitioning the counter value across replicas, where each replica maintains its own sub-counter. The total value is computed by summing these partitions during reads.
This design ensures eventual consistency while allowing high-throughput updates. For instance, if a counter starts at 3 and two replicas each increment it by 2, the per-replica partitions update independently and replication propagates the changes, yielding a final value of 7 on every replica.
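The partitioned-counter idea can be modeled with a toy Python class (names and structure are illustrative, not Cassandra’s code): each replica increments only its own shard, and a read sums all shards.

```python
class ShardedCounter:
    """Toy replica-partitioned counter: each replica owns one shard,
    increments touch only that shard, and reads sum every shard."""
    def __init__(self, replicas):
        self.shards = {r: 0 for r in replicas}

    def increment(self, replica, delta=1):
        # No cross-replica coordination: each replica mutates only its own shard.
        self.shards[replica] += delta

    def value(self):
        # The logical counter value is the sum of all per-replica shards.
        return sum(self.shards.values())

# Replaying the example from the text: a counter at 3, then two replicas add 2 each.
counter = ShardedCounter(["r1", "r2", "r3"])
counter.increment("r1", 3)   # initial value of 3, landed on replica r1
counter.increment("r1", 2)   # concurrent +2 handled by r1
counter.increment("r2", 2)   # concurrent +2 handled by r2
total = counter.value()      # 7, no increment lost
```

Because no two replicas ever write the same shard, concurrent increments cannot overwrite each other, which is exactly the lost-update problem the design avoids.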
Cassandra’s integration with Hadoop further extends its utility for analytical workloads. Beyond simple input formats for MapReduce jobs, Cassandra can partition a cluster into segments for operational workloads and others for analytics, automatically handling replication between them. This setup is ideal for recommendation systems, such as suggesting related products based on purchase history, where Hadoop computes correlations and replicates results back to the operational nodes.
Advancements in Transactions and Query Language
Prior to version 2.0, Cassandra lacked traditional transactions, relying on external lock managers like ZooKeeper for atomic operations. This approach introduced complexities, such as handling client failures during lock acquisition. To address this, Cassandra introduced lightweight transactions in version 2.0, enabling conditional inserts and updates using the Paxos consensus algorithm.
Paxos ensures fault-tolerant agreement among replicas, requiring four round trips per transaction, which increases latency. Thus, lightweight transactions are recommended sparingly, only when atomicity is critical, such as ensuring unique user account creation. The syntax integrates seamlessly with Cassandra Query Language (CQL), resembling SQL but omitting joins to maintain single-node query efficiency.
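The semantics of a lightweight transaction, as distinct from its Paxos machinery, can be pictured as a compare-and-set. The sketch below is a single-process analogy with hypothetical names; real LWTs reach the same decision through consensus across replicas:

```python
def insert_if_not_exists(table: dict, key, row):
    """Toy model of INSERT ... IF NOT EXISTS semantics: the write is applied
    only if no row already exists, and the caller learns whether it was
    applied (Cassandra surfaces this as an [applied] result column).
    Real lightweight transactions reach this decision via Paxos."""
    if key in table:
        return False, table[key]   # not applied; the winning row is returned
    table[key] = row
    return True, row

accounts = {}
applied, _ = insert_if_not_exists(accounts, "alice", {"email": "a@example.com"})
retry, existing = insert_if_not_exists(accounts, "alice", {"email": "other@example.com"})
# The first insert wins; the retry is rejected without clobbering the row.
```

The unique-account-creation use case from the talk maps directly onto this: two concurrent signups race, exactly one observes `applied == True`, and the loser can report the conflict instead of silently overwriting.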
CQL, which matured into its third revision (CQL3) over the releases leading up to 2.0, enhances developer productivity by providing a familiar, SQL-like interface for schema definition and querying. It supports collections (sets, lists, maps) for denormalization, avoiding the need for joins. Version 2.1 adds user-defined types and collection indexing, allowing nested structures and queries like selecting songs containing the tag “blues.”
Implications for Application Development
Cassandra’s design choices have profound implications for building resilient applications. Its emphasis on availability and partition tolerance aligns with the CAP theorem, prioritizing these over strict consistency in distributed settings. This makes it suitable for global applications where downtime is unacceptable.
For developers, features like triggers and virtual nodes reduce operational overhead, while CQL lowers the learning curve compared to thrift-based APIs. However, challenges remain, such as managing eventual consistency and avoiding overuse of transactions to preserve performance.
In production, companies like eBay leverage Cassandra for time-series data and multi-data center setups, citing its efficiency in bandwidth-constrained environments. Adobe uses it for audience management in the cloud, processing vast datasets with high availability.
Future Directions and Conclusion
Looking ahead, Cassandra continues to evolve, with version 2.1 introducing enhancements like new keywords for collection queries and improved indexing. The beta releases indicate stability, paving the way for broader adoption.
In conclusion, Cassandra represents a paradigm shift in database technology, offering scalable, high-performance solutions for modern data challenges. Its architecture, from consistent hashing to lightweight transactions, provides a robust foundation for applications demanding reliability across distributed environments. As organizations increasingly handle big data, Cassandra’s blend of simplicity and power positions it as a cornerstone for future innovations.
[DevoxxFR2013] Lily: Big Data for Dummies – A Comprehensive Journey into Democratizing Apache Hadoop and HBase for Enterprise Java Developers
Lecturers
Steven Noels stands as one of the most visionary figures in the evolution of open-source Java ecosystems, having co-founded Outerthought in the early 2000s with a mission to push the boundaries of content management, RESTful architecture, and scalable data systems. His flagship creation, Daisy CMS, became a cornerstone for large-scale, multilingual content platforms used by governments and global enterprises, demonstrating that Java could power mission-critical, document-centric applications at internet scale. But Noels’ ambition extended far beyond traditional CMS. Recognizing the seismic shift toward big data in the late 2000s, he pivoted Outerthought—and later NGDATA—toward building tools that would make the Apache Hadoop ecosystem accessible to the average enterprise Java developer. Lily, launched in 2010, was the culmination of this vision: a platform that wrapped the raw power of HBase and Solr into a cohesive, Java-friendly abstraction layer, eliminating the need for MapReduce expertise or deep systems programming.
Bruno Guedes, an enterprise Java architect at SFEIR with over a decade of experience in distributed systems and search infrastructure, brought the practitioner’s perspective to the stage. Having worked with Lily from its earliest alpha versions, Guedes had deployed it in production environments handling millions of records, integrating it with legacy Java EE applications, Spring-based services, and real-time analytics pipelines. His hands-on experience—debugging schema migrations, tuning SolrCloud clusters, and optimizing HBase compactions—gave him unique insight into both the promise and the pitfalls of big data adoption in conservative enterprise settings. Together, Noels and Guedes formed a perfect synergy: the visionary architect and the battle-tested engineer, delivering a presentation that was equal parts inspiration and practical engineering.
Abstract
This article represents an exhaustively elaborated, deeply extended, and comprehensively restructured expansion of Steven Noels and Bruno Guedes’ seminal 2012 DevoxxFR presentation, “Lily, Big Data for Dummies”, transformed into a definitive treatise on the democratization of big data technologies for the Java enterprise. Delivered in a bilingual format that reflected the global nature of the Apache community, the original talk introduced Lily as a groundbreaking platform that unified Apache HBase’s scalable, distributed storage with Apache Solr’s full-text search and analytics capabilities, all through a clean, type-safe Java API. The core promise was radical in its simplicity: enterprise Java developers could build petabyte-scale, real-time searchable data systems without writing a single line of MapReduce, without mastering Zookeeper quorum mechanics, and without abandoning the comforts of POJOs, annotations, and IDE autocompletion.
This expanded analysis delves far beyond the original demo to explore the philosophical foundations of Lily’s design, the architectural trade-offs in integrating HBase and Solr, the real-world production patterns that emerged from early adopters, and the lessons learned from scaling Lily to billions of records. It includes detailed code walkthroughs, performance benchmarks, schema evolution strategies, and failure mode analyses.
EDIT:
Updated for the 2025 landscape, this piece maps Lily’s legacy concepts to modern equivalents—Apache HBase 2.5, SolrCloud 9, OpenSearch, Delta Lake, Trino, and Spring Data Hadoop—while preserving the original vision of big data for the rest of us. Through rich narratives, architectural diagrams, and forward-looking speculation, this work serves not just as a historical archive, but as a practical guide for any Java team contemplating the leap into distributed, searchable big data systems.
The Big Data Barrier in 2012: Why Hadoop Was Hard for Java Developers
To fully grasp Lily’s significance, one must first understand the state of big data in 2012. The Apache Hadoop ecosystem—launched in 2006—was already a proven force in internet-scale companies like Yahoo, Facebook, and Twitter. HDFS provided fault-tolerant, distributed storage. MapReduce offered a programming model for batch processing. HBase, modeled after Google’s Bigtable, delivered random, real-time read/write access to massive datasets. And Solr, forked from Lucene, powered full-text search at scale.
Yet for the average enterprise Java developer, this stack was inaccessible. Writing a MapReduce job required:
– Learning a functional programming model in Java that felt alien to OO practitioners.
– Mastering job configuration, input/output formats, and partitioners.
– Debugging distributed failures across dozens of nodes.
– Waiting minutes to hours for job completion.
HBase, while promising real-time access, demanded:
– Manual row key design to avoid hotspots.
– Deep knowledge of compaction, splitting, and region server tuning.
– Integration with Zookeeper for coordination.
Solr, though more familiar, required:
– Separate schema.xml and solrconfig.xml files.
– Manual index replication and sharding.
– Complex commit and optimization strategies.
The result? Big data remained the domain of specialized data engineers, not the Java developers who built the business logic. Lily was designed to change that.
Lily’s Core Philosophy: Big Data as a First-Class Java Citizen
At its heart, Lily was built on a simple but powerful idea: big data should feel like any other Java persistence layer. Just as Spring Data made MongoDB, Cassandra, or Redis accessible via repositories and annotations, Lily aimed to make HBase and Solr feel like JPA with superpowers.
The Three Pillars of Lily
Steven Noels articulated Lily’s architecture in three interconnected layers:
– The Storage Layer (HBase)
Lily used HBase as its primary persistence engine, storing all data as versioned, column-family-based key-value pairs. But unlike raw HBase, Lily abstracted away row key design, column family management, and versioning policies. Developers worked with POJOs, and Lily handled the mapping.
– The Indexing Layer (Solr)
Every mutation in HBase triggered an asynchronous indexing event to Solr. Lily maintained tight consistency between the two systems, ensuring that search results reflected the latest data within milliseconds. This was achieved through a message queue (Kafka or RabbitMQ) and idempotent indexing.
– The Java API Layer
The crown jewel was Lily’s type-safe, annotation-driven API. Developers defined their data model using plain Java classes:
@LilyRecord
public class Customer {

    @LilyId
    private String id;

    @LilyField(family = "profile")
    private String name;

    @LilyField(family = "profile")
    private int age;

    @LilyField(family = "activity", indexed = true)
    private List<String> recentSearches;

    @LilyFullText
    private String bio;
}
The @LilyRecord annotation told Lily to persist this object in HBase. @LilyField specified column families and indexing behavior. @LilyFullText triggered Solr indexing. No XML. No schema files. Just Java.
The Lily Repository: Spring Data, But for Big Data
Lily’s LilyRepository interface was modeled after Spring Data’s CrudRepository, but with big data superpowers:
public interface CustomerRepository extends LilyRepository<Customer, String> {

    List<Customer> findByName(String name);

    @Query("age:[* TO 30]")
    List<Customer> findYoungCustomers();

    @Query("bio:java AND recentSearches:hadoop")
    List<Customer> findJavaHadoopEnthusiasts();
}
Behind the scenes, Lily:
– Translated method names to HBase scans.
– Converted @Query annotations to Solr queries.
– Executed searches across sharded SolrCloud clusters.
– Returned fully hydrated POJOs.
Bruno Guedes demonstrated this in a live demo:
CustomerRepository repo = lily.getRepository(CustomerRepository.class);
repo.save(new Customer("1", "Alice", 28, Arrays.asList("java", "hadoop"), "Java dev at NGDATA"));
List<Customer> results = repo.findJavaHadoopEnthusiasts();
The entire operation—save, index, search—took under 50ms on a 3-node cluster.
Under the Hood: How Lily Orchestrated HBase and Solr
Lily’s magic was in its orchestration layer. When a save() was called:
1. The POJO was serialized to HBase Put operations.
2. The mutation was written to HBase with a version timestamp.
3. A change event was published to a message queue.
4. A Solr indexer consumed the event and updated the search index.
5. Near-real-time consistency was guaranteed via HBase’s WAL and Solr’s soft commits.
For reads:
– findById → HBase Get.
– findByName → HBase scan with secondary index.
– @Query → Solr query with HBase post-filtering.
This dual-write, eventual consistency model was a deliberate trade-off for performance and scalability.
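The write path described above can be condensed into a toy Python model. All names here are illustrative and the real Lily implementation differed in the details, but the shape of the trade-off is the same: the primary store is updated synchronously, the search index catches up asynchronously.

```python
from collections import deque

class LilyStyleStore:
    """Toy dual-write pipeline: mutations hit the primary store first,
    then a change event is queued for an asynchronous search indexer."""
    def __init__(self):
        self.store = {}        # stands in for HBase
        self.index = {}        # stands in for Solr
        self.events = deque()  # stands in for Kafka/RabbitMQ

    def save(self, record_id, record):
        self.store[record_id] = record   # steps 1-2: durable write
        self.events.append(record_id)    # step 3: publish change event

    def run_indexer(self):
        # Steps 4-5: the indexer drains events and updates the search index.
        # Indexing is idempotent because each event re-reads current state.
        while self.events:
            rid = self.events.popleft()
            self.index[rid] = self.store[rid]

db = LilyStyleStore()
db.save("1", {"name": "Alice"})
stale = "1" not in db.index   # True: store and index briefly disagree
db.run_indexer()
fresh = db.index["1"]["name"] == "Alice"
```

The window where `stale` is true is the eventual-consistency gap the article mentions; in Lily it was kept to milliseconds via soft commits, but it never shrank to zero.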
Schema Evolution and Versioning: The Enterprise Reality
One of Lily’s most enterprise-friendly features was schema evolution. In HBase, adding a column family requires manual admin intervention. In Lily, it was automatic:
// Version 1
@LilyField(family = "profile")
private String email;
// Version 2
@LilyField(family = "profile")
private String phone; // New field, no migration needed
Lily stored multiple versions of the same record, allowing old code to read new data and vice versa. This was critical for rolling deployments in large organizations.
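The old-code-reads-new-data property can be illustrated with a small, hypothetical hydration helper in Python (not Lily's actual mapping code): a reader keeps only the fields its version knows about, and treats fields the row predates as absent.

```python
def hydrate(known_fields, stored_row):
    """Toy schema-evolution read: keep only the fields this code version
    knows, defaulting fields the stored row predates to None, and
    silently skipping newer fields the reader has never heard of."""
    return {f: stored_row.get(f) for f in known_fields}

# Version 1 wrote rows with only an email; version 2 added phone.
v1_row = {"email": "a@example.com"}
v2_row = {"email": "a@example.com", "phone": "+33 1 23 45 67 89"}

# Old code (knows email only) can read new data without breaking...
old_view = hydrate(["email"], v2_row)
# ...and new code can read old data, simply seeing phone as missing.
new_view = hydrate(["email", "phone"], v1_row)
```

This tolerance in both directions is what makes rolling deployments safe: during a rollout, old and new application versions read the same table concurrently without a migration step.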
Production Patterns and Anti-Patterns
Bruno Guedes shared war stories from production:
– Hotspot avoidance: Never use auto-incrementing IDs. Use hashed or UUID-based keys.
– Index explosion: @LilyFullText on large fields → Solr bloat. Use @LilyField(indexed = true) for structured search.
– Compaction storms: Schedule major compactions during low traffic.
– Zookeeper tuning: Increase tick time for large clusters.
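The hotspot-avoidance advice can be made concrete with a short Python sketch of salted row keys. The two-character prefix length and the use of MD5 are arbitrary choices for illustration; any stable hash works.

```python
import hashlib

def hashed_key(natural_id: str, prefix_len: int = 2) -> str:
    """Salt a sequential ID with a short hash prefix so consecutive writes
    scatter across the key space instead of hammering the last region."""
    salt = hashlib.md5(natural_id.encode()).hexdigest()[:prefix_len]
    return f"{salt}-{natural_id}"

sequential = [str(i) for i in range(1000, 1010)]
scattered = sorted(hashed_key(i) for i in sequential)
# Sequential IDs 1000..1009 would all sort into one region; their salted
# forms spread out, so no single region absorbs the whole write burst.
```

The cost of this trick is that range scans over the natural ID order become scatter-gathers across salt buckets, which is why it should be reserved for write-heavy tables.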
The Lily Ecosystem in 2012
Lily shipped with:
– Lily CLI for schema inspection and cluster management.
– Lily Maven Plugin for deploying schemas.
– Lily SolrCloud Integration with automatic sharding.
– Lily Kafka Connect for streaming data ingestion.
Lily’s Legacy After 2018: Where the Ideas Live On
EDIT
Although Lily itself was archived in 2018, its core concepts continue to thrive in modern tools.
The original HBase POJO mapping is now embodied in Spring Data Hadoop.
Lily’s Solr integration has evolved into SolrJ + OpenSearch.
The repository pattern that Lily pioneered is carried forward by Spring Data R2DBC.
Schema evolution, once a key Lily feature, is now handled by Apache Atlas.
Finally, Lily’s near-real-time search capability lives on through the Elasticsearch Percolator.
Conclusion: Big Data Doesn’t Have to Be Hard
Steven Noels closed with a powerful message:
“Big data is not about MapReduce. It’s not about Zookeeper. It’s about solving business problems at scale. Lily proved that Java developers can do that—without becoming data engineers.”
EDIT:
In 2025, as lakehouse architectures, real-time analytics, and AI-driven search dominate, Lily’s vision of big data as a first-class Java citizen remains more relevant than ever.
[DevoxxFR2013] Between HPC and Big Data: A Case Study on Counterparty Risk Simulation
Lecturers
Jonathan Lellouche is a specialist in financial risk modeling, focusing on computational challenges at the intersection of high-performance computing and large-scale data processing.
Adrien Tay Pamart contributes to quantitative finance, developing simulations that balance accuracy with efficiency in volatile markets.
Abstract
Jonathan Lellouche and Adrien Tay Pamart examine counterparty risk simulation in market finance, where modeling asset variability demands intensive computation and massive data aggregation. They dissect a hybrid architecture blending HPC for Monte Carlo paths with big data tools for transactional updates and what-if analyses. Through volatility modeling, scenario generation, and incremental processing, they demonstrate achieving real-time insights amid petabyte-scale inputs. The study evaluates trade-offs in precision, latency, and cost, offering methodologies for similar domains requiring both computational depth and data agility.
Counterparty Risk Fundamentals: Temporal and Probabilistic Dimensions
Lellouche introduces counterparty risk as the potential loss from a trading partner’s default, amplified by market fluctuations. Simulation necessitates modeling time—forward projections of asset prices—and uncertainty via stochastic processes. Traditional approaches like Black-Scholes assume log-normal distributions, but real markets exhibit fat tails, requiring advanced techniques like Heston models for volatility smiles.
The computational burden arises from Monte Carlo methods: generating thousands of paths per instrument, each path a sequence of simulated prices. Pamart explains path dependence in instruments like barriers, where historical values influence payoffs, escalating memory and CPU demands.
Architectural Hybrid: Fusing HPC with Big Data Pipelines
The system partitions workloads: HPC clusters (CPU/GPU) compute raw scenarios; big data frameworks (Hadoop/Spark) aggregate and query. Lellouche details GPU acceleration for path generation, leveraging CUDA/OpenCL for parallel stochastic differential equations:
import numpy as np

def simulate_paths(S0, r, sigma, T, steps, paths):
    """Monte Carlo paths under geometric Brownian motion (log-Euler scheme).
    NumPy sketch; the GPU version maps increment generation and the
    cumulative sum onto CUDA/OpenCL kernels."""
    dt = T / steps
    dW = np.random.normal(0.0, np.sqrt(dt), (paths, steps))  # Brownian increments
    # Cumulate log-returns along each path, then exponentiate.
    S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * dW, axis=1))
    return S
Big data handles post-processing: MapReduce jobs compute exposures, aggregating across scenarios for expected positive exposure (EPE).
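The EPE aggregation step can itself be sketched in NumPy. The exposure model below (a forward-like position marked as the simulated price minus a strike) is a deliberate simplification for illustration; production systems reprice full portfolios per scenario before aggregating.

```python
import numpy as np

def expected_positive_exposure(paths, strike):
    """EPE profile: at each future date, average the positive part of the
    exposure across all simulated scenarios. The exposure of a simple
    forward-like position is approximated here as S_t - K."""
    exposure = paths - strike                            # per path, per date
    return np.mean(np.maximum(exposure, 0.0), axis=0)    # average over scenarios

# Illustrative driftless paths: 10,000 scenarios, 50 future dates.
rng = np.random.default_rng(42)
paths = 100.0 * np.exp(np.cumsum(0.01 * rng.standard_normal((10_000, 50)), axis=1))
epe = expected_positive_exposure(paths, strike=100.0)
# epe holds one value per future date, non-negative by construction.
```

In the hybrid architecture this reduction is exactly what the MapReduce side performs: the HPC side emits the `paths` array, the data side folds it into per-date exposure statistics.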
Incremental Processing and What-If Analysis: Efficiency in Volatility
Batch recomputation proves untenable for intraday updates. Pamart introduces incremental techniques: delta updates recompute only affected paths on market shifts. What-if simulations—hypothetical trades—leverage precomputed scenarios, overlaying perturbations.
This demands transactional big data stores like HBase for rapid inserts/queries. The duo analyzes latency: sub-second for deltas versus hours for full runs.
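At its core, the delta-update and what-if pattern reduces to merging a small set of repriced results into a cached book rather than recomputing everything. A minimal Python sketch, with hypothetical instrument names and exposure figures:

```python
def incremental_revalue(exposures, repriced):
    """Toy delta update: given cached per-instrument exposures and a small
    dict of freshly repriced instruments, merge instead of recomputing all.
    A what-if trade is just one more entry overlaid on the cached book."""
    updated = dict(exposures)   # leave the cached results untouched
    updated.update(repriced)    # overwrite only what actually changed
    return updated

book = {"swap-1": 1.2e6, "option-7": 3.4e5, "fx-fwd-2": 8.0e4}
# An FX move affects only fx-fwd-2: reprice that instrument, keep the rest.
book = incremental_revalue(book, {"fx-fwd-2": 9.5e4})
# What-if: overlay a hypothetical new trade without touching stored results.
what_if = incremental_revalue(book, {"swap-new": 2.0e5})
```

The sub-second latencies cited for deltas follow from this asymmetry: the expensive Monte Carlo work is confined to the handful of repriced entries, while the store only has to serve fast point reads and writes.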
Volatility Modeling: From Simple Diffusions to Complex Stochastics
Basic Brownian motion suffices for equities but falters in options. Lellouche advocates local volatility models, calibrating to implied surfaces for accurate pricing. Calibration involves solving inverse problems, often via finite differences accelerated on GPUs.
Pamart warns of model risk: underestimating tails leads to underestimated exposures. Hybrid models blending stochastic volatility with jumps capture crises better.
Cost and Scalability Trade-offs: Cloud vs. On-Premises
On-premises clusters offer control but fixed costs; the cloud can absorb peak bursts. Spot instances could slash expenses for non-urgent simulations. The lecturers evaluate AWS EMR for MapReduce jobs and GPU instances for path generation.
Implications: hybrid clouds optimize, but data gravity—transferring terabytes—incurs latency and fees.
Future Directions: AI Integration and Regulatory Compliance
Emerging regulations (Basel III) mandate finer-grained simulations, amplifying data volumes. Lellouche speculates on ML for path reduction or anomaly detection.
The case underscores HPC-big data synergy: computation generates insights; data platforms deliver them actionably.