
[DevoxxUK2025] The Art of Structuring Real-Time Data Streams into Actionable Insights

At DevoxxUK2025, Olena Kutsenko, a data streaming expert from Confluent, delivered a compelling session on transforming chaotic real-time data streams into structured, actionable insights using Apache Kafka, Apache Flink, and Apache Iceberg. Through practical demos involving IoT devices and social media data, Olena demonstrated how to build scalable, low-latency data pipelines that ensure high data quality and flexibility for downstream analytics and AI applications. Her talk highlighted the power of combining these open-source technologies to handle messy, high-volume data streams, making them accessible for querying, visualization, and decision-making.

Apache Kafka: The Scalable Message Bus

Olena introduced Apache Kafka as the foundation for handling high-speed data streams: a scalable message bus that decouples data producers (e.g., IoT devices) from consumers. Kafka’s design, with topics and partitions likened to multi-lane roads, ensures high throughput and low latency. In her IoT demo, Olena used a JavaScript producer to ingest sensor data (temperature, battery levels) into a Kafka topic, deliberately including messy data with duplicates and missing sensor IDs. Because Kafka replicates data and retains it for a configurable period, streams can be reliably reprocessed when needed, which makes it a fit for industries like banking and retail; Olena cited REWE’s use of Kafka to process sold items.
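The per-key routing behind those "multi-lane roads" can be sketched in a few lines of pure Python. This is a simplified model, not the actual Kafka client: real producers use murmur2 hashing on the message key, while the hash below is merely illustrative. The point it demonstrates is that all messages sharing a key (here, a sensor ID) land on the same partition, preserving per-sensor ordering.

```python
# Simplified sketch of Kafka's keyed partitioning: messages with the same
# key always map to the same partition, so per-key order is preserved.
# Real Kafka clients use murmur2; md5 here is just a stand-in hash.
import hashlib

NUM_PARTITIONS = 6

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Hypothetical sensor readings, mirroring the IoT demo's shape.
readings = [
    {"sensor_id": "s-1", "temperature": 21.5},
    {"sensor_id": "s-2", "temperature": 19.8},
    {"sensor_id": "s-1", "temperature": 21.7},
]

# A topic modeled as one list ("lane") per partition.
topic = {p: [] for p in range(NUM_PARTITIONS)}
for msg in readings:
    topic[partition_for(msg["sensor_id"])].append(msg)
```

Because routing is deterministic per key, a consumer reading one partition sees every event for its sensors in production order, which is what lets downstream processing scale out without reordering.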

Apache Flink: Real-Time Data Processing

Apache Flink was showcased as the engine for cleaning and structuring Kafka streams in real time. Olena explained that Flink handles both unbounded (real-time) and bounded (historical) data, using SQL for transformations. In the IoT demo, she applied a ROW_NUMBER function to deduplicate records by sensor ID and timestamp, filtered out invalid data (e.g., null sensor IDs), and reformatted timestamps to include time zones. A 5-second watermark discarded late-arriving data, and a tumbling window aggregated records into one-minute buckets enriched with averages and standard deviations, producing clean, structured data ready for analysis.
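The pipeline above can be approximated in plain Python to make the semantics concrete. This is a batch simulation of what Flink does continuously, with invented field names (`sensor_id`, `ts`, `temp`) standing in for the demo's schema: filter null IDs, deduplicate, drop events that arrive behind a 5-second watermark, then aggregate one-minute tumbling windows.

```python
# Batch sketch of the demo's Flink SQL steps: filter, dedup, watermark,
# one-minute tumbling window with average and standard deviation.
from collections import defaultdict
from statistics import mean, pstdev

WATERMARK_S = 5  # events more than 5s behind observed event time are dropped

def clean_and_window(events):
    seen, kept = set(), []
    watermark = float("-inf")
    for e in events:                       # events in arrival order
        if e["sensor_id"] is None:
            continue                       # filter invalid rows
        key = (e["sensor_id"], e["ts"])
        if key in seen:
            continue                       # deduplicate (ROW_NUMBER = 1)
        seen.add(key)
        if e["ts"] < watermark:
            continue                       # late arrival: behind the watermark
        watermark = max(watermark, e["ts"] - WATERMARK_S)
        kept.append(e)
    windows = defaultdict(list)
    for e in kept:                         # bucket by one-minute window
        windows[(e["sensor_id"], e["ts"] // 60)].append(e["temp"])
    return {k: {"avg": mean(v), "stddev": pstdev(v), "count": len(v)}
            for k, v in windows.items()}

events = [
    {"sensor_id": "s-1", "ts": 100, "temp": 20.0},
    {"sensor_id": "s-1", "ts": 100, "temp": 20.0},  # duplicate, dropped
    {"sensor_id": None,  "ts": 101, "temp": 99.0},  # null ID, dropped
    {"sensor_id": "s-1", "ts": 110, "temp": 22.0},
    {"sensor_id": "s-1", "ts": 50,  "temp": 18.0},  # late, behind watermark
]
result = clean_and_window(events)
```

In real Flink the watermark also decides when a window may close and emit; this sketch only shows its role as a cutoff for late data.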

Apache Iceberg: Structured Storage for Analytics

Olena introduced Apache Iceberg as an open table format that brings data warehouse-like structure to data lakes. Developed at Netflix to address Apache Hive’s limitations, Iceberg ensures atomic transactions and schema evolution without rewriting data. Its metadata layer, including manifest files and snapshots, supports time travel and efficient querying. In the demo, Flink’s processed data was written to Iceberg-compatible Kafka topics using Confluent’s Kora engine, eliminating extra migrations. Iceberg’s structure enabled fast queries and versioning, critical for analytics and compliance in regulated environments.
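The snapshot mechanism behind time travel can be illustrated with a toy model. This is not the PyIceberg API, just a minimal sketch of the idea: every atomic commit appends an immutable snapshot listing the table's data files, and reading an older snapshot is all "time travel" requires; real Iceberg persists this in metadata and manifest files.

```python
# Toy model of Iceberg snapshots: each commit records an immutable file
# list, and time travel is simply scanning an earlier snapshot.
import time

class ToyIcebergTable:
    def __init__(self):
        self.snapshots = []  # append-only snapshot history

    def commit(self, data_files):
        parent = self.snapshots[-1]["files"] if self.snapshots else []
        self.snapshots.append({
            "snapshot_id": len(self.snapshots) + 1,
            "timestamp": time.time(),
            "files": parent + list(data_files),  # atomic, immutable commit
        })

    def scan(self, snapshot_id=None):
        """Read the current snapshot, or an older one (time travel)."""
        snap = (self.snapshots[-1] if snapshot_id is None
                else self.snapshots[snapshot_id - 1])
        return snap["files"]

table = ToyIcebergTable()
table.commit(["batch-1.parquet"])
table.commit(["batch-2.parquet"])
```

Because old snapshots are never mutated, a query pinned to `snapshot_id=1` is reproducible forever, which is the property that makes versioned analytics and compliance audits tractable.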

Querying and Visualization with Trino and Superset

To make data actionable, Olena used Trino, a distributed query engine, to run fast queries on Iceberg tables, and Apache Superset for visualization. In the IoT demo, Superset visualized temperature and humidity distributions, highlighting outliers. In a playful social media demo using Bluesky data, Olena enriched posts with sentiment analysis (positive, negative, neutral) and category classification via a GPT-3.5 Turbo model invoked from Flink. Superset dashboards displayed author activity and sentiment distributions, demonstrating how structured data enables intuitive insights for non-technical users.

Ensuring Data Integrity and Scalability

Addressing audience questions, Olena explained Flink’s exactly-once processing guarantee, using watermarks and snapshots to ensure data integrity, even during failures. Kafka’s retention policies allow reprocessing, critical for regulatory compliance, though she noted custom solutions are often needed for audit evidence in financial sectors. Flink’s parallel processing scales effectively with Kafka’s partitioned topics, handling high-volume data without bottlenecks, making the pipeline robust for dynamic workloads like IoT or fraud detection in banking.


[DevoxxFR2025] Spark 4 and Iceberg: The New Standard for All Your Data Projects

The world of big data is constantly evolving, with new technologies emerging to address the challenges of managing and processing ever-increasing volumes of data. Apache Spark has long been a dominant force in big data processing, and its evolution continues with Spark 4. Complementing this is Apache Iceberg, a modern table format that is rapidly becoming the standard for managing data lakes. Pierre Andrieux from Capgemini and Houssem Chihoub from Databricks joined forces to demonstrate how the combination of Spark 4 and Iceberg is set to revolutionize data projects, offering improved performance, enhanced data management capabilities, and a more robust foundation for data lakes.

Spark 4: Boosting Performance and Data Lake Support

Pierre and Houssem highlighted the major new features and enhancements in Apache Spark 4. A key area of improvement is performance, with a new query engine and automatic query optimization designed to accelerate data processing workloads. Spark 4 also brings enhanced native support for data lakes, simplifying interactions with data stored in formats like Parquet and ORC on distributed file systems. This tighter integration improves efficiency and reduces the need for external connectors or complex configurations. The presentation included performance comparisons illustrating the gains achieved with Spark 4, particularly when working with large datasets in a data lake environment.

Apache Iceberg Demystified: A Next-Generation Table Format

Apache Iceberg addresses the limitations of traditional table formats used in data lakes. Houssem demystified Iceberg, explaining that it provides a layer of abstraction on top of data files, bringing database-like capabilities to data lakes. Key features of Iceberg include:
Time Travel: The ability to query historical snapshots of a table, enabling reproducible reports and simplified data rollbacks.
Schema Evolution: Support for safely evolving table schemas over time (e.g., adding, dropping, or renaming columns) without requiring costly data rewrites.
Hidden Partitioning: Iceberg derives partition values from column transforms and prunes partitions automatically, so queries benefit from partitioning without users manually managing the layout.
Atomic Commits: Ensures that changes to a table are atomic, providing reliability and consistency even in distributed environments.

These features solve many of the pain points associated with managing data lakes, such as schema management complexities, difficulty in handling updates and deletions, and lack of transactionality.
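Schema evolution without data rewrites deserves a concrete illustration. Iceberg tracks columns by stable field IDs in table metadata rather than by name or position, so adding a column is a metadata-only change and old files are simply read with nulls for the new field. The sketch below is a simplified model of that resolution step, not Iceberg's actual reader code.

```python
# Sketch of ID-based column resolution: adding a column to the schema is
# metadata-only, and files written under the old schema read back with
# None for the new field. Field IDs (1, 2, 3) are illustrative.
schema_v1 = [{"id": 1, "name": "sensor_id"}, {"id": 2, "name": "temp"}]
schema_v2 = schema_v1 + [{"id": 3, "name": "humidity"}]  # add column

# A row persisted under schema v1, keyed by field ID as in data files.
old_file_row = {1: "s-1", 2: 21.5}

def project(row, schema):
    # Resolve each column by field ID; IDs absent from the file (columns
    # added after the file was written) resolve to None.
    return {col["name"]: row.get(col["id"]) for col in schema}
```

The same ID-based indirection is what makes renaming a column safe: the name changes in metadata while every existing file keeps referring to the unchanged field ID.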

The Power of Combination: Spark 4 and Iceberg

The true power lies in combining the processing capabilities of Spark 4 with the data management features of Iceberg. Pierre and Houssem demonstrated through concrete use cases and practical demonstrations how this combination enables building modern data pipelines. They showed how Spark 4 can efficiently read from and write to Iceberg tables, leveraging Iceberg’s features like time travel for historical analysis or schema evolution for seamlessly integrating data with changing structures. The integration allows data engineers and data scientists to work with data lakes with greater ease, reliability, and performance, making this combination a compelling new standard for data projects. The talk covered best practices for implementing data pipelines with Spark 4 and Iceberg and discussed potential pitfalls to avoid, providing attendees with the knowledge to leverage these technologies effectively in their own data initiatives.
