
[DevoxxUK2025] The Art of Structuring Real-Time Data Streams into Actionable Insights

At DevoxxUK2025, Olena Kutsenko, a data streaming expert from Confluent, delivered a compelling session on transforming chaotic real-time data streams into structured, actionable insights using Apache Kafka, Apache Flink, and Apache Iceberg. Through practical demos involving IoT devices and social media data, Olena demonstrated how to build scalable, low-latency data pipelines that ensure high data quality and flexibility for downstream analytics and AI applications. Her talk highlighted the power of combining these open-source technologies to handle messy, high-volume data streams, making them accessible for querying, visualization, and decision-making.

Apache Kafka: The Scalable Message Bus

Olena introduced Apache Kafka as the foundation for handling high-speed data streams: a scalable message bus that decouples data producers (e.g., IoT devices) from consumers. Kafka’s design, with topics and partitions likened to multi-lane roads, delivers high throughput at low latency. In her IoT demo, Olena used a JavaScript producer to ingest sensor data (temperature, battery levels) into a Kafka topic, deliberately including messy records with duplicates or missing sensor IDs. Because Kafka replicates data and retains it for a configurable period, streams can be reprocessed if needed, which makes it a good fit for industries like banking and retail; Olena cited REWE’s use of Kafka for processing sold items.
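
To picture the raw stream, the topic can be declared as a Flink SQL source table. The sketch below is hypothetical (topic, field, and broker names are assumptions, not from the talk) and already declares the five-second watermark discussed in the next section:

```sql
-- Hypothetical Flink SQL DDL exposing the raw Kafka topic as a table.
-- Topic, field, and broker names are illustrative assumptions.
CREATE TABLE sensor_readings (
  sensor_id   STRING,
  temperature DOUBLE,
  battery     DOUBLE,
  ts          TIMESTAMP(3),
  proc_time AS PROCTIME(),                      -- ingestion-time attribute, used below for deduplication
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  -- tolerate up to 5 s of out-of-order data
) WITH (
  'connector' = 'kafka',
  'topic' = 'iot.sensor.readings',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);
```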

Apache Flink: Real-Time Data Processing

Apache Flink was showcased as the engine for cleaning and structuring Kafka streams in real time. Olena explained that Flink handles both unbounded (real-time) and bounded (historical) data, using SQL for transformations. In the IoT demo, she applied the ROW_NUMBER() window function to deduplicate records by sensor ID and timestamp, filtered out invalid data (e.g., null sensor IDs), and reformatted timestamps to include time zones. A watermark with a five-second delay tolerated out-of-order events while discarding anything arriving later, and a tumbling window aggregated the stream into one-minute buckets enriched with averages and standard deviations, producing clean, structured data ready for analysis.
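
The exact statements from the demo weren’t published, but the standard Flink SQL patterns she described look roughly like this (table and column names carried over from the hypothetical source table above):

```sql
-- Deduplicate by (sensor_id, ts): keep only the first record observed,
-- and drop rows with a missing sensor ID.
CREATE TEMPORARY VIEW clean_readings AS
SELECT sensor_id, temperature, battery, ts
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY sensor_id, ts
           ORDER BY proc_time ASC) AS rn
  FROM sensor_readings
  WHERE sensor_id IS NOT NULL
)
WHERE rn = 1;

-- Aggregate into one-minute tumbling windows, enriched with
-- averages and standard deviations.
CREATE TEMPORARY VIEW sensor_stats AS
SELECT sensor_id,
       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
       AVG(temperature)        AS avg_temperature,
       STDDEV_POP(temperature) AS stddev_temperature
FROM clean_readings
GROUP BY sensor_id, TUMBLE(ts, INTERVAL '1' MINUTE);
```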

Apache Iceberg: Structured Storage for Analytics

Olena introduced Apache Iceberg as an open table format that brings data-warehouse-like structure to data lakes. Developed at Netflix to address Apache Hive’s limitations, Iceberg provides atomic transactions and schema evolution without rewriting data. Its metadata layer, including manifest files and snapshots, supports time travel and efficient querying. In the demo, Flink’s processed output was written to Iceberg-backed Kafka topics via Confluent’s Kora engine, avoiding a separate migration step. Iceberg’s structure enabled fast queries and versioning, critical for analytics and compliance in regulated environments.
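
In the demo this hand-off was handled by Confluent’s managed integration, so no extra plumbing was needed. With open-source Flink, an equivalent sink could be sketched using Iceberg’s Flink catalog support; the catalog type, warehouse path, and table names below are assumptions:

```sql
-- Hypothetical Iceberg catalog backed by a Hadoop-style warehouse path.
CREATE CATALOG iceberg WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 's3://demo-bucket/warehouse'
);

CREATE TABLE IF NOT EXISTS iceberg.iot.sensor_stats (
  sensor_id          STRING,
  window_start       TIMESTAMP(3),
  avg_temperature    DOUBLE,
  stddev_temperature DOUBLE
);

-- Continuously materialize the windowed aggregates into Iceberg.
INSERT INTO iceberg.iot.sensor_stats
SELECT sensor_id, window_start, avg_temperature, stddev_temperature
FROM sensor_stats;
```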

Querying and Visualization with Trino and Superset

To make the data actionable, Olena used Trino, a distributed query engine, to run fast queries over the Iceberg tables, and Apache Superset for visualization. In the IoT demo, Superset charted temperature and humidity distributions, highlighting outliers. In a playful social media demo built on Bluesky data, Olena enriched posts with sentiment analysis (positive, negative, neutral) and category classification from a GPT-3.5 Turbo model invoked through Flink. Superset dashboards displayed author activity and sentiment distributions, showing how structured data yields intuitive insights for non-technical users.
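
As an illustration of what “actionable” looks like, a Trino session over the (assumed) Iceberg table from the previous sketch might run an ad-hoc aggregate, then use Iceberg’s time travel, which Trino exposes via FOR TIMESTAMP AS OF:

```sql
-- Ad-hoc Trino query over the Iceberg table (names as above, assumed).
SELECT sensor_id,
       avg(avg_temperature) AS daily_avg
FROM iceberg.iot.sensor_stats
WHERE window_start >= current_date
GROUP BY sensor_id
ORDER BY daily_avg DESC;

-- Iceberg time travel: query the table as of an earlier point in time.
SELECT count(*)
FROM iceberg.iot.sensor_stats
  FOR TIMESTAMP AS OF TIMESTAMP '2025-05-01 00:00:00 UTC';
```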

Ensuring Data Integrity and Scalability

Addressing audience questions, Olena explained Flink’s exactly-once processing guarantee, which relies on distributed state snapshots (checkpoints) to preserve data integrity even during failures. Kafka’s retention policies allow reprocessing, which is critical for regulatory compliance, though she noted that custom solutions are often needed for audit evidence in financial sectors. Flink’s parallel processing scales with Kafka’s partitioned topics, handling high-volume data without bottlenecks and keeping the pipeline robust for dynamic workloads like IoT telemetry or fraud detection in banking.
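
For reference, the exactly-once and parallelism knobs she alluded to can be set straight from the Flink SQL client; the values below are illustrative assumptions, not figures from the talk:

```sql
-- Enable exactly-once checkpointing from the Flink SQL client
-- (interval and parallelism values are arbitrary assumptions).
SET 'execution.checkpointing.interval' = '30 s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
SET 'parallelism.default' = '4';  -- scale out alongside the topic's partitions
```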

Links: