Posts Tagged ‘DataPipelines’

[DevoxxUK2025] The Art of Structuring Real-Time Data Streams into Actionable Insights

At DevoxxUK2025, Olena Kutsenko, a data streaming expert from Confluent, delivered a compelling session on transforming chaotic real-time data streams into structured, actionable insights using Apache Kafka, Apache Flink, and Apache Iceberg. Through practical demos involving IoT devices and social media data, Olena demonstrated how to build scalable, low-latency data pipelines that ensure high data quality and flexibility for downstream analytics and AI applications. Her talk highlighted the power of combining these open-source technologies to handle messy, high-volume data streams, making them accessible for querying, visualization, and decision-making.

Apache Kafka: The Scalable Message Bus

Olena introduced Apache Kafka as the foundation for handling high-speed data streams: a scalable message bus that decouples data producers (e.g., IoT devices) from consumers. Kafka’s design, with topics and partitions likened to multi-lane roads, delivers high throughput and low latency. In her IoT demo, Olena used a JavaScript producer to ingest sensor readings (temperature, battery levels) into a Kafka topic, including messy records with duplicates or missing sensor IDs. Because Kafka replicates data and retains it for a configurable period, streams remain reliable and can be reprocessed when needed, which makes it a fit for industries like banking and retail; the German retailer REWE, for example, uses Kafka to process sold-item events.
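
The topic-and-partition idea can be sketched in a few lines of pure Python. This is an illustration only (no Kafka client involved; `partition_for` and the sample records are invented for the sketch): a keyed record is deterministically hashed to a partition, so all readings from one sensor stay in order on the same "lane", much as Kafka's default partitioner does with murmur2.

```python
import hashlib
import json

NUM_PARTITIONS = 6  # like lanes on a multi-lane road

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to a partition.
    Kafka's default partitioner hashes the key (murmur2); md5 stands in
    for the same idea here."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Messy IoT readings: a duplicate and a missing sensor id, as in the demo.
readings = [
    {"sensor_id": "s-1", "temp": 21.4, "battery": 87},
    {"sensor_id": "s-1", "temp": 21.4, "battery": 87},  # duplicate
    {"sensor_id": None, "temp": 19.0, "battery": 55},   # missing key
]

for r in readings:
    key = r["sensor_id"] or "unknown"
    p = partition_for(key)
    print(f"topic=iot-sensors partition={p} value={json.dumps(r)}")
```

Records sharing a key always land on the same partition, which is what preserves per-sensor ordering downstream.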

Apache Flink: Real-Time Data Processing

Apache Flink was showcased as the engine for cleaning and structuring Kafka streams in real time. Olena explained that Flink handles both unbounded (real-time) and bounded (historical) data and exposes SQL for transformations. In the IoT demo, she used a ROW_NUMBER() window function to deduplicate records by sensor ID and timestamp, filtered out invalid rows (e.g., null sensor IDs), and reformatted timestamps to include time zones. A watermark discarded data arriving more than five seconds late, and a tumbling window aggregated readings into one-minute buckets enriched with averages and standard deviations, yielding clean, structured data ready for analysis.
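
The cleaning steps above can be sketched in plain Python. This is not Flink (the real pipeline runs as Flink SQL over a stream); it is a minimal batch imitation of the same logic, with `process` and the sample events invented for illustration: drop null sensor IDs, deduplicate on (sensor_id, timestamp), then bucket into tumbling windows with mean and standard deviation.

```python
from statistics import mean, pstdev

def process(events, window_seconds=60):
    """Batch sketch of the Flink pipeline: filter invalid rows, deduplicate
    by (sensor_id, timestamp), then aggregate into tumbling windows."""
    seen = set()
    clean = []
    for e in events:
        if e["sensor_id"] is None:
            continue                      # filter out invalid rows
        key = (e["sensor_id"], e["ts"])
        if key in seen:
            continue                      # ROW_NUMBER()-style dedup: keep first
        seen.add(key)
        clean.append(e)

    windows = {}
    for e in clean:
        bucket = (e["sensor_id"], e["ts"] // window_seconds)  # tumbling window
        windows.setdefault(bucket, []).append(e["temp"])

    return {
        bucket: {"avg": mean(vals), "stddev": pstdev(vals), "count": len(vals)}
        for bucket, vals in windows.items()
    }

events = [
    {"sensor_id": "s-1", "ts": 10, "temp": 20.0},
    {"sensor_id": "s-1", "ts": 10, "temp": 20.0},  # duplicate
    {"sensor_id": "s-1", "ts": 30, "temp": 22.0},
    {"sensor_id": None,  "ts": 40, "temp": 99.0},  # invalid: dropped
]
print(process(events))  # one 60s bucket for s-1: avg 21.0, stddev 1.0, count 2
```

A real Flink job does the same per-key work continuously, with the watermark deciding when a window can close despite possible late arrivals.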

Apache Iceberg: Structured Storage for Analytics

Olena introduced Apache Iceberg as an open table format that brings data-warehouse-like structure to data lakes. Developed at Netflix to address Apache Hive’s limitations, Iceberg supports atomic transactions and schema evolution without rewriting data. Its metadata layer, including manifest files and snapshots, enables time travel and efficient querying. In the demo, Flink’s processed data was written to Iceberg-compatible Kafka topics using Confluent’s Kora engine, eliminating extra migration steps. Iceberg’s structure enabled fast queries and versioning, critical for analytics and compliance in regulated environments.
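
The snapshot mechanism behind time travel can be illustrated with a toy in-memory table. This is a deliberately simplified sketch (the `SnapshotTable` class is invented; real Iceberg tracks snapshots in metadata files pointing at immutable manifests): every commit creates a new snapshot ID, and a read can target any earlier snapshot.

```python
class SnapshotTable:
    """Toy illustration of Iceberg-style snapshots: each commit records an
    immutable snapshot, so reads can 'time travel' to any earlier state."""
    def __init__(self):
        self._snapshots = []   # list of (snapshot_id, rows) pairs

    def commit(self, rows):
        prev = self._snapshots[-1][1] if self._snapshots else []
        snapshot_id = len(self._snapshots)
        # A commit never mutates old data; it appends a new table state.
        self._snapshots.append((snapshot_id, prev + list(rows)))
        return snapshot_id

    def scan(self, snapshot_id=None):
        if not self._snapshots:
            return []
        if snapshot_id is None:
            snapshot_id = self._snapshots[-1][0]  # default: latest state
        return self._snapshots[snapshot_id][1]

t = SnapshotTable()
s0 = t.commit([{"sensor": "s-1", "avg": 21.0}])
s1 = t.commit([{"sensor": "s-2", "avg": 18.5}])
print(len(t.scan(s0)), len(t.scan()))  # 1 row as-of s0, 2 rows at the head
```

Because old snapshots stay readable, an auditor can re-run a query exactly as the table looked at an earlier point, which is the compliance angle Olena highlighted.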

Querying and Visualization with Trino and Superset

To make data actionable, Olena used Trino, a distributed query engine, to run fast queries on Iceberg tables, and Apache Superset for visualization. In the IoT demo, Superset visualized temperature and humidity distributions, highlighting outliers. In a playful social media demo on Bluesky data, Olena enriched posts with sentiment analysis (positive, negative, neutral) and category classification using a GPT-3.5 Turbo model called from Flink. Superset dashboards displayed author activity and sentiment distributions, demonstrating how structured data enables intuitive insights for non-technical users.
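
The aggregation feeding such a dashboard is a simple group-and-count. A minimal sketch (the `sentiment_distribution` helper and sample posts are invented; in the demo the equivalent is a SQL GROUP BY run by Trino over the enriched Iceberg table):

```python
from collections import Counter

def sentiment_distribution(posts):
    """Tally posts by sentiment label, as a dashboard's sentiment chart would.
    The labels are assumed to come from the upstream LLM enrichment step."""
    return Counter(p["sentiment"] for p in posts)

posts = [
    {"author": "a", "sentiment": "positive"},
    {"author": "b", "sentiment": "negative"},
    {"author": "a", "sentiment": "positive"},
    {"author": "c", "sentiment": "neutral"},
]
print(sentiment_distribution(posts))
# Counter({'positive': 2, 'negative': 1, 'neutral': 1})
```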

Ensuring Data Integrity and Scalability

Addressing audience questions, Olena explained Flink’s exactly-once processing guarantee, using watermarks and snapshots to ensure data integrity, even during failures. Kafka’s retention policies allow reprocessing, critical for regulatory compliance, though she noted custom solutions are often needed for audit evidence in financial sectors. Flink’s parallel processing scales effectively with Kafka’s partitioned topics, handling high-volume data without bottlenecks, making the pipeline robust for dynamic workloads like IoT or fraud detection in banking.
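
One way to see why replays after a failure do not corrupt results is an idempotent sink. This sketch is not Flink's actual checkpointing protocol (which snapshots operator state and replays from Kafka offsets); the `IdempotentSink` class is invented to show the end effect: a replayed record is recognized by its offset and applied only once.

```python
class IdempotentSink:
    """Sketch of the 'effectively once' idea: remember the highest offset
    already applied per partition, so replays after a crash are skipped."""
    def __init__(self):
        self.applied_offset = {}   # partition -> last applied offset
        self.totals = {}           # key -> running sum

    def apply(self, partition, offset, key, value):
        if offset <= self.applied_offset.get(partition, -1):
            return False           # already applied: ignore the replay
        self.totals[key] = self.totals.get(key, 0) + value
        self.applied_offset[partition] = offset
        return True

sink = IdempotentSink()
sink.apply(0, 0, "s-1", 10)
sink.apply(0, 1, "s-1", 5)
sink.apply(0, 1, "s-1", 5)   # replayed after a simulated failure: ignored
print(sink.totals["s-1"])    # 15, not 20
```

Kafka's retention is what makes the replay possible at all: as long as the raw records are still in the topic, a consumer can rewind and reprocess without double-counting.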


[DevoxxPL2022] Before It’s Too Late: Finding Real-Time Holes in Data • Chayim Kirshen

Chayim Kirshen, a veteran of the startup ecosystem and client manager at Redis, captivated audiences at Devoxx Poland 2022 with a dynamic exploration of real-time data pipeline challenges. Drawing from his experience with high-stakes environments, including a 2010 stock exchange meltdown, Chayim outlined strategies to ensure data integrity and performance in large-scale systems. His talk provided actionable insights for developers, emphasizing the importance of storing raw data, parsing in real time, and leveraging technologies like Redis to address data inconsistencies.

The Perils of Unclean Data

Chayim began with a stark reality: data is rarely clean. Recounting a 2010 incident where hackers compromised a major stock exchange’s API, he highlighted the cascading effects of unreliable data on real-time markets. Data pipelines face inconsistent formats (CSV, JSON, XML), changing sources (e.g., shifting API endpoints), and unreliable services, with modern systems often accumulating over a thousand minutes of downtime annually. These problems disrupt real-time processing, which is critical for applications like stock exchanges or ad-bidding networks requiring sub-100ms responses. Chayim advocated treating data as programmable code, enabling developers to address issues systematically rather than reactively.
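
Treating data as code starts with normalizing whatever arrives into one canonical schema. A minimal sketch, assuming a hypothetical stock-tick feed (the `normalize` function and the field names are invented for illustration), showing CSV, JSON, and XML payloads all collapsing to the same record:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def normalize(raw: str, fmt: str) -> dict:
    """Parse CSV, JSON, or XML payloads into one canonical schema, so
    downstream steps never see format differences."""
    if fmt == "json":
        rec = json.loads(raw)
        return {"symbol": rec["symbol"], "price": float(rec["price"])}
    if fmt == "csv":
        row = next(csv.reader(io.StringIO(raw)))
        return {"symbol": row[0], "price": float(row[1])}
    if fmt == "xml":
        node = ET.fromstring(raw)
        return {"symbol": node.findtext("symbol"),
                "price": float(node.findtext("price"))}
    raise ValueError(f"unknown format: {fmt}")

print(normalize('{"symbol": "GOOG", "price": "181.5"}', "json"))
print(normalize("GOOG,181.5", "csv"))
print(normalize("<tick><symbol>GOOG</symbol><price>181.5</price></tick>", "xml"))
```

All three calls yield the same dictionary, which is the point: format quirks are handled once, at the edge of the pipeline.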

Building Robust Data Pipelines

To tackle these issues, Chayim proposed a structured approach to data pipeline design. Storing raw data indefinitely—whether in S3, Redis, or other storage—ensures a fallback for reprocessing. Parsing data in real time, using defined schemas, allows immediate usability while preserving raw inputs. Bulk changes, such as SQL bulk inserts or Redis pipelines, reduce network overhead, critical for high-throughput systems. Chayim emphasized scheduling regular backfills to re-import historical data, ensuring consistency despite source changes. For example, a stock exchange’s ticker symbol updates (e.g., Fitbit to Google) require ongoing reprocessing to maintain accuracy. Horizontal scaling, using disposable nodes, enhances availability and resilience, avoiding single points of failure.
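
Two of these principles, store raw before parsing and write in bulk, can be sketched together. The stores, the `ingest` helper, and the batch size below are invented stand-ins (in practice the raw store would be S3 or Redis and the bulk write a SQL bulk insert or Redis pipeline):

```python
import json

RAW_STORE = []      # stand-in for S3/Redis: raw payloads kept indefinitely
PARSED_STORE = []   # stand-in for the queryable, schema'd table

def ingest(raw_payload: str) -> dict:
    """Store the raw payload first, then parse. The raw copy is the
    fallback for backfills if the schema later turns out to be wrong."""
    RAW_STORE.append(raw_payload)
    rec = json.loads(raw_payload)
    return {"symbol": rec["symbol"], "price": float(rec["price"])}

def bulk_write(records, batch_size=100) -> int:
    """Write in batches, like SQL bulk inserts or Redis pipelines,
    to cut per-record network round trips. Returns the batch count."""
    batches = 0
    for i in range(0, len(records), batch_size):
        PARSED_STORE.extend(records[i:i + batch_size])
        batches += 1
    return batches

parsed = [ingest(f'{{"symbol": "FIT", "price": "{p}"}}') for p in range(250)]
n_batches = bulk_write(parsed)
print(n_batches, len(PARSED_STORE))  # 3 batches of up to 100, 250 rows stored
```

With 250 records and a batch size of 100, the pipeline makes 3 round trips instead of 250, which is where the throughput win comes from.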

Real-Time Enrichment and Redis Integration

Data enrichment, such as calculating stock bid-ask spreads or market cap changes, should occur post-ingestion to avoid slowing the pipeline. Chayim showcased Redis, particularly its Gears and JSON modules, for real-time data processing. Redis acts as a buffer, storing raw JSON and replicating it to traditional databases like PostgreSQL or MySQL. Using Redis Gears, developers can execute functions within the database, minimizing network costs and enabling rapid enrichment. For instance, calculating a stock’s daily percentage change can run directly in Redis, streamlining analytics. Chayim highlighted Python-based tools like Celery for scheduling backfills and enrichments, allowing asynchronous processing and failure retries without disrupting the main pipeline.
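
The "run the function where the data lives" idea can be shown with a toy store. This is not the RedisGears API (the `MiniStore` class and `apply` method are invented for illustration); it only demonstrates the shape of in-database enrichment: the client ships a function, not the data.

```python
class MiniStore:
    """Toy key-value store with a server-side function, echoing the
    RedisGears idea of running enrichment where the data lives."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def apply(self, key, fn):
        # Run fn against the stored value without shipping it to the client.
        self._data[key] = fn(self._data[key])
        return self._data[key]

store = MiniStore()
store.set("GOOG", {"open": 180.0, "last": 184.5})

def add_daily_change(rec):
    """Enrichment step: compute the daily percentage change in place."""
    rec["pct_change"] = round((rec["last"] - rec["open"]) / rec["open"] * 100, 2)
    return rec

print(store.apply("GOOG", add_daily_change)["pct_change"])  # 2.5
```

Because the enrichment runs post-ingestion and inside the store, the hot ingest path stays fast and the network carries only the small function call, not the record.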

Scaling and Future-Proofing

Chayim stressed horizontal scaling to distribute workloads geographically, placing data closer to users for low-latency access, as seen in ad networks. By using Redis for real-time writes and offloading to workers via Celery, developers can manage millions of daily entries, such as stock ticks, without performance bottlenecks. Scheduled backfills address data gaps, like API schema changes (e.g., integer to string conversions), by reprocessing raw data. This approach, combined with infrastructure-as-code tools like Terraform, ensures scalability and adaptability, allowing organizations to focus on business logic rather than data management overhead.
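
A backfill over raw data is what absorbs schema drift like an integer-to-string conversion. A minimal sketch (the `reprocess` and `parse` functions and the sample archive are invented; in practice the archive would live in S3 or Redis and the job would be scheduled via Celery): the current parser is re-run over old payloads so every record comes out in the new schema.

```python
import json

def reprocess(raw_payloads, parse):
    """Backfill: re-run the current parser over the raw archive so records
    ingested under an old schema come out in the new one."""
    return [parse(p) for p in raw_payloads]

def parse(payload: str) -> dict:
    """Current schema: string ticker, integer volume. Coerce both ways,
    since older payloads may have the types reversed."""
    rec = json.loads(payload)
    return {"ticker": str(rec["ticker"]), "volume": int(rec["volume"])}

archive = [
    '{"ticker": 12345, "volume": "900"}',   # old feed: numeric id, string volume
    '{"ticker": "GOOG", "volume": 1200}',   # current schema
]
migrated = reprocess(archive, parse)
print(migrated)
```

Since the raw payloads were never discarded, this job can be re-run whenever the schema changes again, with no data loss and no dependence on the original source still serving history.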
