
[DevoxxFR 2018] Apache Kafka: Beyond the Brokers – Exploring the Ecosystem

Apache Kafka is often recognized for its high-throughput, distributed messaging capabilities, but its power extends far beyond the brokers themselves. Florent Ramière from Confluent, a company that contributes significantly to Kafka’s development, presented a comprehensive tour of the Kafka ecosystem at DevoxxFR 2018. He aimed to showcase the array of open-source components that revolve around Kafka, enabling robust data integration, stream processing, and more.

Kafka Fundamentals and the Confluent Platform

Florent began with a quick refresher on Kafka’s core concept: an ordered, replayable log of messages (events) where consumers can read at their own pace from specific offsets. This design provides scalability, fault tolerance, and guaranteed ordering (within a partition), making it a cornerstone for event-driven architectures and handling massive data streams (Confluent sees clients handling up to 60 GB/s).
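The core idea can be sketched in a few lines of plain Java. This is a toy model of a single partition, not the Kafka client API: an append-only log in which each record's offset is its position, and consumers read from whatever offset they choose, so they can replay at will.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a single Kafka partition: an append-only, replayable log.
// Illustrative only -- this is not the Kafka client API.
public class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Producers append; a record's offset is its position in the log.
    public long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Consumers read from any offset, at their own pace, and can re-read
    // (replay) older records; ordering within the partition is preserved.
    public List<String> readFrom(long offset, int maxRecords) {
        List<String> batch = new ArrayList<>();
        for (long i = offset; i < records.size() && batch.size() < maxRecords; i++) {
            batch.add(records.get((int) i));
        }
        return batch;
    }

    public long endOffset() {
        return records.size();
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        log.append("order-created");
        log.append("order-paid");
        log.append("order-shipped");
        // A slow consumer at offset 0 and a caught-up one at offset 2
        // read independently from the same log.
        System.out.println(log.readFrom(0, 2)); // [order-created, order-paid]
        System.out.println(log.readFrom(2, 2)); // [order-shipped]
    }
}
```

Because consumers only track an offset, adding a new consumer (or replaying history after a bug fix) never disturbs the log or other readers, which is what makes this design a good fit for event-driven architectures.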

While a Kafka setup involves several components, such as brokers and ZooKeeper, the Confluent Platform offers tools to simplify getting started. The confluent CLI can start a local development environment with Kafka, ZooKeeper, KSQL (now ksqlDB), Schema Registry, and more with a single command. Docker images are also readily available for containerized deployments.
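As a rough command sketch (the CLI has been reorganized since the talk, so check your installed version's help):

```shell
# Start a local development stack (ZooKeeper, Kafka, Schema Registry,
# Connect, KSQL) with the Confluent CLI, as of the 2018-era platform:
confluent start

# Newer Confluent Platform releases move this under a subcommand:
# confluent local services start

# Check which services are running:
confluent status
```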

Kafka Connect: Bridging Kafka with External Systems

A significant part of the ecosystem is Kafka Connect, a framework for reliably streaming data between Kafka and other systems. Connectors act as sources (ingesting data into Kafka from databases, message queues, etc.) or sinks (exporting data from Kafka to data lakes, search indexes, analytics platforms, etc.). Florent highlighted the availability of numerous pre-built connectors for systems like JDBC databases, Elasticsearch, HDFS, S3, and Change Data Capture (CDC) tools.

He drew a parallel between Kafka Connect and Logstash, noting that while Logstash is excellent, Kafka Connect is designed as a distributed, fault-tolerant, and scalable service for these data integration tasks. Connectors are configured through a REST API, and transformations (e.g., filtering, renaming fields, anonymization) can be applied to messages within the Connect pipeline. This makes it a powerful tool for building data pipelines without writing extensive custom code.
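A connector is just a JSON document submitted to the Connect REST API (POST /connectors). The sketch below, with illustrative host, table, and field names, configures the Confluent JDBC source connector to poll a database table into a topic, masking one field along the way with the MaskField single-message transform that ships with Apache Kafka:

```json
{
  "name": "inventory-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db.example.com:5432/inventory",
    "connection.user": "connect",
    "connection.password": "secret",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "db-",
    "transforms": "mask",
    "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.mask.fields": "customer_email"
  }
}
```

With this in place, new rows in the `orders` table appear as messages on the `db-orders` topic, no custom producer code required.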

Stream Processing with Kafka Streams and ksqlDB

Once data is in Kafka, processing it in real time is often the next step. Kafka Streams is a client library for building stream processing applications directly in Java (or Scala). Unlike frameworks such as Spark or Flink, which often require separate processing clusters, Kafka Streams applications are standalone Java applications that read from Kafka, process data, and can write results back to Kafka or external systems. This simplifies deployment and monitoring. Kafka Streams provides a rich DSL for operations like filtering, mapping, joining streams and tables (a table in Kafka Streams is a view of the latest value for each key in a stream), windowing, and managing state, all with exactly-once processing semantics.
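The stream/table duality mentioned above can be sketched in plain Java, independent of the Kafka Streams API: replaying a changelog stream of key/value updates and keeping the last write per key yields exactly the "table" view.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the stream/table duality: a "table" is the latest
// value per key, obtained by replaying a stream of key/value updates.
// Illustrative only -- this is not the Kafka Streams API.
public class StreamTableDuality {
    record Update(String key, int value) {}

    // Fold the update stream into a table: last write per key wins.
    static Map<String, Integer> toTable(List<Update> stream) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Update u : stream) {
            table.put(u.key(), u.value());
        }
        return table;
    }

    public static void main(String[] args) {
        List<Update> stream = List.of(
            new Update("alice", 1),
            new Update("bob", 5),
            new Update("alice", 3)  // overwrites alice's earlier value
        );
        System.out.println(toTable(stream)); // {alice=3, bob=5}
    }
}
```

This is also why Kafka Streams can rebuild its state after a crash: the state is just a fold over a changelog topic that Kafka already stores.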

For those who prefer SQL to Java/Scala, ksqlDB (formerly known as KSQL) offers a SQL-like declarative language to define stream processing logic on top of Kafka topics. Users can create streams and tables from Kafka topics, perform continuous queries (SELECT statements that run indefinitely, emitting results as new data arrives), joins, and windowed aggregations, and write results to new Kafka topics. ksqlDB runs as a separate server and uses Kafka Streams internally. It also manages stateful operations by storing state in RocksDB and backing it up to Kafka topics for fault tolerance. Florent emphasized that while ksqlDB is powerful for many use cases, complex UDFs or very intricate logic may still be better suited to Kafka Streams directly.
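A minimal sketch of that workflow, with illustrative topic and column names (the syntax shown is ksqlDB's; early KSQL versions omit the EMIT CHANGES clause):

```sql
-- Declare a stream over an existing Kafka topic.
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- Continuous query: count views per user over tumbling 1-minute windows,
-- materialized as a table backed by a new Kafka topic.
CREATE TABLE views_per_minute AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY user_id
  EMIT CHANGES;
```

The second statement never terminates: it keeps running on the ksqlDB server, updating `views_per_minute` as new pageviews arrive.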

Schema Management and Other Essential Tools

When dealing with structured data in Kafka, especially in evolving systems, schema management becomes crucial. The Confluent Schema Registry helps manage and enforce schemas (typically Avro, but also Protobuf and JSON Schema) for messages in Kafka topics. It checks schema compatibility (e.g., backward, forward, full) as schemas evolve, preventing data quality issues and runtime errors in producers and consumers.

The ecosystem also includes smaller but essential tools. The Confluent REST Proxy allows non-JVM applications to produce and consume messages via HTTP. Kafka itself ships with command-line tools for performance testing (e.g., kafka-producer-perf-test, kafka-consumer-perf-test), latency checking, and inspecting consumer group lag, all vital for operations and troubleshooting. Effective monitoring, often built on the JMX metrics exposed by Kafka components and fed into systems like Prometheus via the JMX Exporter or Jolokia, is also critical for production deployments.
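As a sketch of what a backward-compatible evolution looks like (record and field names are illustrative): suppose version 1 of an Avro schema for an `Order` record had only `id` and `amount`. Adding a `currency` field with a default, as below, keeps the schema backward compatible, because consumers using the new schema can still read records written with the old one; the Schema Registry can verify this before accepting the new version.

```json
{
  "type": "record",
  "name": "Order",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "EUR"}
  ]
}
```

Removing a field without a default, or changing a field's type incompatibly, would be rejected under the same backward-compatibility setting.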

Florent concluded by encouraging exploration of the Confluent Platform demos and his “kafka-story” GitHub repository, which provide step-by-step examples of these ecosystem components.

Links:

Hashtags: #ApacheKafka #KafkaConnect #KafkaStreams #ksqlDB #Confluent #StreamProcessing #DataIntegration #DevoxxFR2018 #FlorentRamiere #EventDrivenArchitecture #Microservices #BigData
