Posts Tagged ‘Kafka’
[NDCOslo2024] Kafka for .NET Developers – Ian Cooper
In the torrent of event-driven ecosystems, where streams supplant silos and resilience reigns, Ian Cooper, a polyglot architect and maintainer of the Brighter messaging library, demystifies Kafka for .NET developers. As founder of London's #ldnug and a longtime messaging specialist, Ian unravels Kafka's enigma, from records, offsets, SerDes, and schemas through to nuanced integrations. His hour equips teams to embed Kafka as a backbone, pairing its brokers with the breadth of .NET for robust, reactive systems.
Ian dives straight in: Kafka is a distributed commit log that durably records changes for consumption, in contrast with the ephemeral delivery of traditional queues. Born at LinkedIn and open-sourced in 2011, it grew from log aggregation into a streaming platform, spawning Kafka Connect for integration and pairing naturally with Apache Flink for stream processing. Ian's framing: Kafka as a nervous system, not a notification nook; durable, decentralized, and ordered within each partition.
Unpacking the Pipeline: Kafka’s Primal Primitives
Kafka's corpus: topics act as ledgers, partitioned for parallelism and replicated for redundancy. Producers write records, key-value payloads with optional headers, with SerDes serializing strings or structured types. Consumers read by offset, and consumer groups coordinate partition assignment, enabling elastic scaling.
Ian surveys the on-ramps: Confluent Cloud for a managed experience, self-hosting for sovereignty. The .NET entry point is the Confluent.Kafka NuGet package, with IProducer for publishing and IConsumer for consuming. His handler boils down to await producer.ProduceAsync(topic, new Message&lt;string, string&gt; { Key = key, Value = serialized }).
Schemas safeguard the contract: a schema registry stores Avro or Protobuf definitions, embedding schema IDs in each message so formats can evolve. Ian's caveat: the registry's magic-byte framing can mean manual marshaling in .NET, yet compatibility rules curtail chaos.
Forging Flows: From Fundamentals to Flink Frontiers
The fundamentals flourish from there: idempotent producers preclude duplicates, and transactions tie writes across topics together. Ian's .NET nuance: transactions run through BeginTransaction and CommitTransaction on the producer. Exactly-once semantics, once the Java client's jewel, are increasingly within .NET's reach through the same transactional machinery.
Kafka Connect catalyzes integration: sink connectors push topics into SQL stores, while source connectors stream data in from files and other systems; Redpanda gets a nod as a Kafka-compatible alternative broker. Apache Flink forges further, parallelizing stream processing, though .NET's footprint there remains limited to the basics.
Ian's interlude: Brighter bridges the gap, abstracting over brokers so that a swap from RabbitMQ to Kafka requires no syntactic shifts.
Safeguarding Streams: Resilience and Realms
Resilience roots in replicas: the in-sync replica set (ISR) underwrites durability, and disabling unclean leader election averts data loss. Ian's imperative: tune retention, by time or by size, for traceability rather than an unbounded torrent.
His horizon: Kafka as a canvas for CQRS, where commands commit events and queries read projections, the engine of event sourcing.
[DevoxxBE2024] A Kafka Producer’s Request: Or, There and Back Again by Danica Fine
Danica Fine, a developer advocate at Confluent, took Devoxx Belgium 2024 attendees on a captivating journey through the lifecycle of a Kafka producer’s request. Her talk demystified the complex process of getting data into Apache Kafka, often treated as a black box by developers. Using a Hobbit-themed example, Danica traced a producer.send() call from client to broker and back, detailing configurations and metrics that impact performance and reliability. By breaking down serialization, partitioning, batching, and broker-side processing, she equipped developers with tools to debug issues and optimize workflows, making Kafka less intimidating and more approachable.
Preparing the Journey: Serialization and Partitioning
Danica began with a simple schema for tracking Hobbit whereabouts, stored in a topic with six partitions and a replication factor of three. The first step in producing data is serialization, converting objects into bytes for brokers, controlled by key and value serializers. Misconfigurations here can lead to errors, so monitoring serialization metrics is crucial. Next, partitioning determines which partition receives the data. The default partitioner uses a key’s hash or sticky partitioning for keyless records to distribute data evenly. Configurations like partitioner.class, partitioner.ignore.keys, and partitioner.adaptive.partitioning.enable allow fine-tuning, with adaptive partitioning favoring faster brokers to avoid hot partitions, especially in high-throughput scenarios like financial services.
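To ground these first steps, here is a minimal sketch of a Java producer wiring up the serializers and partitioning knobs Danica names; the broker address, topic, and Hobbit payload are illustrative, and the adaptive-partitioning properties assume a Kafka 3.3+ Java client.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HobbitProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Serialization: keys and values must become bytes before crossing the wire.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Partitioning knobs from the talk (Kafka 3.3+): keep hashing keys,
        // and let keyless batches favor faster brokers.
        props.put("partitioner.ignore.keys", "false");
        props.put("partitioner.adaptive.partitioning.enable", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the key's hash selects one of the topic's six partitions.
            producer.send(new ProducerRecord<>("hobbit-whereabouts", "frodo", "Rivendell"));
        }
    }
}
```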
Batching for Efficiency
To optimize throughput, Kafka groups records into batches before sending them to brokers. Danica explained key configurations: batch.size (default 16KB) sets the maximum batch size, while linger.ms (default 0) controls how long to wait to fill a batch. Setting linger.ms above zero introduces latency but reduces broker load by sending fewer requests. buffer.memory (default 32MB) allocates space for batches, and misconfigurations can cause memory issues. Metrics like batch-size-avg, records-per-request-avg, and buffer-available-bytes help monitor batching efficiency, ensuring optimal throughput without overwhelming the client.
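As a hedged illustration of those trade-offs, the overrides below could be layered onto the producer properties from the sketch above; the values are illustrative, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingTuning {
    // A larger batch plus a small linger means fewer, fuller produce requests,
    // at the cost of a few milliseconds of added latency.
    public static void tune(Properties props) {
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);            // default 16 KB
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);                    // default 0 ms
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024); // default 32 MB
    }
}
```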
Sending the Request: Configurations and Metrics
Once batched, data is sent via a produce request over TCP, with configurations like max.request.size (default 1MB) limiting batch volume and acks determining how many replicas must acknowledge the write. Setting acks to “all” ensures high durability but increases latency, while acks=1 or 0 prioritizes speed. enable.idempotence and transactional.id prevent duplicates, with transactions ensuring consistency across sessions. Metrics like request-rate, requests-in-flight, and request-latency-avg provide visibility into request performance, helping developers identify bottlenecks or overloaded brokers.
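The durability end of that spectrum can be sketched the same way; the transactional id below is a hypothetical name, and the send is illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableSend {
    // Wait for all in-sync replicas, deduplicate client retries, and make
    // the producer transactional so sends commit atomically.
    public static void configure(Properties props) {
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "hobbit-tracker-1"); // hypothetical id
    }

    public static void sendInTransaction(KafkaProducer<String, String> producer) {
        producer.initTransactions();
        producer.beginTransaction();
        producer.send(new ProducerRecord<>("hobbit-whereabouts", "sam", "Mordor"));
        producer.commitTransaction();
    }
}
```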
Broker-Side Processing: From Socket to Disk
On the broker, requests enter the socket receive buffer, then are processed by network threads (default 3) and added to the request queue. IO threads (default 8) validate data with a cyclic redundancy check and write it to the page cache, later flushing to disk. Configurations like num.network.threads, num.io.threads, and queued.max.requests control thread and queue sizes, with metrics like network-processor-avg-idle-percent and request-handler-avg-idle-percent indicating thread utilization. Data is stored in a commit log with log, index, and snapshot files, supporting efficient retrieval and idempotency. The log.flush.rate and local-time-ms metrics help verify that writes reach durable storage.
Replication and Response: Completing the Journey
Unfinished requests await replication in a “purgatory” data structure, with follower brokers fetching updates every 500ms (often faster). The remote-time-ms metric tracks replication duration, critical for acks=all. Once replicated, the broker builds a response, handled by network threads and queued in the response queue. Metrics like response-queue-time-ms and total-time-ms measure the full request lifecycle. Danica emphasized that understanding these stages empowers developers to collaborate with operators, tweaking configurations like default.replication.factor or topic-level settings to optimize performance.
Empowering Developers with Kafka Knowledge
Danica concluded by encouraging developers to move beyond treating Kafka as a black box. By mastering configurations and monitoring metrics, they can proactively address issues, from serialization errors to replication delays. Her talk highlighted resources like Confluent Developer for guides and courses on Kafka internals. This knowledge not only simplifies debugging but also fosters better collaboration with operators, ensuring robust, efficient data pipelines.
[DevoxxUK2024] Processing XML with Kafka Connect by Dale Lane
Dale Lane, a seasoned developer at IBM with a deep focus on event-driven architectures, delivered a compelling session at DevoxxUK2024, unveiling a powerful Kafka Connect plugin designed to streamline XML data processing. With extensive experience in Apache Kafka and Flink, Dale addressed the challenges of integrating XML data into Kafka pipelines, a task often fraught with complexity due to the format’s incompatibility with Kafka’s native data structures like Avro or JSON. His presentation offers practical solutions for developers seeking to bridge external systems with Kafka, transforming XML into more manageable formats or generating XML outputs for legacy systems. Through clear examples, Dale illustrates how this open-source plugin enhances flexibility and efficiency in Kafka Connect pipelines, empowering developers to handle diverse data integration scenarios with ease.
Understanding Kafka Connect Pipelines
Dale begins by demystifying Kafka Connect, a robust framework for moving data between Kafka and external systems. He outlines two primary pipeline types: source pipelines, which import data from external systems into Kafka, and sink pipelines, which export Kafka data to external destinations. A source pipeline typically involves a connector to fetch data, optional transformations to modify or filter it, and a converter to serialize the data into formats like Avro or JSON for Kafka topics. Conversely, a sink pipeline starts with a converter to deserialize Kafka data, followed by transformations and a connector to deliver it to an external system. This foundational explanation sets the stage for understanding where and how XML processing fits into these workflows, ensuring developers grasp the pipeline’s modular structure before diving into specific use cases.
Converting XML for Kafka Integration
A common challenge Dale addresses is integrating XML data from external systems, such as IBM MQ or XML-based web services, into Kafka’s ecosystem, which favors structured formats. He introduces the Kafka Connect plugin, available on GitHub under an Apache license, as a solution to parse XML into structured records early in the pipeline. For instance, using an IBM MQ source connector, the plugin can transform XML documents from a message queue into a generic structured format, allowing subsequent transformations and serialization into JSON or Avro. Dale demonstrates this with a weather API that returns XML strings, showing how the plugin converts these into structured objects for further processing, making them compatible with Kafka tools that struggle with raw XML. This approach significantly enhances the usability of external data within Kafka’s ecosystem.
Generating XML Outputs from Kafka
For scenarios where external systems require XML, Dale showcases the plugin’s ability to convert Kafka’s JSON or Avro messages into XML strings within a sink pipeline. He provides an example using a Kafka topic with JSON messages destined for an IBM MQ system, where the plugin, integrated as part of the sink connector, transforms structured data into XML before delivery. Another case involves an HTTP sink connector posting to an XML-based web service, such as an XML-RPC API. Here, the pipeline deserializes JSON, applies transformations to align with the API’s payload requirements, and uses the plugin to produce an XML string. This flexibility ensures seamless communication with legacy systems, bridging modern Kafka workflows with traditional XML-based infrastructure.
Enhancing Pipelines with Schema Support
Dale emphasizes the plugin’s schema handling capabilities, which add robustness to XML processing. In source pipelines, the plugin can reference an external XSD schema to validate and structure XML data, which is then paired with an Avro converter to submit schemas to a registry, ensuring compatibility with Kafka’s schema-driven ecosystem. In sink pipelines, enabling schema inclusion generates an XSD alongside the XML output, providing a clear description of the data’s structure. Dale illustrates this with a stock price connector, where enabling schema support produces XML events with accompanying XSDs, enhancing interoperability. This feature is particularly valuable for maintaining data integrity across systems, making the plugin a versatile tool for complex integration tasks.
Kafka Streams @ Carrefour: Lightning-Fast Big Data Processing
At Devoxx France 2022, François Sarradin and Jérémy Sebayhi, members of Carrefour's data teams, shared a 45-minute experience report on using Kafka Streams for real-time big data pipelines. François, technical lead at Moshi, and Jérémy, senior engineer at Carrefour, detailed their transition from batch systems built on Spark and Hadoop to reactive stream processing on Google Cloud Platform (GCP). Their talk covered the adoption of Kafka Streams for stock and price calculations, the challenges encountered, and the creative solutions they put in place. Find Carrefour at carrefour.com and Moshi at moshi.fr.
From Batch to Stream Processing
François and Jérémy opened by comparing batch and stream processing. Carrefour's legacy platform, dating from 2014, relied on Spark and Hadoop for batch jobs, treating data as files with clear inputs and outputs. Errors were manageable: correct the input files and rerun the pipeline. Streaming, by contrast, involves continuous event flows through Kafka topics, where errors must be handled in real time without disrupting the pipeline. A corrupted event cannot simply be deleted, and with historical data spanning years, reprocessing is impractical.
Kafka Streams, a reactive framework built on Apache Kafka, enabled Carrefour's move to stream processing. It leverages Kafka for scalable data transport and RocksDB for co-located, low-latency state storage. François explained that developers define topologies, directed acyclic graphs (DAGs) similar to Spark's, with operations such as map, flatMap, reduce, and join. Kafka Streams automatically manages topic creation, state stores, and resilience, simplifying development. Integration with GCP services (GCS, GKE, BigTable) and Carrefour's internal systems enabled real-time stock and price calculations at national scale.
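As an illustration of that topology style, here is a minimal Kafka Streams sketch in the spirit of Carrefour's stock calculations; the topic names, serdes, and reduce logic are assumptions, not their production code.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class StockTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Each event is a signed stock movement (delta) keyed by product id.
        KTable<String, Long> stockByProduct = builder
            .stream("stock-movements", Consumed.with(Serdes.String(), Serdes.Long()))
            .groupByKey()
            .reduce(Long::sum); // running stock level per key, state kept in RocksDB

        stockByProduct.toStream()
            .to("stock-levels", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stock-calculator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```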
Overcoming Adoption Challenges
Adopting Kafka Streams at Carrefour was not without obstacles. Jérémy noted that many teams lacked Kafka experience, but familiarity with Spark eased the transition, since both rely on similar transformation paradigms. Teams independently developed practices for monitoring, configuration, and deployment, later consolidated into shared best practices. This pragmatic approach created a common foundation for new projects, accelerating adoption.
The change demanded cultural adaptation beyond technical skills. Carrefour's data platform, handling massive volumes of high-velocity data (stocks, prices, orders), required a mindset shift from batch to reactive. Stream processing involves continuous joins against external databases, unlike the static datasets of batch jobs. François and Jérémy stressed the importance of early documentation and expert guidance for navigating Kafka Streams' complexities, especially during production rollouts.
Best Practices and Architectures
François and Jérémy shared the key practices that emerged over two years. For topic schemas, they used Schema Registry to type their data, preferring mandatory keys to keep partition assignment stable and avoiding optional key fields to prevent contract breakage. Message values included optional fields for flexibility, plus mandatory fields such as IDs and timestamps for debugging and event ordering.
Maintaining stateful topologies posed challenges. Adding new transformations (for example, a new data source) required reprocessing historical data, risking duplicate emissions. They proposed solutions such as blue-green deployments, where the new version builds its state without producing output until it is ready, or using compacted topics as snapshots that store only the latest state per key, as sketched below. These approaches minimized disruption but demanded rigorous planning, since blue-green deployments temporarily double resource requirements.
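A minimal sketch of the compacted-snapshot idea with the Java admin client; the topic name, partition count, and replication factor are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class SnapshotTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Compaction retains only the latest value per key, so the topic
            // acts as a snapshot a redeployed topology can rebuild state from.
            NewTopic snapshot = new NewTopic("stock-state-snapshot", 6, (short) 3)
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(snapshot)).all().get();
        }
    }
}
```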
Metrics and Monitoring
Monitoring Kafka Streams applications was crucial. François highlighted key metrics: lag (messages pending per topic and consumer group), which pinpoints contention; end-to-end latency, measuring processing time per topology node; and rebalance events, triggered by consumer-group changes, which can degrade performance. Carrefour used Prometheus to collect metrics and Grafana for dashboards, ensuring proactive problem detection. Jérémy stressed the importance of custom metrics exposed through a web layer for health checks, since Kafka Streams' JMX metrics are not always sufficient.
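A sketch of the kind of custom health check they describe, assuming a KafkaStreams instance exposed through the team's web layer; the exact state mapping is an assumption.

```java
import org.apache.kafka.streams.KafkaStreams;

public class StreamsHealth {
    private final KafkaStreams streams;

    public StreamsHealth(KafkaStreams streams) {
        this.streams = streams;
    }

    // Wire this into a readiness probe: RUNNING is healthy, REBALANCING is
    // tolerated as transient, and any other state fails the probe.
    public boolean isHealthy() {
        KafkaStreams.State state = streams.state();
        return state == KafkaStreams.State.RUNNING
            || state == KafkaStreams.State.REBALANCING;
    }
}
```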
They also covered deployment challenges, running on Kubernetes (GKE) with readiness probes to monitor application state. CPU over-allocation could delay health-check responses and trigger consumer-group evictions, hence the importance of precise resource tuning. François and Jérémy concluded by praising Kafka Streams' robust ecosystem (connectors, test libraries, documentation) while noting that its event-driven nature demands a mindset distinct from batch. Their experience at Carrefour demonstrated the power of Kafka Streams for large-scale real-time data, and they invited the audience to share their own experiences.
[DevoxxFR 2018] Apache Kafka: Beyond the Brokers – Exploring the Ecosystem
Apache Kafka is often recognized for its high-throughput, distributed messaging capabilities, but its power extends far beyond just the brokers. Florent Ramière from Confluent, a company significantly contributing to Kafka’s development, presented a comprehensive tour of the Kafka ecosystem at DevoxxFR2018. He aimed to showcase the array of open-source components that revolve around Kafka, enabling robust data integration, stream processing, and more.
Kafka Fundamentals and the Confluent Platform
Florent began with a quick refresher on Kafka’s core concept: an ordered, replayable log of messages (events) where consumers can read at their own pace from specific offsets. This design provides scalability, fault tolerance, and guaranteed ordering (within a partition), making it a cornerstone for event-driven architectures and handling massive data streams (Confluent sees clients handling up to 60 GB/s).
To get started, while Kafka involves several components like brokers and ZooKeeper, the Confluent Platform offers tools to simplify setup. The confluent CLI can start a local development environment with Kafka, ZooKeeper, Kafka SQL (ksqlDB), Schema Registry, and more with a single command. Docker images are also readily available for containerized deployments.
Kafka Connect: Bridging Kafka with External Systems
A significant part of the ecosystem is Kafka Connect, a framework for reliably streaming data between Kafka and other systems. Connectors act as sources (ingesting data into Kafka from databases, message queues, etc.) or sinks (exporting data from Kafka to data lakes, search indexes, analytics platforms, etc.). Florent highlighted the availability of numerous pre-built connectors for systems like JDBC databases, Elasticsearch, HDFS, S3, and Change Data Capture (CDC) tools.
He drew a parallel between Kafka Connect and Logstash, noting that while Logstash is excellent, Kafka Connect is designed as a distributed, fault-tolerant, and scalable service for these data integration tasks. It also supports transformations (e.g., filtering, renaming fields, anonymization) within the Connect pipeline, and is configured through a REST API. This makes it a powerful tool for building data pipelines without writing extensive custom code.
Stream Processing with Kafka Streams and ksqlDB
Once data is in Kafka, processing it in real-time is often the next step. Kafka Streams is a client library for building stream processing applications directly in Java (or Scala). Unlike frameworks like Spark or Flink that often require separate processing clusters, Kafka Streams applications are standalone Java applications that read from Kafka, process data, and can write results back to Kafka or external systems. This simplifies deployment and monitoring. Kafka Streams provides a rich DSL for operations like filtering, mapping, joining streams and tables (a table in Kafka Streams is a view of the latest value for each key in a stream), windowing, and managing state, with support for exactly-once processing semantics.
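For flavor, a hedged sketch of the DSL's windowed side; the topics and the one-minute window are illustrative, and the processing-guarantee setting assumes Kafka Streams 3.x, where EXACTLY_ONCE_V2 superseded the original exactly-once mode.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class ClickCounts {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Count clicks per user over one-minute tumbling windows.
        builder.stream("clicks", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count();

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Opt in to exactly-once processing.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        new KafkaStreams(builder.build(), props).start();
    }
}
```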
For those who prefer SQL to Java/Scala, ksqlDB (formerly Kafka SQL or KSQL) offers a SQL-like declarative language to define stream processing logic on top of Kafka topics. Users can create streams and tables from Kafka topics, perform continuous queries (SELECT statements that run indefinitely, emitting results as new data arrives), joins, aggregations over windows, and write results to new Kafka topics. ksqlDB runs as a separate server and uses Kafka Streams internally. It also manages stateful operations by storing state in RocksDB and backing it up to Kafka topics for fault tolerance. Florent emphasized that while ksqlDB is powerful for many use cases, complex UDFs or very intricate logic might still be better suited for Kafka Streams directly.
Schema Management and Other Essential Tools
When dealing with structured data in Kafka, especially in evolving systems, schema management becomes crucial. The Confluent Schema Registry helps manage and enforce schemas (typically Avro, but also Protobuf and JSON Schema) for messages in Kafka topics. It ensures schema compatibility (e.g., backward, forward, full compatibility) as schemas evolve, preventing data quality issues and runtime errors in producers and consumers. REST Proxy allows non-JVM applications to produce and consume messages via HTTP. Kafka also ships with command-line tools for performance testing (e.g., kafka-producer-perf-test, kafka-consumer-perf-test), latency checking, and inspecting consumer group lags, which are vital for operations and troubleshooting. Effective monitoring, often using JMX metrics exposed by Kafka components fed into systems like Prometheus via JMX Exporter or Jolokia, is also critical for production deployments.
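To show how a producer plugs into the Schema Registry, here is a hedged configuration sketch assuming Confluent's Avro serializer is on the classpath; the registry URL is illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class AvroProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // The Confluent serializer registers or fetches the schema and prepends
        // the magic byte plus schema id to every payload.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }
}
```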
Florent concluded by encouraging exploration of the Confluent Platform demos and his “kafka-story” GitHub repository, which provide step-by-step examples of these ecosystem components.
Links:
- Confluent Platform
- Apache Kafka Connect Documentation
- Apache Kafka Streams Documentation
- ksqlDB (formerly KSQL) Website
- Confluent Schema Registry Documentation
- Florent Ramière’s Kafka Story on GitHub
- Confluent Blog
Hashtags: #ApacheKafka #KafkaConnect #KafkaStreams #ksqlDB #Confluent #StreamProcessing #DataIntegration #DevoxxFR2018 #FlorentRamiere #EventDrivenArchitecture #Microservices #BigData
[DevoxxFR2015] Advanced Streaming with Apache Kafka
Jonathan Winandy and Alexis Guéganno, co-founder and operations director at Valwin, respectively, presented a deep dive into advanced Apache Kafka streaming techniques at Devoxx France 2015. With expertise in distributed systems and data warehousing, they explored how Kafka enables flexible, high-performance real-time streaming beyond basic JSON payloads.
Foundations of Streaming
Jonathan opened with a concise overview of streaming, emphasizing Kafka’s role in real-time distributed systems. He explained how Kafka’s topic-based architecture supports high-throughput data pipelines. Their session moved beyond introductory concepts, focusing on advanced writing, modeling, and querying techniques to ensure robust, future-proof streaming solutions.
This foundation, Jonathan noted, sets the stage for scalability.
Advanced Modeling and Querying
Alexis detailed Kafka's ability to handle structured data, moving past schemaless JSON, and showcased techniques for defining schemas and optimizing queries to improve performance and maintainability. The Q&A revealed their use of a five-node cluster for fault tolerance, sufficient for basic journaling but scalable to hundreds of nodes for larger workloads.
These methods, Alexis highlighted, enhance data reliability.
Managing Kafka Clusters
Jonathan addressed cluster management, noting that five nodes ensure fault tolerance, while larger clusters handle extensive partitioning. They discussed load balancing and lag management, critical for high-volume environments. The session also covered Kafka’s integration with databases, enabling real-time data synchronization.
This scalability, Jonathan concluded, supports diverse use cases.
Community Engagement and Resources
The duo encouraged engagement through Scala.IO, where Jonathan organizes, and shared Valwin’s expertise in data solutions. Their insights into cluster sizing and health monitoring, particularly in regulated sectors like healthcare, underscored Kafka’s versatility.
This session equips developers for advanced streaming challenges.