Posts Tagged ‘PerformanceTuning’
[DevoxxBE2024] A Kafka Producer’s Request: Or, There and Back Again by Danica Fine
Danica Fine, a developer advocate at Confluent, took Devoxx Belgium 2024 attendees on a captivating journey through the lifecycle of a Kafka producer’s request. Her talk demystified the complex process of getting data into Apache Kafka, often treated as a black box by developers. Using a Hobbit-themed example, Danica traced a producer.send() call from client to broker and back, detailing configurations and metrics that impact performance and reliability. By breaking down serialization, partitioning, batching, and broker-side processing, she equipped developers with tools to debug issues and optimize workflows, making Kafka less intimidating and more approachable.
Preparing the Journey: Serialization and Partitioning
Danica began with a simple schema for tracking Hobbit whereabouts, stored in a topic with six partitions and a replication factor of three. The first step in producing data is serialization, converting objects into bytes for brokers, controlled by key and value serializers. Misconfigurations here can lead to errors, so monitoring serialization metrics is crucial. Next, partitioning determines which partition receives the data. The default partitioner uses a key’s hash or sticky partitioning for keyless records to distribute data evenly. Configurations like partitioner.class, partitioner.ignore.keys, and partitioner.adaptive.partitioning.enable allow fine-tuning, with adaptive partitioning favoring faster brokers to avoid hot partitions, especially in high-throughput scenarios like financial services.
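To make the key-to-partition step concrete, here is a minimal, self-contained sketch of hash-based partition selection in the spirit of Kafka's default partitioner. Note the assumption: the real client hashes key bytes with murmur2, while this sketch uses a plain array hash purely for illustration — the class and method names are hypothetical, not Kafka APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch of key-based partition selection. Kafka's default
// partitioner hashes the key bytes (murmur2) and takes the result modulo
// the partition count, so the same key always lands on the same partition.
public class PartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);      // stand-in for murmur2
        return (hash & 0x7fffffff) % numPartitions; // clear sign bit, then mod
    }

    public static void main(String[] args) {
        // Same key, same partition of a six-partition topic — deterministic.
        System.out.println(partitionFor("frodo", 6) == partitionFor("frodo", 6)); // true
    }
}
```

The determinism is the point: keyed records for the same Hobbit always route to the same partition, which is what preserves per-key ordering.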
Batching for Efficiency
To optimize throughput, Kafka groups records into batches before sending them to brokers. Danica explained key configurations: batch.size (default 16KB) sets the maximum batch size, while linger.ms (default 0) controls how long to wait to fill a batch. Setting linger.ms above zero introduces latency but reduces broker load by sending fewer requests. buffer.memory (default 32MB) allocates space for batches, and misconfigurations can cause memory issues. Metrics like batch-size-avg, records-per-request-avg, and buffer-available-bytes help monitor batching efficiency, ensuring optimal throughput without overwhelming the client.
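The interplay of batch.size and record size can be sketched as a toy model — this is not Kafka's implementation, just an illustration of how a byte-bounded batch fills and flushes (the class name and fields are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of producer batching: records accumulate until adding one more
// would exceed the byte cap (mirroring batch.size); a full batch is "sent"
// and a new one started. linger.ms, not modeled here, bounds the wait time.
public class BatchSketch {
    final int batchSizeBytes;
    final List<byte[]> current = new ArrayList<>();
    int currentBytes = 0;
    int batchesSent = 0;

    BatchSketch(int batchSizeBytes) { this.batchSizeBytes = batchSizeBytes; }

    void append(byte[] record) {
        if (currentBytes + record.length > batchSizeBytes && !current.isEmpty()) {
            flush(); // batch full: dispatch it, start a new one
        }
        current.add(record);
        currentBytes += record.length;
    }

    void flush() {
        batchesSent++;
        current.clear();
        currentBytes = 0;
    }

    public static void main(String[] args) {
        BatchSketch b = new BatchSketch(16_384); // the 16KB default
        for (int i = 0; i < 100; i++) b.append(new byte[1_000]); // 1KB records
        b.flush();
        System.out.println(b.batchesSent); // 100KB at 16 records/batch -> 7 batches
    }
}
```

Seven requests instead of one hundred is the throughput win batching buys; raising linger.ms trades a little latency for even fuller batches.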
Sending the Request: Configurations and Metrics
Once batched, data is sent via a produce request over TCP, with configurations like max.request.size (default 1MB) limiting batch volume and acks determining how many replicas must acknowledge the write. Setting acks to “all” ensures high durability but increases latency, while acks=1 or 0 prioritizes speed. enable.idempotence and transactional.id prevent duplicates, with transactions ensuring consistency across sessions. Metrics like request-rate, requests-in-flight, and request-latency-avg provide visibility into request performance, helping developers identify bottlenecks or overloaded brokers.
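A durability-leaning version of these settings might look like the sketch below. The property keys are Kafka's; the chosen values are illustrative, not universal recommendations, and the class name is hypothetical.

```java
import java.util.Properties;

// A durability-leaning producer configuration, expressed as the
// java.util.Properties map the Kafka client accepts.
public class DurableProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("acks", "all");                 // wait for all in-sync replicas
        props.put("enable.idempotence", "true");  // broker de-duplicates retries
        props.put("max.request.size", "1048576"); // 1MB default cap per request
        props.put("linger.ms", "5");              // small wait to fill batches
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("acks")); // all
    }
}
```

Flipping acks to "1" or "0" in the same map is the latency-over-durability trade Danica described.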
Broker-Side Processing: From Socket to Disk
On the broker, requests enter the socket receive buffer, then are processed by network threads (default 3) and added to the request queue. IO threads (default 8) validate data with a cyclic redundancy check and write it to the page cache, which the OS later flushes to disk. Configurations like num.network.threads, num.io.threads, and queued.max.requests control thread and queue sizes, with metrics like network-processor-avg-idle-percent and request-handler-avg-idle-percent indicating thread utilization. Data is stored in a commit log with log, index, and snapshot files, supporting efficient retrieval and idempotency. Metrics like log-flush-rate and local-time-ms help confirm that writes reach durable storage promptly.
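The thread and queue settings above live in the broker's server.properties; the values below are the defaults described in the talk, shown here for reference:

```properties
# Request-path sizing on the broker (values shown are the defaults).
# Network threads move bytes between sockets and the request/response queues.
num.network.threads=3
# IO threads validate (CRC) and append to the commit log.
num.io.threads=8
# Bound on the request queue before network threads stop reading.
queued.max.requests=500
```

Raising the thread counts helps only while the corresponding idle-percent metrics are low; otherwise the bottleneck is elsewhere.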
Replication and Response: Completing the Journey
Unfinished requests await replication in a “purgatory” data structure, with follower brokers fetching updates every 500ms (often faster). The remote-time-ms metric tracks replication duration, critical for acks=all. Once replicated, the broker builds a response, handled by network threads and queued in the response queue. Metrics like response-queue-time-ms and total-time-ms measure the full request lifecycle. Danica emphasized that understanding these stages empowers developers to collaborate with operators, tweaking configurations like default.replication.factor or topic-level settings to optimize performance.
Empowering Developers with Kafka Knowledge
Danica concluded by encouraging developers to move beyond treating Kafka as a black box. By mastering configurations and monitoring metrics, they can proactively address issues, from serialization errors to replication delays. Her talk highlighted resources like Confluent Developer for guides and courses on Kafka internals. This knowledge not only simplifies debugging but also fosters better collaboration with operators, ensuring robust, efficient data pipelines.
[DevoxxFR2012] Optimizing Resource Utilization: A Deep Dive into JVM, OS, and Hardware Interactions
Lecturers
Ben Evans and Martijn Verburg are titans of the Java performance community. Ben, co-author of The Well-Grounded Java Developer and a Java Champion, has spent over a decade dissecting JVM internals, GC algorithms, and hardware interactions. Martijn, known as the “Diabolical Developer,” co-leads the London Java User Group, serves on the JCP Executive Committee, and advocates for developer productivity and open-source tooling. Together, they have shaped modern Java performance practices through books, tools, and conference talks that bridge the gap between application code and silicon.
Abstract
This exhaustive exploration revisits Ben Evans and Martijn Verburg’s seminal 2012 DevoxxFR presentation on JVM resource utilization, expanding it with a decade of subsequent advancements. The core thesis remains unchanged: Java’s “write once, run anywhere” philosophy comes at the cost of opacity—developers deploy applications across diverse hardware without understanding how efficiently they consume CPU, memory, power, or I/O. This article dissects the three-layer stack—JVM, Operating System, and Hardware—to reveal how Java applications interact with modern CPUs, memory hierarchies, and power management systems. Through diagnostic tools (jHiccup, SIGAR, JFR), tuning strategies (NUMA awareness, huge pages, GC selection), and cloud-era considerations (vCPU abstraction, noisy neighbors), it provides a comprehensive playbook for achieving 90%+ CPU utilization and minimal power waste. Updated for 2025, this piece incorporates ZGC’s generational mode, Project Loom’s virtual threads, ARM Graviton processors, and green computing initiatives, offering a forward-looking vision for sustainable, high-performance Java in the cloud.
The Abstraction Tax: Why Java Hides Hardware Reality
Java’s portability is its greatest strength and its most significant performance liability. The JVM abstracts away CPU architecture, memory layout, and power states to ensure identical behavior across x86, ARM, and PowerPC. But this abstraction hides critical utilization metrics:
– A Java thread may appear busy but spend 80% of its time in GC pause or context switching.
– A 64-core server running 100 Java processes might achieve only 10% aggregate CPU utilization due to lock contention and GC thrashing.
– Power consumption in data centers—8% of U.S. electricity in 2012, projected at 13% by 2030—is driven by underutilized hardware.
Ben and Martijn argue that visibility is the prerequisite for optimization. Without knowing how resources are used, tuning is guesswork.
Layer 1: The JVM – Where Java Meets the Machine
The HotSpot JVM is a marvel of adaptive optimization, but its default settings prioritize predictability over peak efficiency.
Garbage Collection: The Silent CPU Thief
GC is the largest source of CPU waste in Java applications. Even “low-pause” collectors like CMS introduce stop-the-world phases that halt all application threads.
// Example: CMS GC log
[GC (CMS Initial Mark) 1024K->768K(2048K), 0.0123456 secs]
[Full GC (Allocation Failure) 1800K->1200K(2048K), 0.0987654 secs]
Martijn demonstrates how a 10ms pause every 100ms reduces effective CPU capacity by 10%. In 2025, ZGC and Shenandoah achieve sub-millisecond pauses even at 1TB heaps:
-XX:+UseZGC -XX:ZCollectionInterval=100
JIT Compilation and Code Cache
The JIT compiler generates machine code on-the-fly, but code cache eviction under memory pressure forces recompilation:
-XX:ReservedCodeCacheSize=512m -XX:+PrintCodeCache
Ben recommends tiered compilation (-XX:+TieredCompilation) to balance warmup and peak performance.
Threading and Virtual Threads (2025 Update)
Traditional Java threads map 1:1 to OS threads, incurring 1MB stack overhead and context switch costs. Project Loom introduces virtual threads in Java 21:
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    IntStream.range(0, 100_000).forEach(i ->
        executor.submit(() -> blockingIO())); // one cheap virtual thread per task
}
This enables millions of concurrent tasks with minimal OS overhead, saturating CPU without thread explosion.
Layer 2: The Operating System – Scheduler, Memory, and Power
The OS mediates between JVM and hardware, introducing scheduling, caching, and power management policies.
CPU Scheduling and Affinity
Linux’s CFS scheduler fairly distributes CPU time, but noisy neighbors in multi-tenant environments cause jitter. CPU affinity pins JVMs to cores:
taskset -c 0-7 java -jar app.jar
In NUMA systems, memory locality is critical. The JDK exposes no thread-affinity API, so pinning from within Java requires a native call such as sched_setaffinity (e.g., via JNA).
Memory Management: RSS vs. USS
Resident Set Size (RSS) includes shared libraries, inflating perceived usage. Unique Set Size (USS) is more accurate:
smem -t -k -P <process-name>
Huge pages reduce TLB misses:
-XX:+UseLargePages -XX:LargePageSizeInBytes=2m
Power Management: P-States and C-States
CPUs dynamically adjust frequency (P-states) and enter sleep states (C-states). Java has no direct control over either, though busy-spinning threads keep cores out of deep sleep. JVM flags that shift work to startup, such as pre-faulting the heap and NUMA-aware allocation, help keep runtime behavior steady:
-XX:+AlwaysPreTouch -XX:+UseNUMA
Layer 3: The Hardware – Cores, Caches, and Power
Modern CPUs are complex hierarchies of cores, caches, and interconnects.
Cache Coherence and False Sharing
Adjacent fields in objects can reside on the same cache line, causing false sharing:
class Counters {
volatile long c1; // cache line 1
volatile long c2; // same cache line!
}
Padding or the @Contended annotation (Java 8+; application code must also pass -XX:-RestrictContended) resolves this:
@Contended
public class PaddedLong { public volatile long value; }
NUMA and Memory Bandwidth
Non-Uniform Memory Access means local memory is 2–3x faster than remote. JVMs should bind threads to NUMA nodes:
numactl --cpunodebind=0 --membind=0 java -jar app.jar
Diagnostics: Making the Invisible Visible
jHiccup: Measuring Pause Times
java -javaagent:jHiccup.jar="-i 1000 -w 5000" -jar app.jar
Generates histograms of application pauses, revealing GC and OS scheduling hiccups.
Java Flight Recorder (JFR)
-XX:StartFlightRecording=duration=60s,filename=app.jfr
Captures CPU, GC, I/O, and lock contention with <1% overhead.
async-profiler and Flame Graphs
./profiler.sh -e cpu -d 60 -f flame.svg <pid>
Visualizes hot methods and inlining decisions.
Cloud and Green Computing: The Ultimate Utilization Challenge
In cloud environments, vCPUs are abstractions—often half-cores with hyper-threading. Noisy neighbors cause 50%+ variance in performance.
Green Computing Initiatives
– Facebook’s Open Compute Project: 38% more efficient servers.
– Google’s Borg: 90%+ cluster utilization via bin packing.
– ARM Graviton3: 20% better perf/watt than x86.
Spot Markets for Compute (2025 Vision)
Ben and Martijn foresee a commodity market for compute cycles, enabled by:
– Live migration via CRIU.
– Standardized pricing (e.g., $0.001 per CPU-second).
– Java’s portability as the ideal runtime.
Conclusion: Toward a Sustainable Java Future
Evans and Verburg’s central message endures: Utilization is a systems problem. Achieving 90%+ CPU efficiency requires coordination across JVM tuning, OS configuration, and hardware awareness. In 2025, tools like ZGC, Loom, and JFR have made this more achievable than ever, but the principles remain:
– Measure everything (JFR, async-profiler).
– Tune aggressively (GC, NUMA, huge pages).
– Design for the cloud (elastic scaling, spot instances).
By making the invisible visible, Java developers can build faster, cheaper, and greener applications—ensuring Java’s dominance in the cloud-native era.