[DevoxxFR2012] Optimizing Resource Utilization: A Deep Dive into JVM, OS, and Hardware Interactions
Lecturers
Ben Evans and Martijn Verburg are titans of the Java performance community. Ben, co-author of The Well-Grounded Java Developer and a Java Champion, has spent over a decade dissecting JVM internals, GC algorithms, and hardware interactions. Martijn, known as the “Diabolical Developer,” co-leads the London Java User Group, serves on the JCP Executive Committee, and advocates for developer productivity and open-source tooling. Together, they have shaped modern Java performance practices through books, tools, and conference talks that bridge the gap between application code and silicon.
Abstract
This exhaustive exploration revisits Ben Evans and Martijn Verburg’s seminal 2012 DevoxxFR presentation on JVM resource utilization, expanding it with a decade of subsequent advancements. The core thesis remains unchanged: Java’s “write once, run anywhere” philosophy comes at the cost of opacity—developers deploy applications across diverse hardware without understanding how efficiently they consume CPU, memory, power, or I/O. This article dissects the three-layer stack—JVM, Operating System, and Hardware—to reveal how Java applications interact with modern CPUs, memory hierarchies, and power management systems. Through diagnostic tools (jHiccup, SIGAR, JFR), tuning strategies (NUMA awareness, huge pages, GC selection), and cloud-era considerations (vCPU abstraction, noisy neighbors), it provides a comprehensive playbook for achieving 90%+ CPU utilization and minimal power waste. Updated for 2025, this piece incorporates ZGC’s generational mode, Project Loom’s virtual threads, ARM Graviton processors, and green computing initiatives, offering a forward-looking vision for sustainable, high-performance Java in the cloud.
The Abstraction Tax: Why Java Hides Hardware Reality
Java’s portability is its greatest strength and its most significant performance liability. The JVM abstracts away CPU architecture, memory layout, and power states to ensure identical behavior across x86, ARM, and PowerPC. But this abstraction hides critical utilization metrics:
– A Java thread may appear busy but spend 80% of its time in GC pause or context switching.
– A 64-core server running 100 Java processes might achieve only 10% aggregate CPU utilization due to lock contention and GC thrashing.
– Power consumption in data centers—roughly 2% of U.S. electricity in the early 2010s and growing steadily—is driven largely by underutilized hardware.
Ben and Martijn argue that visibility is the prerequisite for optimization. Without knowing how resources are used, tuning is guesswork.
Layer 1: The JVM – Where Java Meets the Machine
The HotSpot JVM is a marvel of adaptive optimization, but its default settings prioritize predictability over peak efficiency.
Garbage Collection: The Silent CPU Thief
GC is often the largest single source of CPU waste in Java applications. Even “low-pause” collectors like CMS (deprecated in JDK 9, removed in JDK 14) introduce stop-the-world phases that halt all application threads.
// Illustrative CMS-era GC log lines (pre-unified-logging format)
[GC (CMS Initial Mark) 1024K->768K(2048K), 0.0123456 secs]
[Full GC (Allocation Failure) 1800K->1200K(2048K), 0.0987654 secs]
Martijn demonstrates how a 10ms pause every 100ms reduces effective CPU capacity by 10%. In 2025, ZGC and Shenandoah achieve sub-millisecond pauses even at 1TB heaps:
-XX:+UseZGC -XX:+ZGenerational
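The arithmetic behind that capacity loss is worth making explicit. A toy back-of-envelope model (illustrative only, not a HotSpot API):

```java
// Toy model: fraction of wall-clock time left for application work
// when the collector pauses for pauseMs out of every intervalMs.
public class GcOverhead {
    static double effectiveCapacity(double pauseMs, double intervalMs) {
        return 1.0 - (pauseMs / intervalMs);
    }

    public static void main(String[] args) {
        // Martijn's example: a 10 ms pause every 100 ms
        System.out.println(effectiveCapacity(10, 100));  // prints 0.9 -> 90% of CPU left
        // A sub-millisecond ZGC pause at the same interval
        System.out.println(effectiveCapacity(0.5, 100)); // prints 0.995
    }
}
```

The same model explains why shortening pauses matters more than shortening GC cycles: it is the pause-to-interval ratio, not the absolute pause, that sets the ceiling on utilization.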
JIT Compilation and Code Cache
The JIT compiler generates machine code on-the-fly, but code cache eviction under memory pressure forces recompilation:
-XX:ReservedCodeCacheSize=512m -XX:+PrintCodeCache
Ben recommends tiered compilation (-XX:+TieredCompilation, the default since JDK 8) to balance warmup and peak performance.
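Warmup is easy to observe directly. A crude sketch (timings are machine-dependent and illustrative; run with -XX:+PrintCompilation to watch the compiler kick in):

```java
// Crude illustration of JIT warmup: the same method usually gets faster
// once HotSpot has compiled it with C1/C2.
public class Warmup {
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long) i * i; // sum of squares
        return sum;
    }

    static long time() {
        long t0 = System.nanoTime();
        work(1_000_000);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        long cold = time();                               // likely interpreted
        for (int i = 0; i < 20_000; i++) work(1_000);     // trigger compilation
        long warm = time();                               // likely JIT-compiled
        System.out.printf("cold=%dns warm=%dns%n", cold, warm);
    }
}
```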
Threading and Virtual Threads (2025 Update)
Traditional Java threads map 1:1 to OS threads, each reserving roughly 1 MB of stack by default and paying context-switch costs. Project Loom’s virtual threads, finalized in Java 21, change the model:
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    IntStream.range(0, 100_000).forEach(i ->
        executor.submit(() -> blockingIO())); // blockingIO() = any blocking call (JDBC, HTTP, file I/O)
}
This enables millions of concurrent tasks with minimal OS overhead, saturating CPU without thread explosion.
Layer 2: The Operating System – Scheduler, Memory, and Power
The OS mediates between JVM and hardware, introducing scheduling, caching, and power management policies.
CPU Scheduling and Affinity
Linux’s CFS scheduler fairly distributes CPU time, but noisy neighbors in multi-tenant environments cause jitter. CPU affinity pins JVMs to cores:
taskset -c 0-7 java -jar app.jar
In NUMA systems, memory locality is critical:
// Affinity can also be set from within Java via JNI/JNA bindings to sched_setaffinity(2)
Memory Management: RSS vs. USS
Resident Set Size (RSS) includes shared libraries, inflating perceived usage. Unique Set Size (USS) is more accurate:
smem -t -k -P java   # -P filters by process name; -t totals, -k human-readable units
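A process can also inspect its own resident set from inside the JVM. A Linux-only sketch (VmRSS in /proc/self/status is the same RSS figure smem and top report):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class Rss {
    // Returns the VmRSS line from /proc/self/status (Linux only).
    static String rssLine() throws IOException {
        return Files.readAllLines(Path.of("/proc/self/status")).stream()
                .filter(l -> l.startsWith("VmRSS"))
                .findFirst()
                .orElse("VmRSS: unavailable");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(rssLine()); // e.g. "VmRSS:    34560 kB"
    }
}
```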
Huge pages reduce TLB misses:
-XX:+UseLargePages -XX:LargePageSizeInBytes=2m
Power Management: P-States and C-States
CPUs dynamically adjust frequency (P-states) and drop into idle states (C-states). Java has no direct control over either: a busy-spinning thread keeps its core out of deep sleep (trading power for latency), while an idle JVM lets frequencies drop and wake-up latency creep in. The flags below do not manage power directly—they pre-commit heap pages and allocate memory NUMA-locally, removing page-fault and remote-memory stalls that are easily mistaken for power-state jitter:
-XX:+AlwaysPreTouch -XX:+UseNUMA
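One spin-related primitive Java does expose is Thread.onSpinWait() (Java 9+, JEP 285). It emits the platform’s spin-wait hint (PAUSE on x86), which lowers the spinning core’s power draw and frees pipeline resources for its SMT sibling without yielding the CPU. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWait {
    // Spin until the flag becomes true, hinting the CPU that this is a busy-wait.
    static void awaitBusy(AtomicBoolean flag) {
        while (!flag.get()) {
            Thread.onSpinWait(); // x86: PAUSE—saves power, eases hyper-thread contention
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean ready = new AtomicBoolean(false);
        Thread waiter = new Thread(() -> {
            awaitBusy(ready);
            System.out.println("event observed");
        });
        waiter.start();
        Thread.sleep(10);   // simulate the awaited event arriving
        ready.set(true);
        waiter.join();
    }
}
```

Busy-waiting remains a deliberate trade: it buys the lowest wake-up latency at the cost of a core that never enters a deep C-state.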
Layer 3: The Hardware – Cores, Caches, and Power
Modern CPUs are complex hierarchies of cores, caches, and interconnects.
Cache Coherence and False Sharing
Adjacent fields in objects can reside on the same cache line, causing false sharing:
class Counters {
    volatile long c1; // written by thread A
    volatile long c2; // written by thread B—likely on the same 64-byte cache line!
}
Padding, or the @Contended annotation (sun.misc.Contended in Java 8; jdk.internal.vm.annotation.Contended since JDK 9; non-JDK code also needs -XX:-RestrictContended), resolves this:
@Contended
public class PaddedLong { public volatile long value; }
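Before @Contended existed, the standard workaround was manual padding. A hedged sketch (field names are illustrative; the JVM may reorder fields, so padding is a heuristic rather than a guarantee—verify the actual layout with a tool such as OpenJDK’s JOL):

```java
// Manual cache-line padding: surround the hot field with unused longs so
// that two independently-written counters cannot share a 64-byte line.
public class PaddedCounter {
    long p1, p2, p3, p4, p5, p6, p7;   // padding before (assumes 64-byte lines)
    public volatile long value;        // the contended field
    long q1, q2, q3, q4, q5, q6, q7;   // padding after

    public static void main(String[] args) {
        PaddedCounter c = new PaddedCounter();
        c.value = 42;
        System.out.println(c.value);
    }
}
```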
NUMA and Memory Bandwidth
Non-Uniform Memory Access means local memory is 2–3x faster than remote. JVMs should bind threads to NUMA nodes:
numactl --cpunodebind=0 --membind=0 java -jar app.jar
Diagnostics: Making the Invisible Visible
jHiccup: Measuring Pause Times
java -javaagent:jHiccup.jar -jar app.jar
Generates histograms of application pauses, revealing GC and OS scheduling hiccups.
Java Flight Recorder (JFR)
-XX:StartFlightRecording=duration=60s,filename=app.jfr
Captures CPU, GC, I/O, and lock contention with <1% overhead.
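Recordings can also be driven programmatically through the jdk.jfr API (JDK 11+). A minimal sketch that captures GC events around a workload:

```java
import jdk.jfr.Recording;
import java.nio.file.Path;

public class JfrDemo {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.GarbageCollection"); // built-in JFR event name
            recording.start();
            // ... run the workload to be profiled ...
            System.gc();                               // provoke at least one GC event
            recording.stop();
            recording.dump(Path.of("app.jfr"));        // write the recording to disk
        }
        System.out.println("recording written to app.jfr");
    }
}
```

The resulting file opens in JDK Mission Control, just like one produced by the -XX:StartFlightRecording flag.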
async-profiler and Flame Graphs
./profiler.sh -e cpu -d 60 -f flame.svg <pid>
Visualizes hot methods and inlining decisions.
Cloud and Green Computing: The Ultimate Utilization Challenge
In cloud environments, vCPUs are abstractions—often half-cores with hyper-threading. Noisy neighbors cause 50%+ variance in performance.
Green Computing Initiatives
– Facebook’s Open Compute Project: 38% more efficient servers.
– Google’s Borg: 90%+ cluster utilization via bin packing.
– ARM Graviton3: 20% better perf/watt than x86.
Spot Markets for Compute (2025 Vision)
Ben and Martijn foresee a commodity market for compute cycles, enabled by:
– Live migration via CRIU.
– Standardized pricing (e.g., $0.001 per CPU-second).
– Java’s portability as the ideal runtime.
Conclusion: Toward a Sustainable Java Future
Evans and Verburg’s central message endures: Utilization is a systems problem. Achieving 90%+ CPU efficiency requires coordination across JVM tuning, OS configuration, and hardware awareness. In 2025, tools like ZGC, Loom, and JFR have made this more achievable than ever, but the principles remain:
– Measure everything (JFR, async-profiler).
– Tune aggressively (GC, NUMA, huge pages).
– Design for the cloud (elastic scaling, spot instances).
By making the invisible visible, Java developers can build faster, cheaper, and greener applications—ensuring Java’s dominance in the cloud-native era.