Posts Tagged ‘GarbageCollection’
[KotlinConf2025] The Life and Death of a Kotlin Native Object
The journey of an object within a computer’s memory is a topic that is often obscured from the everyday developer. In a highly insightful session, Troels Lund, a leader on the Kotlin/Native team at Google, delves into the intricacies of what transpires behind the scenes when an object is instantiated and subsequently discarded within the Kotlin/Native runtime. This detailed examination provides a compelling look at a subject that is usually managed automatically, demonstrating the sophisticated mechanisms at play to ensure efficient memory management and robust application performance.
The Inner Workings of the Runtime
Lund begins by exploring the foundational elements of the Kotlin/Native runtime, highlighting its role in bridging the gap between high-level Kotlin code and the native environment. The runtime is responsible for a variety of critical tasks, including memory layout, garbage collection, and managing object lifecycles. One of the central tenets of this system is its ability to handle memory allocation and deallocation with minimal developer intervention. The talk illustrates how an object’s structure is precisely defined in memory, a crucial step for both performance and predictability. This low-level perspective offers a new appreciation for the seamless operation that developers have come to expect.
A Deep Dive into Garbage Collection
The talk then progresses to the sophisticated mechanisms of garbage collection. A deep dive into the Kotlin/Native memory model reveals a system designed for both performance and concurrency. Lund describes two collector configurations: a parallel mark with concurrent sweep, and a concurrent mark and sweep. The first maximizes throughput by parallelizing the marking phase across threads, while the second minimizes pause times by letting marking proceed alongside application execution. The session details how these processes identify and reclaim memory from objects that are no longer in use, preventing memory leaks and maintaining system stability. The discussion also touches upon weak references and their role in memory management. Lund explains how these references are cleared in a timely manner, ensuring that objects due for collection are not inadvertently resurrected.
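Weak-reference semantics are easiest to see in running code. Kotlin/Native's runtime differs internally, but the JVM's java.lang.ref.WeakReference behaves analogously to what Lund describes; the sketch below is a JVM-side illustration of those semantics, not Kotlin/Native runtime code:

```java
import java.lang.ref.WeakReference;

public class WeakRefDemo {
    // Returns true once the collector has cleared the weak reference.
    public static boolean weakRefCleared() {
        Object payload = new Object();
        WeakReference<Object> ref = new WeakReference<>(payload);
        if (ref.get() == null) return false; // strongly reachable: must still be present
        payload = null; // drop the only strong reference
        for (int i = 0; i < 1000 && ref.get() != null; i++) {
            System.gc(); // a hint; clearing happens at the collector's discretion
        }
        return ref.get() == null;
    }

    public static void main(String[] args) {
        System.out.println("cleared: " + weakRefCleared());
    }
}
```

Because the weak reference is cleared before the object's memory is reclaimed, no code path can observe (or resurrect) a half-collected object, which is the timeliness guarantee Lund emphasizes.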
Final Thoughts on the Runtime
In his concluding remarks, Lund offers a final summary of the Kotlin/Native runtime. He reiterates that this is a snapshot of what is happening now, and that the details are subject to change over time as new features are added and existing ones are optimized. He emphasizes that the goal of the team is to ensure that the developer experience is as smooth and effortless as possible, with the intricate details of memory management handled transparently by the runtime. The session serves as a powerful reminder of the complex engineering that underpins the simplicity and elegance of the Kotlin language, particularly in its native context.
[DevoxxFR2025] Boosting Java Application Startup Time: JVM and Framework Optimizations
In the world of modern application deployment, particularly in cloud-native and microservice architectures, fast startup time is a crucial factor impacting scalability, resilience, and cost efficiency. Slow-starting applications can delay deployments, hinder auto-scaling responsiveness, and consume resources unnecessarily. Olivier Bourgain, in his presentation, delved into strategies for significantly accelerating the startup time of Java applications, focusing on optimizations at both the Java Virtual Machine (JVM) level and within popular frameworks like Spring Boot. He explored techniques ranging from garbage collection tuning to leveraging emerging technologies like OpenJDK’s Project Leyden and Spring AOT (Ahead-of-Time Compilation) to make Java applications lighter, faster, and more efficient from the moment they start.
The Importance of Fast Startup
Olivier began by explaining why fast startup time matters in modern environments. In microservices architectures, applications are frequently started and stopped as part of scaling events, deployments, or rolling updates. A slow startup adds to the time it takes to scale up to handle increased load, potentially leading to performance degradation or service unavailability. In serverless or function-as-a-service environments, cold starts (the time it takes for an idle instance to become ready) are directly impacted by application startup time, affecting latency and user experience. Faster startup also improves developer productivity by reducing the waiting time during local development and testing cycles. Olivier emphasized that optimizing startup time is no longer just a minor optimization but a fundamental requirement for efficient cloud-native deployments.
JVM and Garbage Collection Optimizations
Optimizing the JVM configuration and understanding garbage collection behavior are foundational steps in improving Java application startup. Olivier discussed how different garbage collectors (like G1, Parallel, or ZGC) can impact startup time and memory usage. Tuning JVM arguments related to heap size, garbage collection pauses, and just-in-time (JIT) compilation tiers can influence how quickly the application becomes responsive. While JIT compilation is crucial for long-term performance, it can introduce startup overhead as the JVM analyzes and optimizes code during initial execution. Techniques like Class Data Sharing (CDS) were mentioned as a way to reduce startup time by sharing pre-processed class metadata between multiple JVM instances. Olivier provided practical tips and configurations for optimizing JVM settings specifically for faster startup, balancing it with overall application performance.
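As a concrete illustration of the CDS technique mentioned above, the application-level variant (AppCDS, available since JDK 13) can be driven with two flags; the jar and archive names here are placeholders:

```shell
# 1. Run the application once and dump the loaded-class archive on exit:
java -XX:ArchiveClassesAtExit=app.jsa -jar app.jar

# 2. Subsequent starts map the pre-parsed class metadata from the archive,
#    skipping much of the class loading and verification work:
java -XX:SharedArchiveFile=app.jsa -jar app.jar
```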
Framework Optimizations: Spring Boot and Beyond
Popular frameworks like Spring Boot, while providing immense productivity benefits, can sometimes contribute to longer startup times due to their extensive features and reliance on reflection and classpath scanning during initialization. Olivier explored strategies within the Spring ecosystem and other frameworks to mitigate this. He highlighted Spring AOT (Ahead-of-Time Compilation) as a transformative technology that analyzes the application at build time and generates optimized code and configuration, reducing the work the JVM needs to do at runtime. This can significantly decrease startup time and memory footprint, making Spring Boot applications more suitable for resource-constrained environments and serverless deployments. Project Leyden in OpenJDK, aiming to enable static images and further AOT compilation for Java, was also discussed as a future direction for improving startup performance at the language level. Olivier demonstrated how applying these framework-specific optimizations and leveraging AOT compilation can have a dramatic impact on the startup speed of Java applications, making them competitive with applications written in languages traditionally known for faster startup.
Links:
- Olivier Bourgain: https://www.linkedin.com/in/olivier-bourgain/
- Mirakl: https://www.mirakl.com/
- Spring Boot: https://spring.io/projects/spring-boot
- OpenJDK Project Leyden: https://openjdk.org/projects/leyden/
- Devoxx France LinkedIn: https://www.linkedin.com/company/devoxx-france/
- Devoxx France Bluesky: https://bsky.app/profile/devoxx.fr
- Devoxx France Website: https://www.devoxx.fr/
[DotJs2025] Node.js Will Use All the Memory Available, and That’s OK!
In the pulsating heart of server-side JavaScript, where applications hum under relentless loads, a persistent myth endures: Node.js’s voracious appetite for RAM signals impending doom. Matteo Collina, co-founder and CTO at Platformatic, dismantled this notion at dotJS 2025, revealing how V8’s sophisticated heap stewardship—far from a liability—empowers resilient, high-throughput services. With over 15 years sculpting performant ecosystems, including Fastify’s lean framework and Pino’s swift logging, Matteo illuminated the elegance of embracing memory as a strategic asset, not an adversary. His revelation: judicious tuning transforms perceived excess into a catalyst for latency gains and stability, urging developers to recalibrate preconceptions for enterprise-grade robustness.
Matteo commenced with a ritual lament: weekly pleas from harried coders convinced their apps hemorrhage resources, only to confess manual terminations at arbitrary thresholds—no crashes, merely preempted panics. This vignette unveiled the crux: Node’s default 1.4GB cap (64-bit) isn’t a leak’s harbinger but a deliberate throttle, safeguarding against unchecked sprawl. True leaks—orphaned closures, eternal event emitters—defy GC’s mercy, accruing via retain cycles. Yet, most “leaks” masquerade as legitimate growth: caches bloating under traffic, buffers queuing async floods. Matteo advocated profiling primacy: Chrome DevTools’ heap snapshots, clinic.js’s flame charts—tools unmasking culprits sans conjecture.
Delving into V8’s bowels, Matteo traced the Orinoco collector’s cadence: minor sweeps scavenging new-space detritus, majors consolidating old-space survivors. Latency lurks in these pauses; unchecked heaps amplify them, stalling the event loop. His panacea: hoist the ceiling via --max-old-space-size=4096, bartering RAM for elongated intervals between majors. Benchmarks corroborated: a 4GB tweak on a Fastify benchmark slashed P99 latency by 8-10%, throughput surging analogously—thinner GC curves yielding smoother sails. This alchemy, Matteo posited, flips economics: memory’s abundance (cloud’s elastic reservoirs) trumps compute’s scarcity, especially as SSDs eclipse HDDs in I/O velocity.
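The tweak itself is a one-line change at launch; the script name is a placeholder, and the value is in mebibytes (4096 MiB ≈ 4 GiB):

```shell
# Raise V8's old-space ceiling, trading RAM for fewer major GC cycles:
node --max-old-space-size=4096 server.js
```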
Enterprise vignettes abounded. Platformatic’s observability suite, Pino’s zero-allocation streams—testaments to lean design—thrive sans austerity. Matteo cautioned: leaks persist, demanding vigilance—nullify globals, prune listeners, wield weak maps for caches. Yet, fear not the fullness; it’s V8’s vote of confidence in your workload’s vitality. As Kubernetes autoscalers and monitoring recipes (his forthcoming tome’s bounty) democratize, Node’s memory ethos evolves from taboo to triumph.
Demystifying Heaps and Collectors
Matteo dissected V8’s realms: new-space for ephemeral allocations, old-space for tenured stalwarts—Orinoco’s incremental majors mitigating stalls. Defaults constrain; elevations liberate, as 2025’s guides affirm: monitor via --inspect, profile with heapdump, and tune for 10% latency dividends sans leaks.
Trading Bytes for Bandwidth
Empirical edges: Fastify’s trials evince heap hikes yielding throughput boons, GC pauses pruned. Platformatic’s ethos—frictionless backends—embodies this: Pino’s streams, Fastify’s routers, all memory-savvy. Matteo’s gift: enterprise blueprints, from K8s scaling to on-prem Next.js, in his 296-page manifesto.
[NodeCongress2021] Nodejs Runtime Performance Tips – Yonatan Kra
Amidst the clamor of high-stakes deployments, where milliseconds dictate user satisfaction and fiscal prudence, refining Node.js execution emerges as a paramount pursuit. Yonatan Kra, software architect at Vonage and avid runner, recounts a pivotal incident—a customer’s frantic call amid a faltering microservice, where a lone sluggish routine ballooned latencies from instants to eternities. This anecdote catalyzes his compendium of runtime enhancements, gleaned from battle-tested optimizations.
Yonatan initiates with diagnostic imperatives: Chrome DevTools’ performance tab chronicles timelines, flagging CPU-intensive spans. A contrived endpoint—filtering arrays via nested loops—exemplifies: recorded traces reveal 2-3 second overruns, dissected via flame charts into redundant iterations. Remedies abound: hoist computations outside loops, leveraging const for immutables; Array.prototype.filter supplants bespoke sieves, slashing cycles by orders of magnitude.
Garbage collection looms large; Yonatan probes heap snapshots, unveiling undisposed allocations. An interval emitter appending to external arrays evades reclamation, manifesting as persistent blue bars—unfreed parcels. Mitigation: nullify references post-use, invoking global.gc() (exposed via the --expose-gc flag) in debug modes for verification; gray hues signal success, affirming leak abatement.
Profiling Memory and Function Bottlenecks
Memory profiling extends to production shadows: the --inspect flag attaches remote sessions, and timeline instrumentation captures allocations sans pauses. Yonatan demos: API invocations spawn allocations that remain uncollected until the array clears, transforming azure spikes to ephemeral grays. For functions, Postman sequences gauge holistically—from ingress to egress—isolating laggards for surgical tweaks.
Yonatan dispels myths: performance isn’t arcane sorcery but empirical iteration—profile relentlessly, optimize judiciously. His zeal, born of crises, equips Node.js stewards to forge nimble, leak-free realms, where clouds yield dividends and users endure no stutter.
[DevoxxFR2013] Dispelling Performance Myths in Ultra-High-Throughput Systems
Lecturer
Martin Thompson stands as a preeminent authority in high-performance and low-latency engineering, having accumulated over two decades of expertise across transactional and big-data realms spanning automotive, gaming, financial, mobile, and content management sectors. As co-founder and former CTO of LMAX, he now consults globally, championing mechanical sympathy—the harmonious alignment of software with underlying hardware—to craft elegant, high-velocity solutions. His Disruptor framework exemplifies this philosophy.
Abstract
Martin Thompson systematically dismantles entrenched performance misconceptions through rigorous empirical analysis derived from extreme low-latency environments. Spanning Java and C implementations, third-party libraries, concurrency primitives, and operating system interactions, he promulgates a “measure everything” ethos to illuminate genuine bottlenecks. The discourse dissects garbage collection behaviors, logging overheads, parsing inefficiencies, and hardware utilization, furnishing actionable methodologies to engineer systems delivering millions of operations per second at microsecond latencies.
The Primacy of Empirical Validation: Profiling as the Arbiter of Truth
Thompson underscores that anecdotal wisdom often misleads in performance engineering. Comprehensive profiling under production-representative workloads unveils counterintuitive realities, necessitating continuous measurement with tools like perf, VTune, and async-profiler.
He categorizes fallacies into language-specific, library-induced, concurrency-related, and infrastructure-oriented myths, each substantiated by real-world benchmarks.
Garbage Collection Realities: Tuning for Predictability Over Throughput
A pervasive myth asserts that garbage collection pauses are an inescapable tax, best mitigated by throughput-oriented collectors. Thompson counters that Concurrent Mark-Sweep (CMS) consistently achieves sub-10ms pauses in financial trading systems, whereas G1 frequently doubles minor collection durations due to fragmented region evacuation and reference spidering in cache structures.
Strategic heap sizing to accommodate young generation promotion, coupled with object pooling on critical paths, minimizes pause variability. Direct ByteBuffers, often touted for zero-copy I/O, incur kernel transition penalties; heap-allocated buffers prove superior for modest payloads.
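The buffer trade-off Thompson describes comes down to two allocation calls. Which wins depends on payload size and I/O pattern, so treat this as a sketch of the distinction rather than a verdict:

```java
import java.nio.ByteBuffer;

public class BufferChoice {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] inside the Java heap; cheap to
        // allocate and reclaim, a good default for modest payloads.
        ByteBuffer heap = ByteBuffer.allocate(4096);

        // Direct buffer: native memory outside the heap; can avoid a copy on
        // socket/file I/O, but allocation and release cross into the kernel
        // and native allocator, which is the penalty Thompson measures.
        ByteBuffer direct = ByteBuffer.allocateDirect(4096);

        System.out.println(heap.hasArray());   // true: backing byte[] is accessible
        System.out.println(direct.isDirect()); // true: off-heap storage
    }
}
```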
Code-Level Performance Traps: Parsing, Logging, and Allocation Patterns
Parsing dominates CPU cycles in message-driven architectures. XML and JSON deserialization routinely consumes 30-50% of processing time; binary protocols with zero-copy parsers slash this overhead dramatically.
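To make the parsing gap concrete, here is a minimal, hypothetical comparison: the same integer field read from a JSON string versus from a fixed offset in a binary message (the field layout is invented for illustration, not a real protocol):

```java
import java.nio.ByteBuffer;

public class BinaryField {
    // Text route: locate the field, slice a substring, parse digits —
    // several scans and intermediate objects per field.
    public static int fromText(String json) {
        int start = json.indexOf(':') + 1;
        int end = json.indexOf('}', start);
        return Integer.parseInt(json.substring(start, end).trim());
    }

    // Binary route: the field lives at a known offset, so reading it is a
    // single bounds-checked load with no intermediate allocation.
    public static int fromBinary(ByteBuffer msg, int offset) {
        return msg.getInt(offset);
    }

    public static void main(String[] args) {
        String json = "{\"qty\": 42}";
        ByteBuffer msg = ByteBuffer.allocate(8).putInt(4, 42);
        System.out.println(fromText(json));     // 42
        System.out.println(fromBinary(msg, 4)); // 42
    }
}
```

The binary read does constant work regardless of message size, which is where the "zero-copy parser" savings come from.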
Synchronous logging cripples latency; asynchronous, lock-free appenders built atop ring buffers sustain millions of events per second. Thompson’s Disruptor-based logger exemplifies this, outperforming traditional frameworks by orders of magnitude.
Frequent object allocation triggers premature promotions and GC pressure. Flyweight patterns, preallocation, and stack confinement eliminate heap churn on hot paths.
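A minimal sketch of the pooling idea follows; the Message type and pool shape are invented for illustration, and a production pool would need bounds and an explicit thread-safety strategy:

```java
import java.util.ArrayDeque;

// Reuse message objects on the hot path instead of allocating per event,
// so steady-state processing creates no garbage and no promotions.
public class MessagePool {
    public static final class Message {
        long sequence;
        void reset() { sequence = 0; }
    }

    private final ArrayDeque<Message> free = new ArrayDeque<>();

    public Message acquire() {
        Message m = free.poll();
        return (m != null) ? m : new Message(); // allocate only when the pool is dry
    }

    public void release(Message m) {
        m.reset();     // scrub state so stale data cannot leak between uses
        free.push(m);
    }

    public static void main(String[] args) {
        MessagePool pool = new MessagePool();
        Message a = pool.acquire();
        pool.release(a);
        Message b = pool.acquire();
        System.out.println(a == b); // true: the instance was reused, not reallocated
    }
}
```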
Concurrency Engineering: Beyond Thread Proliferation
The notion that scaling threads linearly accelerates execution collapses under context-switching and contention costs. Thompson advocates thread affinity to physical cores, aligning counts with hardware topology.
Contended locks serialize execution; lock-free algorithms leveraging compare-and-swap (CAS) preserve parallelism. False sharing—cache line ping-pong between adjacent variables—devastates throughput; 64-byte padding ensures isolation.
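The CAS retry loop at the heart of such lock-free algorithms can be sketched with java.util.concurrent.atomic; this is a simplified stand-in to show the mechanism, not Thompson's Disruptor:

```java
import java.util.concurrent.atomic.AtomicLong;

// Lock-free increment built on compare-and-swap: a losing thread retries
// rather than blocking, so no thread is ever parked holding a lock.
public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() {
        long current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current + 1)); // retry on contention
        return current + 1;
    }

    public long get() { return value.get(); }

    public static void main(String[] args) throws InterruptedException {
        CasCounter c = new CasCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int n = 0; n < 10_000; n++) c.increment(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get()); // 40000: no lost updates despite contention
    }
}
```

In practice AtomicLong.incrementAndGet() performs this loop for you; writing it out shows where retrying, rather than blocking, replaces the lock.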
Infrastructure Optimization: OS, Network, and Storage Synergy
Operating system tuning involves interrupt coalescing, huge pages to reduce TLB misses, and scheduler affinity. Network kernel bypass (e.g., Solarflare OpenOnload) shaves microseconds from round-trip times.
Storage demands asynchronous I/O and batching; fsync calls must be minimized or offloaded to dedicated threads. SSD sequential writes eclipse HDDs, but random access patterns require careful buffering.
Cultural and Methodological Shifts for Sustained Performance
Thompson exhorts engineering teams to institutionalize profiling, automate benchmarks, and challenge assumptions relentlessly. The Disruptor’s single-writer principle, mechanical sympathy, and batching yield over six million operations per second on commodity hardware.
Performance is not an afterthought but an architectural cornerstone, demanding cross-disciplinary hardware-software coherence.
[DevoxxFR2012] Optimizing Resource Utilization: A Deep Dive into JVM, OS, and Hardware Interactions
Lecturers
Ben Evans and Martijn Verburg are titans of the Java performance community. Ben, co-author of The Well-Grounded Java Developer and a Java Champion, has spent over a decade dissecting JVM internals, GC algorithms, and hardware interactions. Martijn, known as the “Diabolical Developer,” co-leads the London Java User Group, serves on the JCP Executive Committee, and advocates for developer productivity and open-source tooling. Together, they have shaped modern Java performance practices through books, tools, and conference talks that bridge the gap between application code and silicon.
Abstract
This exhaustive exploration revisits Ben Evans and Martijn Verburg’s seminal 2012 DevoxxFR presentation on JVM resource utilization, expanding it with a decade of subsequent advancements. The core thesis remains unchanged: Java’s “write once, run anywhere” philosophy comes at the cost of opacity—developers deploy applications across diverse hardware without understanding how efficiently they consume CPU, memory, power, or I/O. This article dissects the three-layer stack—JVM, Operating System, and Hardware—to reveal how Java applications interact with modern CPUs, memory hierarchies, and power management systems. Through diagnostic tools (jHiccup, SIGAR, JFR), tuning strategies (NUMA awareness, huge pages, GC selection), and cloud-era considerations (vCPU abstraction, noisy neighbors), it provides a comprehensive playbook for achieving 90%+ CPU utilization and minimal power waste. Updated for 2025, this piece incorporates ZGC’s generational mode, Project Loom’s virtual threads, ARM Graviton processors, and green computing initiatives, offering a forward-looking vision for sustainable, high-performance Java in the cloud.
The Abstraction Tax: Why Java Hides Hardware Reality
Java’s portability is its greatest strength and its most significant performance liability. The JVM abstracts away CPU architecture, memory layout, and power states to ensure identical behavior across x86, ARM, and PowerPC. But this abstraction hides critical utilization metrics:
- A Java thread may appear busy but spend 80% of its time in GC pauses or context switching.
- A 64-core server running 100 Java processes might achieve only 10% aggregate CPU utilization due to lock contention and GC thrashing.
- Power consumption in data centers—8% of U.S. electricity in 2012, projected at 13% by 2030—is driven by underutilized hardware.
Ben and Martijn argue that visibility is the prerequisite for optimization. Without knowing how resources are used, tuning is guesswork.
Layer 1: The JVM – Where Java Meets the Machine
The HotSpot JVM is a marvel of adaptive optimization, but its default settings prioritize predictability over peak efficiency.
Garbage Collection: The Silent CPU Thief
GC is the largest source of CPU waste in Java applications. Even “low-pause” collectors like CMS introduce stop-the-world phases that halt all application threads.
// Example: CMS GC log
[GC (CMS Initial Mark) 1024K->768K(2048K), 0.0123456 secs]
[Full GC (Allocation Failure) 1800K->1200K(2048K), 0.0987654 secs]
Martijn demonstrates how a 10ms pause every 100ms reduces effective CPU capacity by 10%. In 2025, ZGC and Shenandoah achieve sub-millisecond pauses even at 1TB heaps:
-XX:+UseZGC -XX:ZCollectionInterval=100
JIT Compilation and Code Cache
The JIT compiler generates machine code on-the-fly, but code cache eviction under memory pressure forces recompilation:
-XX:ReservedCodeCacheSize=512m -XX:+PrintCodeCache
Ben recommends tiered compilation (-XX:+TieredCompilation) to balance warmup and peak performance.
Threading and Virtual Threads (2025 Update)
Traditional Java threads map 1:1 to OS threads, incurring 1MB stack overhead and context switch costs. Project Loom introduces virtual threads in Java 21:
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    IntStream.range(0, 100_000).forEach(i ->
        executor.submit(() -> blockingIO()));
}
This enables millions of concurrent tasks with minimal OS overhead, saturating CPU without thread explosion.
Layer 2: The Operating System – Scheduler, Memory, and Power
The OS mediates between JVM and hardware, introducing scheduling, caching, and power management policies.
CPU Scheduling and Affinity
Linux’s CFS scheduler fairly distributes CPU time, but noisy neighbors in multi-tenant environments cause jitter. CPU affinity pins JVMs to cores:
taskset -c 0-7 java -jar app.jar
In NUMA systems, memory locality is critical:
// JNA call to sched_setaffinity
Memory Management: RSS vs. USS
Resident Set Size (RSS) includes shared libraries, inflating perceived usage. Unique Set Size (USS) is more accurate:
smem -t -k -p <pid>
Huge pages reduce TLB misses:
-XX:+UseLargePages -XX:LargePageSizeInBytes=2m
Power Management: P-States and C-States
CPUs dynamically adjust frequency (P-states) and enter sleep (C-states). Java has no direct control, but busy spinning prevents deep sleep:
-XX:+AlwaysPreTouch -XX:+UseNUMA
Layer 3: The Hardware – Cores, Caches, and Power
Modern CPUs are complex hierarchies of cores, caches, and interconnects.
Cache Coherence and False Sharing
Adjacent fields in objects can reside on the same cache line, causing false sharing:
class Counters {
    volatile long c1; // cache line 1
    volatile long c2; // same cache line!
}
Padding or @Contended (Java 8+; user classes additionally require -XX:-RestrictContended) resolves this:
@Contended
public class PaddedLong { public volatile long value; }
NUMA and Memory Bandwidth
Non-Uniform Memory Access means local memory is 2–3x faster than remote. JVMs should bind threads to NUMA nodes:
numactl --cpunodebind=0 --membind=0 java -jar app.jar
Diagnostics: Making the Invisible Visible
jHiccup: Measuring Pause Times
java -jar jHiccup.jar -i 1000 -w 5000
Generates histograms of application pauses, revealing GC and OS scheduling hiccups.
Java Flight Recorder (JFR)
-XX:StartFlightRecording=duration=60s,filename=app.jfr
Captures CPU, GC, I/O, and lock contention with <1% overhead.
async-profiler and Flame Graphs
./profiler.sh -e cpu -d 60 -f flame.svg <pid>
Visualizes hot methods and inlining decisions.
Cloud and Green Computing: The Ultimate Utilization Challenge
In cloud environments, vCPUs are abstractions—often half-cores with hyper-threading. Noisy neighbors cause 50%+ variance in performance.
Green Computing Initiatives
- Facebook’s Open Compute Project: 38% more efficient servers.
- Google’s Borg: 90%+ cluster utilization via bin packing.
- ARM Graviton3: 20% better perf/watt than x86.
Spot Markets for Compute (2025 Vision)
Ben and Martijn foresee a commodity market for compute cycles, enabled by:
- Live migration via CRIU.
- Standardized pricing (e.g., $0.001 per CPU-second).
- Java’s portability as the ideal runtime.
Conclusion: Toward a Sustainable Java Future
Evans and Verburg’s central message endures: Utilization is a systems problem. Achieving 90%+ CPU efficiency requires coordination across JVM tuning, OS configuration, and hardware awareness. In 2025, tools like ZGC, Loom, and JFR have made this more achievable than ever, but the principles remain:
- Measure everything (JFR, async-profiler).
- Tune aggressively (GC, NUMA, huge pages).
- Design for the cloud (elastic scaling, spot instances).
By making the invisible visible, Java developers can build faster, cheaper, and greener applications—ensuring Java’s dominance in the cloud-native era.