Posts Tagged ‘LowLatency’

[SpringIO2024] Mind the Gap: Connecting High-Performance Systems at a Leading Crypto Exchange @ Spring I/O 2024

At Spring I/O 2024, Marcos Maia and Lars Werkman from Bitvavo, Europe’s leading cryptocurrency exchange, unveiled the architectural intricacies of their high-performance trading platform. Based in the Netherlands, Bitvavo processes thousands of transactions per second with sub-millisecond latency. Marcos and Lars detailed how they integrate ultra-low-latency systems with Spring Boot applications, offering a deep dive into their strategies for scalability and performance. Their talk, rich with technical insights, challenged conventional software practices, urging developers to rethink performance optimization.

Architecting for Ultra-Low Latency

Marcos opened by highlighting Bitvavo’s mission to enable seamless crypto trading for nearly two million customers. The exchange’s hot path, where orders are processed, demands microsecond response times. To achieve this, Bitvavo employs the Aeron framework, an open-source tool designed for high-performance messaging. By using memory-mapped files, UDP-based communication, and lock-free algorithms, the platform minimizes latency. Marcos explained how they bypass traditional databases, opting for in-memory processing with eventual disk synchronization, ensuring deterministic outcomes critical for trading fairness.
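
To make the hot-path style concrete, here is a minimal sketch of what publishing an order event over Aeron can look like. The channel, stream ID, and message layout are illustrative assumptions, not Bitvavo's actual configuration, and a running Aeron media driver is assumed.

```java
import io.aeron.Aeron;
import io.aeron.Publication;
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;

public final class OrderPublisher {
    // Illustrative channel and stream id -- not Bitvavo's real configuration.
    private static final String CHANNEL = "aeron:udp?endpoint=localhost:40123";
    private static final int STREAM_ID = 1001;

    public static void main(String[] args) {
        // Assumes an Aeron media driver is already running on this machine.
        try (Aeron aeron = Aeron.connect(new Aeron.Context());
             Publication pub = aeron.addPublication(CHANNEL, STREAM_ID)) {

            // Preallocated once and reused: no per-message allocation, so the
            // hot path produces no garbage for the collector to chase.
            UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(64));
            buffer.putLong(0, 42L);        // hypothetical order id
            buffer.putLong(8, 1_000_000L); // hypothetical price in minor units

            // offer() never blocks; a negative result signals back pressure,
            // so the caller spins (or applies its own retry policy).
            while (pub.offer(buffer, 0, 16) < 0) {
                Thread.onSpinWait();
            }
        }
    }
}
```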

Optimizing the Hot Path

The hot path’s design is uncompromising, as Marcos elaborated. Bitvavo avoids garbage collection by preallocating and reusing objects, ensuring predictable memory usage. Single-threaded processing, counterintuitive to many, leverages CPU caches for nanosecond-level performance. The platform uses distributed state machines, guaranteeing consistent outputs across executions. Lars complemented this by discussing inter-process communication via shared memory and DPDK for kernel-bypassing network operations. These techniques, rooted in decades of trading system expertise, enable Bitvavo to handle peak loads of 30,000 transactions per second.
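
A minimal sketch of the preallocate-and-reuse discipline Marcos described follows; the OrderEvent type and pool sizing are hypothetical. Because the hot path is single-threaded, the pool needs no synchronization.

```java
import java.util.ArrayDeque;

// All OrderEvent instances are created up front, so the steady state
// allocates nothing and the garbage collector has nothing to do.
public final class OrderEventPool {
    public static final class OrderEvent {
        long orderId;
        long price;
        long quantity;

        void reset() { orderId = 0; price = 0; quantity = 0; }
    }

    private final ArrayDeque<OrderEvent> free = new ArrayDeque<>();

    public OrderEventPool(int size) {
        for (int i = 0; i < size; i++) {
            free.push(new OrderEvent());
        }
    }

    public OrderEvent acquire() {
        // Failing fast on exhaustion keeps behavior deterministic instead of
        // silently allocating and disturbing latency.
        OrderEvent event = free.poll();
        if (event == null) {
            throw new IllegalStateException("pool exhausted");
        }
        return event;
    }

    public void release(OrderEvent event) {
        event.reset();
        free.push(event);
    }
}
```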

Bridging with Spring Boot

Integrating high-performance systems with the broader organization poses significant challenges. Marcos detailed the “cold sink,” a Spring Boot application that consumes data from the hot path’s Aeron archive, feeding it into Kafka and MySQL for downstream processing. By batching requests and using object pools, the cold sink minimizes garbage collection, maintaining performance under heavy loads. Fine-tuning batch sizes and applying backpressure ensure the system keeps pace with the hot path’s output, preventing data lags in Bitvavo’s 24/7 operations.
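
The sketch below shows the general shape such a cold-sink loop could take, under stated assumptions: the ArchiveReader wrapper is hypothetical, and the batch size and record types are placeholders rather than Bitvavo's code.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.ArrayList;
import java.util.List;

// Drain events from the hot path's archive in batches, forward them to
// Kafka, and apply backpressure by not polling for more input until the
// current batch has been flushed.
public final class ColdSink {
    private static final int MAX_BATCH = 500; // tuning knob, found by measurement

    private final KafkaProducer<Long, byte[]> producer;
    private final ArchiveReader reader; // hypothetical wrapper over the Aeron archive

    ColdSink(KafkaProducer<Long, byte[]> producer, ArchiveReader reader) {
        this.producer = producer;
        this.reader = reader;
    }

    void run() {
        List<ProducerRecord<Long, byte[]>> batch = new ArrayList<>(MAX_BATCH);
        while (true) {
            batch.clear();
            reader.poll(batch, MAX_BATCH); // fills up to MAX_BATCH records
            for (ProducerRecord<Long, byte[]> record : batch) {
                producer.send(record);
            }
            producer.flush(); // backpressure: read no more until Kafka accepted the batch
        }
    }

    interface ArchiveReader {
        void poll(List<ProducerRecord<Long, byte[]>> sink, int limit);
    }
}
```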

Enhancing JWT Signing Performance

Lars concluded with a case study on optimizing JWT token signing, a “warm path” process targeting sub-millisecond latency. Initially, their RSA-based signing took 8.8 milliseconds, far from the goal. By switching to symmetric HMAC signing and adopting Azul Prime’s JVM, they achieved a 30x performance boost, reaching 260-280 microsecond response times. Lars emphasized the importance of benchmarking with JMH and leveraging Azul’s features like Falcon JIT compiler for stable throughput. This optimization underscores Bitvavo’s commitment to performance across all system layers.
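
In JDK terms, the switch from RSA to symmetric signing boils down to something like the following sketch; the key handling and token layout are simplified placeholders.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Minimal HMAC-SHA256 signing of a JWT-style token using only the JDK.
// A symmetric MAC avoids RSA's expensive modular exponentiation, which is
// the heart of the speedup described above. Mac instances are not
// thread-safe, so a real service keeps one per thread.
public final class HmacSigner {
    private final Mac mac;

    public HmacSigner(byte[] secretKey) throws Exception {
        mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
    }

    /** headerAndPayload is the already-encoded "base64url(header).base64url(payload)". */
    public String sign(String headerAndPayload) {
        byte[] sig = mac.doFinal(headerAndPayload.getBytes(StandardCharsets.US_ASCII));
        return headerAndPayload + '.' + Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
    }
}
```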


[DevoxxFR2013] Dispelling Performance Myths in Ultra-High-Throughput Systems

Lecturer

Martin Thompson stands as a preeminent authority in high-performance and low-latency engineering, having accumulated over two decades of expertise across transactional and big-data realms spanning automotive, gaming, financial, mobile, and content management sectors. As co-founder and former CTO of LMAX, he now consults globally, championing mechanical sympathy—the harmonious alignment of software with underlying hardware—to craft elegant, high-velocity solutions. His Disruptor framework exemplifies this philosophy.

Abstract

Martin Thompson systematically dismantles entrenched performance misconceptions through rigorous empirical analysis derived from extreme low-latency environments. Spanning Java and C implementations, third-party libraries, concurrency primitives, and operating system interactions, he promulgates a “measure everything” ethos to illuminate genuine bottlenecks. The discourse dissects garbage collection behaviors, logging overheads, parsing inefficiencies, and hardware utilization, furnishing actionable methodologies to engineer systems delivering millions of operations per second at microsecond latencies.

The Primacy of Empirical Validation: Profiling as the Arbiter of Truth

Thompson underscores that anecdotal wisdom often misleads in performance engineering. Comprehensive profiling under production-representative workloads unveils counterintuitive realities, necessitating continuous measurement with tools like perf, VTune, and async-profiler.

He categorizes fallacies into language-specific, library-induced, concurrency-related, and infrastructure-oriented myths, each substantiated by real-world benchmarks.
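
A minimal JMH harness in this "measure everything" spirit might look as follows; the measured method is a stand-in for whatever code path is under suspicion.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import java.util.concurrent.TimeUnit;

// Skeleton JMH benchmark: measure average latency of a candidate code path
// instead of guessing. Returning the result prevents dead-code elimination.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ParseBenchmark {
    private final String input = "12345678";

    @Benchmark
    public int parseInt() {
        return Integer.parseInt(input); // placeholder for the real workload
    }
}
```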

Garbage Collection Realities: Tuning for Predictability Over Throughput

A pervasive myth asserts that garbage collection pauses are an inescapable tax, best mitigated by throughput-oriented collectors. Thompson counters that Concurrent Mark-Sweep (CMS) consistently achieves sub-10ms pauses in financial trading systems, whereas G1 frequently doubles minor collection durations due to fragmented region evacuation and reference spidering in cache structures.

Strategically sizing the young generation so that short-lived objects die there rather than being promoted, coupled with object pooling on critical paths, minimizes pause variability. Direct ByteBuffers, often touted for zero-copy I/O, incur kernel transition penalties; heap-allocated buffers prove superior for modest payloads.
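
As a rough illustration of that buffer strategy, assuming a modest fixed-size payload:

```java
import java.nio.ByteBuffer;

// A heap buffer, allocated once and reused, keeps modest payloads entirely
// in Java heap memory and avoids both per-message allocation and the
// native-memory bookkeeping of direct buffers.
public final class ReusableEncoder {
    private final ByteBuffer buffer = ByteBuffer.allocate(4096); // heap, not direct

    public ByteBuffer encode(long id, long value) {
        buffer.clear();          // reuse, never reallocate
        buffer.putLong(id);
        buffer.putLong(value);
        buffer.flip();           // ready to be drained by the caller
        return buffer;
    }
}
```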

Code-Level Performance Traps: Parsing, Logging, and Allocation Patterns

Parsing dominates CPU cycles in message-driven architectures. XML and JSON deserialization routinely consumes 30-50% of processing time; binary protocols with zero-copy parsers slash this overhead dramatically.
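
The following sketch shows what a zero-copy flyweight decoder over a fixed binary layout can look like; the field offsets are invented for illustration.

```java
import java.nio.ByteBuffer;

// Fields are read directly from the incoming buffer at known offsets, so
// nothing is copied and no intermediate objects are created -- the property
// that lets binary protocols undercut XML/JSON parsing so dramatically.
public final class TradeFlyweight {
    private static final int ORDER_ID_OFFSET = 0;   // 8 bytes
    private static final int PRICE_OFFSET = 8;      // 8 bytes
    private static final int QUANTITY_OFFSET = 16;  // 4 bytes

    private ByteBuffer buffer;
    private int offset;

    // One flyweight instance is reused across messages: wrap() repositions it.
    public TradeFlyweight wrap(ByteBuffer buffer, int offset) {
        this.buffer = buffer;
        this.offset = offset;
        return this;
    }

    public long orderId()  { return buffer.getLong(offset + ORDER_ID_OFFSET); }
    public long price()    { return buffer.getLong(offset + PRICE_OFFSET); }
    public int quantity()  { return buffer.getInt(offset + QUANTITY_OFFSET); }
}
```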

Synchronous logging cripples latency; asynchronous, lock-free appenders built atop ring buffers sustain millions of events per second. Thompson’s Disruptor-based logger exemplifies this, outperforming traditional frameworks by orders of magnitude.
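
A stripped-down single-producer/single-consumer ring buffer conveys the idea; real implementations such as the Disruptor add cache-line padding, wait strategies, and batching.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// The producing thread never blocks on I/O; a dedicated consumer thread
// drains entries and performs the actual writes.
public final class AsyncLogRing {
    private static final int SIZE = 1024;            // power of two for cheap masking
    private static final int MASK = SIZE - 1;

    private final String[] entries = new String[SIZE];
    private final AtomicLong head = new AtomicLong(); // next slot to write
    private final AtomicLong tail = new AtomicLong(); // next slot to read

    /** Producer side: returns false instead of blocking when the ring is full. */
    public boolean offer(String message) {
        long h = head.get();
        if (h - tail.get() >= SIZE) {
            return false;                             // full: caller decides (drop, spin, ...)
        }
        entries[(int) (h & MASK)] = message;
        head.lazySet(h + 1);                          // publish with a cheap ordered write
        return true;
    }

    /** Consumer side: runs on a dedicated logging thread. */
    public void drainTo(Consumer<String> sink) {
        long t = tail.get();
        while (t < head.get()) {
            sink.accept(entries[(int) (t & MASK)]);
            t++;
        }
        tail.lazySet(t);
    }
}
```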

Frequent object allocation triggers premature promotions and GC pressure. Flyweight patterns, preallocation, and stack confinement eliminate heap churn on hot paths.

Concurrency Engineering: Beyond Thread Proliferation

The notion that scaling threads linearly accelerates execution collapses under context-switching and contention costs. Thompson advocates thread affinity to physical cores, aligning counts with hardware topology.

Contended locks serialize execution; lock-free algorithms leveraging compare-and-swap (CAS) preserve parallelism. False sharing—cache line ping-pong between adjacent variables—devastates throughput; 64-byte padding ensures isolation.
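
Both remedies can be sketched together: a CAS-based counter whose hot field is isolated by Disruptor-style padding classes. The names and layout here are illustrative; superclass fields are laid out before subclass fields, which is why the padding is split across a small hierarchy.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Padding classes keep the hot field at least 56 bytes away from any
// neighbor on each side, so it cannot share a 64-byte cache line.
class LeftPad { @SuppressWarnings("unused") long p1, p2, p3, p4, p5, p6, p7; }

class HotValue extends LeftPad { volatile long value; }

public final class PaddedCounter extends HotValue {
    @SuppressWarnings("unused") long q1, q2, q3, q4, q5, q6, q7;

    private static final AtomicLongFieldUpdater<HotValue> UPDATER =
            AtomicLongFieldUpdater.newUpdater(HotValue.class, "value");

    // Lock-free increment via compare-and-swap: threads never block each other.
    public long increment() {
        return UPDATER.incrementAndGet(this);
    }
}
```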

Infrastructure Optimization: OS, Network, and Storage Synergy

Operating system tuning involves interrupt coalescing, huge pages to reduce TLB misses, and scheduler affinity. Network kernel bypass (e.g., Solarflare OpenOnload) shaves microseconds from round-trip times.

Storage demands asynchronous I/O and batching; fsync calls must be minimized or offloaded to dedicated threads. SSD sequential writes eclipse HDDs, but random access patterns require careful buffering.
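
One common shape for "offload fsync to a dedicated thread", sketched with standard NIO; the 1 ms flush interval is an arbitrary durability-versus-latency choice made purely for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Writers append to the channel freely (page cache only), while a dedicated
// thread forces the file to disk at a fixed interval, so no write on the hot
// path ever pays the synchronous flush cost.
public final class BatchedJournal implements AutoCloseable {
    private final FileChannel channel;
    private final ScheduledExecutorService syncer = Executors.newSingleThreadScheduledExecutor();

    public BatchedJournal(Path file) throws Exception {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        syncer.scheduleAtFixedRate(this::force, 1, 1, TimeUnit.MILLISECONDS);
    }

    public void append(ByteBuffer data) throws Exception {
        channel.write(data);            // cheap: lands in the page cache
    }

    private void force() {
        try {
            channel.force(false);       // the only place that pays the flush
        } catch (Exception e) {
            e.printStackTrace();        // a real journal would surface this upstream
        }
    }

    @Override
    public void close() throws Exception {
        syncer.shutdown();
        channel.force(false);
        channel.close();
    }
}
```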

Cultural and Methodological Shifts for Sustained Performance

Thompson exhorts engineering teams to institutionalize profiling, automate benchmarks, and challenge assumptions relentlessly. The Disruptor’s single-writer principle, mechanical sympathy, and batching yield over six million operations per second on commodity hardware.
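
For reference, minimal Disruptor wiring with a single writer looks roughly like this; the event type, ring size, and handler are illustrative.

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public final class DisruptorExample {
    static final class ValueEvent { long value; }

    public static void main(String[] args) {
        // Events are preallocated up front; the ring size must be a power of two.
        Disruptor<ValueEvent> disruptor = new Disruptor<>(
                ValueEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        disruptor.handleEventsWith((EventHandler<ValueEvent>)
                (event, sequence, endOfBatch) -> System.out.println("consumed " + event.value));

        RingBuffer<ValueEvent> ring = disruptor.start();

        // Single-writer principle: one thread claims a slot, mutates the
        // preallocated event in place (no allocation), then publishes.
        long seq = ring.next();
        try {
            ring.get(seq).value = 42L;
        } finally {
            ring.publish(seq);
        }

        disruptor.shutdown(); // drains outstanding events before returning
    }
}
```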

Performance is not an afterthought but an architectural cornerstone, demanding cross-disciplinary hardware-software coherence.
