Posts Tagged ‘ApacheSpark’

[DevoxxBE2023] How Sand and Java Create the World’s Most Powerful Chips

Johan Janssen, an architect at ASML, captivated the DevoxxBE2023 audience with a deep dive into the intricate process of chip manufacturing and the role of Java in optimizing it. Johan, a seasoned speaker and JavaOne Rock Star, explained how ASML’s advanced lithography machines, powered by Java-based software, enable the creation of cutting-edge computer chips used in devices worldwide.

From Sand to Silicon Wafers

Johan began by demystifying chip production, starting with silica sand, an abundant resource transformed into silicon ingots and sliced into wafers. These wafers, approximately 30 cm in diameter, serve as the foundation for chips, hosting up to 600 chips per wafer or thousands for smaller sensors. He passed around a wafer adorned with Java’s mascot, Duke, illustrating the physical substrate of modern electronics.

The process involves printing multiple layers—up to 200—onto wafers using extreme ultraviolet (EUV) lithography machines. These machines, requiring four Boeing 747s for transport, achieve precision at the nanometer scale, with transistors as small as three nanometers. Johan likened this to driving a car 300 km and retracing the path with only 2 mm deviation, highlighting the extraordinary accuracy required.

The Role of EUV Lithography

Johan detailed the EUV lithography process, where tin droplets are hit by a 40-kilowatt laser to generate plasma at sun-like temperatures, producing EUV light. This light, directed by ultra-flat mirrors, patterns wafers through reticles costing €250,000 each. The process demands cleanroom environments, as even a single dust particle can ruin a chip, and involves continuous calibration to maintain precision across thousands of parameters.

ASML’s machines, some over 30 years old, remain in use for producing sensors and less advanced chips, demonstrating their longevity. Johan also previewed future advancements, such as high numerical aperture (NA) machines, which will enable even smaller transistors, further enhancing chip performance and energy efficiency.

Java-Powered Analytics Platform

At the heart of Johan’s talk was ASML’s Java-based analytics platform, which processes 31 terabytes of data weekly to optimize chip production. Built on Apache Spark, the platform distributes computations across worker nodes, supporting plugins for data ingestion, UI customization, and processing. These plugins allow departments to integrate diverse data types, from images to raw measurements, and support languages like Julia and C alongside Java.
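Johan did not show platform code, but a minimal sketch can illustrate the kind of Spark job such a platform distributes across its worker nodes. Everything here is an assumption for illustration: the schema (waferId, layer, offsetNm), the Parquet paths, and the aggregation itself are invented, not ASML’s actual pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WaferMetrics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wafer-metrics")
      .getOrCreate()

    // Assumed input: Parquet files with columns (waferId, layer, offsetNm).
    val measurements = spark.read.parquet("/data/measurements")

    // The aggregation runs in parallel across the cluster's worker nodes.
    val perWafer = measurements
      .groupBy("waferId", "layer")
      .agg(avg("offsetNm").as("meanOffsetNm"),
           max("offsetNm").as("maxOffsetNm"))

    perWafer.write.mode("overwrite").parquet("/data/wafer-summary")
    spark.stop()
  }
}
```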

The platform, running on-premise to protect sensitive data, consolidates previously disparate applications, improving efficiency and user experience. Johan highlighted a machine learning use case where the platform increased defect detection from 70% to 92% without slowing production, showcasing Java’s role in handling complex computations.

Challenges and Solutions in Chip Manufacturing

Johan discussed challenges like layer misalignment, which can cause short circuits or defective chips. The platform addresses these by analyzing wafer plots to identify correctable errors, such as adjusting subsequent layers to compensate for misalignments. Non-correctable errors may result in downgrading chips (e.g., from 16 GB to 8 GB RAM), ensuring minimal waste.
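The talk showed wafer plots rather than code, but the decision logic can be sketched in plain Scala. The thresholds, field names, and three-way outcome below are invented purely to illustrate the correctable-versus-downgrade distinction.

```scala
// Hypothetical illustration of the correctable-vs-downgrade decision;
// thresholds and field names are invented for the sketch.
case class WaferOverlay(waferId: String, layer: Int, offsetNm: Double)

sealed trait Disposition
case class Correct(adjustmentNm: Double) extends Disposition // compensate in the next layer
case object Downgrade extends Disposition                    // e.g. sell as a lower-spec part
case object Scrap extends Disposition

def disposition(m: WaferOverlay): Disposition =
  if (math.abs(m.offsetNm) <= 2.0) Correct(-m.offsetNm)
  else if (math.abs(m.offsetNm) <= 10.0) Downgrade
  else Scrap
```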

He emphasized a pragmatic approach to tool selection, starting with REST endpoints and gradually adopting Kafka for streaming data as needs evolved. Johan also noted ASML’s collaboration with tool maintainers to enhance compatibility, such as improving Spark’s progress tracking for customer feedback.
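As an illustration of the Kafka stage of that evolution, here is a hedged sketch of one common way to consume a Kafka topic from Spark using Structured Streaming. The broker address, topic name, output paths, and the presence of the spark-sql-kafka connector on the classpath are all assumptions, not details from the talk.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-ingest").getOrCreate()

// Subscribe to a (hypothetical) topic of machine measurements.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "machine-measurements")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Persist the raw stream for downstream processing plugins.
val query = stream.writeStream
  .format("parquet")
  .option("path", "/data/ingested")
  .option("checkpointLocation", "/data/checkpoints/ingest")
  .start()
```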

Future of Chip Manufacturing

Looking ahead, Johan highlighted the industry’s push to diversify chip production beyond Taiwan, driven by geopolitical and economic factors. However, building new factories, or “fabs,” each costing $10–20 billion, faces challenges such as equipment backlogs and the need for highly skilled operators. ASML’s customer support teams, working alongside clients like Intel, underscore the specialized knowledge required.

Johan concluded by stressing the importance of a forward-looking mindset, with ASML’s roadmap prioritizing innovation over rigid methodologies. This approach, combined with Java’s robustness, ensures the platform’s scalability and adaptability in a rapidly evolving industry.

Links:

[ScalaDaysNewYork2016] Connecting Reactive Applications with Fast Data Using Reactive Streams

The rapid evolution of data processing demands systems that can handle real-time information efficiently. At Scala Days New York 2016, Luc Bourlier, a software engineer at Lightbend, delivered an insightful presentation on integrating reactive applications with fast data architectures using Apache Spark and Reactive Streams. Luc demonstrated how Spark Streaming, enhanced with backpressure support in Spark 1.5, enables seamless connectivity between reactive systems and real-time data processing, ensuring responsiveness under varying workloads.

Understanding Fast Data

Luc began by defining fast data as the application of big data tools and algorithms to streaming data, enabling near-instantaneous insights. Unlike traditional big data, which processes stored datasets, fast data focuses on analyzing data as it arrives. Luc illustrated this with a scenario where a business initially runs batch jobs to analyze historical data but soon requires daily, hourly, or even real-time updates to stay competitive. This shift from batch to streaming processing underscores the need for systems that can adapt to dynamic data inflows, a core principle of fast data architectures.

Spark Streaming and Backpressure

Central to Luc’s presentation was Spark Streaming, an extension of Apache Spark designed for real-time data processing. Spark Streaming processes data in mini-batches, allowing it to leverage Spark’s in-memory computation capabilities, a significant advancement over Hadoop’s disk-based MapReduce model. Luc highlighted the introduction of backpressure in Spark 1.5, a feature developed by his team at Lightbend. Backpressure dynamically adjusts the data ingestion rate based on processing capacity, preventing system overload. By analyzing the number of records processed and the time taken in each mini-batch, Spark computes an optimal ingestion rate, ensuring stability even under high data volumes.
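Backpressure is exposed as configuration rather than code. A minimal sketch of enabling it in a Spark Streaming application follows; the application name, batch interval, and the optional receiver rate cap are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable the rate estimator introduced in Spark 1.5 so the ingestion rate
// adapts to how fast each mini-batch is actually processed.
val conf = new SparkConf()
  .setAppName("backpressure-demo")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.receiver.maxRate", "10000") // optional hard cap, records/sec per receiver

val ssc = new StreamingContext(conf, Seconds(1))
```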

Reactive Streams Integration

To connect reactive applications with Spark Streaming, Luc introduced Reactive Streams, a standard set of Java interfaces (Publisher, Subscriber, Subscription, and Processor) for asynchronous stream processing with non-blocking backpressure. These interfaces allow a reactive application, such as one generating random numbers for a Pi computation demo, to feed data into Spark Streaming without overwhelming the system. Luc demonstrated this integration using a Raspberry Pi cluster, showcasing how backpressure keeps the system stable by throttling the data producer when processing lags. This approach maintains responsiveness, a key tenet of reactive systems, by aligning data production with consumption capacity.
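Luc’s integration code is not reproduced here, but a minimal sketch of the Reactive Streams contract shows how demand signalling gives the consumer control: the subscriber asks for elements with request(n), so the producer can never outrun it. The batch size and the processing callback below are arbitrary choices for the example.

```scala
import org.reactivestreams.{Subscriber, Subscription}

// A subscriber that processes one element at a time and only requests more
// once the previous element has been handled.
class ThrottledSubscriber[T](process: T => Unit) extends Subscriber[T] {
  private var subscription: Subscription = _

  override def onSubscribe(s: Subscription): Unit = {
    subscription = s
    s.request(100)            // initial demand
  }

  override def onNext(element: T): Unit = {
    process(element)
    subscription.request(1)   // signal demand for one more element
  }

  override def onError(t: Throwable): Unit = t.printStackTrace()
  override def onComplete(): Unit = println("stream completed")
}
```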

Practical Demonstration and Challenges

Luc’s live demo vividly illustrated the integration process. He presented a dashboard displaying a reactive application computing Pi approximations, with Spark analyzing the generated data in real time. Initially, the system handled 1,000 elements per second efficiently, but as the rate increased to 4,000, processing delays emerged without backpressure, causing data to accumulate in memory. By enabling backpressure, Luc showed how Spark adjusted the ingestion rate, maintaining processing times around one second and preventing system failure. He noted challenges, such as the need to handle variable-sized records, but emphasized that backpressure significantly enhances system reliability.

Future Enhancements

Looking forward, Luc discussed ongoing improvements to Spark’s backpressure mechanism, including better handling of aggregated records and potential integration with Reactive Streams for enhanced pluggability. He encouraged developers to explore Reactive Streams at reactivestreams.org, noting its inclusion in Java 9’s concurrent package. These advancements aim to further streamline the connection between reactive applications and fast data systems, making real-time processing more accessible and robust.

Links:

[ScalaDaysNewYork2016] Spark 2.0: Evolving Big Data Processing with Structured APIs

Apache Spark, a cornerstone in big data processing, has significantly shaped the landscape of distributed computing with its functional programming paradigm rooted in Scala. In a keynote address at Scala Days New York 2016, Matei Zaharia, the creator of Spark, elucidated the evolution of Spark’s APIs, culminating in the transformative release of Spark 2.0. This presentation highlighted how Spark has progressed from its initial vision of a unified engine to a more sophisticated platform with structured APIs like DataFrames and Datasets, enabling enhanced performance and usability for developers worldwide.

The Genesis of Spark’s Vision

Spark was conceived with two primary ambitions: to create a unified engine capable of handling diverse big data workloads and to offer a concise, language-integrated API that mirrors working with local data collections. Matei explained that unlike the earlier MapReduce model, which was groundbreaking yet limited, Spark extended its capabilities to support iterative computations, streaming, and interactive data exploration. This unification was critical, as prior to Spark, developers often juggled multiple specialized systems, each with its own complexities, making integration cumbersome. By leveraging Scala’s functional constructs, Spark introduced Resilient Distributed Datasets (RDDs), allowing developers to perform operations like map, filter, and join with ease, abstracting the complexities of distributed computing.
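To illustrate that collection-like feel, here is a small RDD sketch; the log path and the choice of filtering for "ERROR" lines are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// RDD operations read like methods on a local Scala collection.
val sc = new SparkContext(new SparkConf().setAppName("rdd-basics"))

val lines  = sc.textFile("hdfs:///data/events.log")
val errors = lines.filter(_.contains("ERROR"))   // lazy transformation
val counts = errors
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)                            // executed only when an action runs

counts.take(10).foreach(println)
```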

The success of this vision is evident in Spark’s widespread adoption. With over a thousand organizations deploying it, including on clusters as large as 8,000 nodes, Spark has become the most active open-source big data project. Its libraries for SQL, streaming, machine learning, and graph processing have been embraced, with 75% of surveyed organizations using multiple components, demonstrating the power of its unified approach.

Challenges with the Functional API

Despite its strengths, the original RDD-based API presented challenges, particularly in optimization and efficiency. Matei highlighted that the functional API, while intuitive, conceals the semantics of computations, making it difficult for the engine to optimize operations automatically. For instance, operations like groupByKey can lead to inefficient memory usage, as they materialize large intermediate datasets unnecessarily. This issue is exemplified in a word count example where groupByKey creates a sequence of values before summing them, consuming excessive memory when a simpler reduceByKey could suffice.
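The contrast Matei described can be sketched directly with the RDD API; the input path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
val words = sc.textFile("hdfs:///data/docs").flatMap(_.split("\\s+"))

// groupByKey materializes every occurrence of each word before summing,
// producing large intermediate collections and a heavy shuffle.
val slow = words.map(word => (word, 1)).groupByKey().mapValues(_.sum)

// reduceByKey combines partial counts on each node first, so only the
// running sums cross the network.
val fast = words.map(word => (word, 1)).reduceByKey(_ + _)
```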

Moreover, the reliance on Java objects for data storage introduces significant memory overhead. Matei illustrated this with a user class example, where headers, pointers, and padding consume roughly two-thirds of the allocated memory, a critical concern for in-memory computing frameworks like Spark. These challenges underscored the need for a more structured approach to data processing.

Introducing Structured APIs: DataFrames and Datasets

To address these limitations, Spark introduced DataFrames and Datasets, structured APIs built atop the Spark SQL engine. These APIs impose a defined schema on data, enabling the engine to understand and optimize computations more effectively. DataFrames, dynamically typed, resemble tables in a relational database, supporting operations like filtering and aggregation through a domain-specific language (DSL). Datasets, statically typed, extend this concept by aligning closely with Scala’s type system, allowing developers to work with case classes for type safety.

Matei demonstrated how DataFrames enable declarative programming, where operations are expressed as logical plans that Spark optimizes before execution. For example, filtering users by state generates an abstract syntax tree, allowing Spark to optimize the query plan rather than executing operations eagerly. This declarative nature, inspired by data science tools like Pandas, distinguishes Spark’s DataFrames from similar APIs in R and Python, enhancing performance through lazy evaluation and optimization.
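A brief sketch of the two structured APIs, following the users-by-state example; the data and session setup are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-apis").getOrCreate()
import spark.implicits._

// Hypothetical user data with an explicit schema via a case class.
case class User(name: String, state: String, age: Int)
val users = Seq(User("Ann", "NY", 34), User("Bob", "CA", 28)).toDS()

// DataFrame DSL: builds a logical plan that Catalyst optimizes lazily.
val newYorkers = users.toDF().filter($"state" === "NY").select("name")

// Dataset API: the same query with compile-time types from the case class.
val newYorkersTyped = users.filter(_.state == "NY").map(_.name)

newYorkers.explain()   // prints the optimized plan instead of executing eagerly
```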

Optimizing Performance with Project Tungsten

A significant focus of Spark 2.0 is Project Tungsten, which addresses the shifting bottlenecks in big data systems. Matei noted that while I/O was the primary constraint in 2010, advancements in storage (SSDs) and networking (10-40 gigabit) have shifted the focus to CPU efficiency. Tungsten employs three strategies: runtime code generation, cache locality exploitation, and off-heap memory management. By encoding data in a compact binary format, Spark reduces memory overhead compared to Java objects. Code generation, facilitated by the Catalyst optimizer, produces specialized bytecode that operates directly on binary data, improving CPU performance. These optimizations ensure Spark can leverage modern hardware trends, delivering significant performance gains.
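The talk did not walk through Tungsten internals in code, but one hedged way to see the optimizations at work is to inspect a query’s physical plan; the query below is arbitrary.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tungsten-peek").getOrCreate()

// In Spark 2.x, stages compiled by whole-stage code generation are marked in
// the physical plan (WholeStageCodegen); the generated code operates directly
// on Tungsten's compact binary row format instead of Java objects.
val df = spark.range(0, 1000000).selectExpr("id % 10 AS k", "id AS v")
df.groupBy("k").sum("v").explain()
```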

Structured Streaming: A Unified Approach to Real-Time Processing

Spark 2.0 introduces structured streaming, a high-level API that extends the benefits of DataFrames and Datasets to streaming computations. Matei emphasized that real-world streaming applications often involve batch and interactive workloads, such as updating a database for a web application or applying a machine learning model. Structured streaming treats streams as infinite DataFrames, allowing developers to use familiar APIs to define computations. The engine then incrementally executes these plans, maintaining state and handling late data efficiently. For instance, a batch job grouping data by user ID can be adapted to streaming by changing the input source, with Spark automatically updating results as new data arrives.
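A hedged sketch of that batch-to-streaming transition follows; the paths, the userId column, and the console sink are placeholders rather than the demo Matei showed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming").getOrCreate()

// Batch version: count events per user from a directory of JSON files.
val batch = spark.read.json("/data/events")
val batchCounts = batch.groupBy("userId").count()

// Streaming version: only the source changes; the same aggregation now runs
// incrementally, with Spark maintaining running counts as new files arrive.
val streamCounts = spark.readStream
  .schema(batch.schema)           // streaming file sources need an explicit schema
  .json("/data/events")
  .groupBy("userId")
  .count()

val query = streamCounts.writeStream
  .outputMode("complete")         // emit the full updated result on each trigger
  .format("console")
  .start()
```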

This approach simplifies the development of continuous applications, enabling seamless integration of streaming, batch, and interactive processing within a single API, a capability that sets Spark apart from other streaming engines.

Future Directions and Community Engagement

Looking ahead, Matei outlined Spark’s commitment to evolving its APIs while maintaining compatibility. The structured APIs will serve as the foundation for new libraries, facilitating interoperability across languages like Python and R. Additionally, Spark’s data source API allows applications to seamlessly switch between storage systems like Hive, Cassandra, or JSON, enhancing flexibility. Matei also encouraged community participation, noting that Databricks offers a free Community Edition with tutorials to help developers explore Spark’s capabilities.
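A small illustration of that source-swapping flexibility: the paths and table names below are invented, and the Cassandra line assumes the spark-cassandra-connector package is available, so treat this as a sketch rather than a drop-in recipe.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("data-sources").getOrCreate()

// The same application logic can read from different stores by switching the source.
val fromJson    = spark.read.json("/data/users.json")
val fromParquet = spark.read.parquet("/warehouse/users.parquet")
val fromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "prod", "table" -> "users"))
  .load()
```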

Links: