[ScalaDaysNewYork2016] Scala’s Road Ahead: Shaping the Future of a Versatile Language

Scala, a language renowned for blending functional and object-oriented programming, stands at a pivotal juncture as outlined by its creator, Martin Odersky, in his keynote at Scala Days New York 2016. Martin’s address explored Scala’s unique identity, recent developments like Scala 2.12 and the Scala Center, and the experimental Dotty compiler, offering a vision for the language’s evolution over the next five years. This talk underscored Scala’s commitment to balancing simplicity, power, and theoretical rigor while addressing community needs.

Scala’s Recent Milestones

Martin began by reflecting on Scala’s steady growth, evidenced by increasing job postings and rising Google Trends interest in Scala tutorials. The establishment of the Scala Center marks a significant milestone, providing a hub for community collaboration with support from industry leaders like Lightbend and Goldman Sachs. Additionally, Scala 2.12, slated for release in mid-2016, targets Java 8, leveraging its lambdas and default methods to produce more compact and faster code. This release, with 33 new features and contributions from 65 committers, reflects Scala’s vibrant community and commitment to progress.
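For instance, on the new Java 8 backend a Scala function literal can satisfy any Java SAM (single abstract method) interface and compiles to the same lightweight invokedynamic lambdas that Java 8 uses; a minimal sketch:

```scala
// Scala 2.12: a function literal can implement a Java SAM interface directly,
// and is compiled to an invokedynamic lambda rather than an anonymous inner
// class, yielding smaller and faster bytecode.
val task: Runnable = () => println("hello from a Scala lambda")
new Thread(task).start()
```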

The Scala Center: Fostering Community Collaboration

The Scala Center, as Martin described, serves as a steward for Scala, focusing on projects that benefit the entire community. By coordinating contributions and fostering industrial partnerships, it aims to streamline development and ensure Scala’s longevity. While Martin deferred detailed discussion to Heather Miller’s keynote, he emphasized the center’s role in unifying efforts to enhance Scala’s ecosystem, making it a cornerstone for future growth.

Dotty: A New Foundation for Scala

Central to Martin’s vision is Dotty, a new Scala compiler built on the Dependent Object Types (DOT) calculus. This theoretical foundation, proven sound after an eight-year effort, provides a robust basis for evaluating new language features. Dotty, with a leaner codebase of 45,000 lines compared to the current compiler’s 75,000, offers faster compilation and simplifies the language’s internals by encoding complex features like type parameters into a minimal subset. This approach enhances confidence in language evolution, allowing developers to experiment with new constructs without compromising stability.
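As a rough illustration of that encoding (a sketch of the idea, not Dotty’s actual desugaring), a type parameter can be recast as an abstract type member, one of the few primitives the DOT calculus reasons about:

```scala
// Parameterized form, as a programmer writes it:
trait Stack[A] {
  def push(x: A): Stack[A]
}

// Roughly equivalent form using an abstract type member, closer to the
// minimal set of constructs the DOT calculus models:
trait StackDOT { outer =>
  type Elem
  def push(x: Elem): StackDOT { type Elem = outer.Elem }
}
```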

Evolving Scala’s Libraries

Looking beyond Scala 2.12, Martin outlined plans for Scala 2.13, focusing on revamping the standard library, particularly collections. Inspired by Spark’s lazy evaluation and pair datasets, Scala aims to simplify collections while maintaining compatibility. Proposals include splitting the library into a core module, containing essentials like collections, and a platform module for additional functionalities like JSON handling. This modular approach would enable dynamic updates and broader community contributions, addressing the challenges of maintaining a monolithic library.
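The flavor of that Spark-inspired laziness can already be sampled with today’s collection views; the sketch below is illustrative only and does not use the redesigned library:

```scala
// A lazy pipeline: the map and filter below are only descriptions of work,
// much like Spark's deferred transformations, until a result is forced.
val pipeline = (1 to 1000000).view
  .map(_ * 2)
  .filter(_ % 3 == 0)

// Elements are produced only here, and only as many as are needed.
val firstTen = pipeline.take(10).toList
```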

Addressing Language Complexity

Martin acknowledged Scala’s reputation for complexity, particularly with features like implicits, which, while powerful, can lead to unexpected behavior if misused. To mitigate this, he proposed style guidelines, such as the principle of least power, encouraging developers to use the simplest constructs necessary. Additionally, he suggested enforcing rules for implicit conversions, limiting them to packages containing the source or target types to reduce surprises. These measures aim to balance Scala’s flexibility with usability, ensuring it remains approachable.
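The existing companion-object idiom already captures the spirit of that locality rule; in the illustrative sketch below (the names are hypothetical), the conversion is discovered only where its target type is expected:

```scala
import scala.language.implicitConversions

// The conversion lives with its target type, so it is found through Euros'
// implicit scope only when an Euros value is expected; no wildcard import is
// needed and no surprise conversions leak into unrelated code.
case class Euros(amount: BigDecimal)

object Euros {
  implicit def fromInt(n: Int): Euros = Euros(BigDecimal(n))
}

def charge(price: Euros): Unit = println(s"charging ${price.amount} EUR")

charge(42)  // Int => Euros is applied via the companion object
```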

Future Innovations: Simplifying and Strengthening Scala

Martin’s vision for Scala includes several forward-looking features. Implicit function types will reduce boilerplate by abstracting over implicit parameters, while effect systems will treat side effects like exceptions as capabilities, enhancing type safety. Nullable types, modeled as union types, address Scala’s null-related issues, aligning it with modern languages like Kotlin. Generic programming improvements, inspired by libraries like Shapeless, aim to eliminate tuple limitations, and better records will support data engines like Spark. These innovations, grounded in Dotty’s foundations, promise a more robust and intuitive Scala.
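Two of these ideas can be sketched in the syntax that later shipped with Scala 3; the notation evolved after this talk, so treat the spelling as indicative rather than what Dotty accepted in 2016:

```scala
case class Config(verbose: Boolean)

// An implicit function type: the Config parameter is passed implicitly,
// so it does not have to be threaded through every signature by hand.
type Configured[T] = Config ?=> T

def log(msg: String): Configured[Unit] =
  if summon[Config].verbose then println(msg)

// Nullability expressed as a union type instead of a pervasive null:
def describe(name: String | Null): String =
  if name == null then "anonymous" else name.trim

@main def demo(): Unit =
  given Config = Config(verbose = true)
  log("starting up")
  println(describe(null))
```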

Links:

[ScalaDaysNewYork2016] Spark 2.0: Evolving Big Data Processing with Structured APIs

Apache Spark, a cornerstone in big data processing, has significantly shaped the landscape of distributed computing with its functional programming paradigm rooted in Scala. In a keynote address at Scala Days New York 2016, Matei Zaharia, the creator of Spark, elucidated the evolution of Spark’s APIs, culminating in the transformative release of Spark 2.0. This presentation highlighted how Spark has progressed from its initial vision of a unified engine to a more sophisticated platform with structured APIs like DataFrames and Datasets, enabling enhanced performance and usability for developers worldwide.

The Genesis of Spark’s Vision

Spark was conceived with two primary ambitions: to create a unified engine capable of handling diverse big data workloads and to offer a concise, language-integrated API that mirrors working with local data collections. Matei explained that unlike the earlier MapReduce model, which was groundbreaking yet limited, Spark extended its capabilities to support iterative computations, streaming, and interactive data exploration. This unification was critical, as prior to Spark, developers often juggled multiple specialized systems, each with its own complexities, making integration cumbersome. By leveraging Scala’s functional constructs, Spark introduced Resilient Distributed Datasets (RDDs), allowing developers to perform operations like map, filter, and join with ease, abstracting the complexities of distributed computing.
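A small sketch of that collection-like feel (paths, field layout, and the local master setting are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

// RDD transformations read like methods on a local Scala collection,
// yet each one describes a distributed, lazily evaluated computation.
val errors = sc.textFile("hdfs://logs/app.log")          // placeholder path
  .filter(_.contains("ERROR"))
  .map(line => (line.split("\t")(0), line))              // key by first field

val owners = sc.textFile("hdfs://logs/owners.tsv")       // placeholder path
  .map(line => (line.split("\t")(0), line.split("\t")(1)))

// join works like joining two keyed collections, but runs across the cluster
errors.join(owners).take(5).foreach(println)
```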

The success of this vision is evident in Spark’s widespread adoption. With over a thousand organizations deploying it, including on clusters as large as 8,000 nodes, Spark has become the most active open-source big data project. Its libraries for SQL, streaming, machine learning, and graph processing have been embraced, with 75% of surveyed organizations using multiple components, demonstrating the power of its unified approach.

Challenges with the Functional API

Despite its strengths, the original RDD-based API presented challenges, particularly in optimization and efficiency. Matei highlighted that the functional API, while intuitive, conceals the semantics of computations, making it difficult for the engine to optimize operations automatically. For instance, operations like groupByKey can lead to inefficient memory usage, as they materialize large intermediate datasets unnecessarily. This issue is exemplified in a word count example where groupByKey creates a sequence of values before summing them, consuming excessive memory when a simpler reduceByKey could suffice.
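The contrast Matei drew looks roughly like the following sketch, with the input abstracted as an RDD[String]:

```scala
import org.apache.spark.rdd.RDD

def wordCounts(lines: RDD[String]): (RDD[(String, Int)], RDD[(String, Int)]) = {
  val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

  // groupByKey materializes an Iterable of 1s per word before summing,
  // shuffling and buffering far more data than needed:
  val slow = pairs.groupByKey().map { case (word, ones) => (word, ones.sum) }

  // reduceByKey combines partial counts on each node before the shuffle,
  // so only one running sum per word crosses the network:
  val fast = pairs.reduceByKey(_ + _)

  (slow, fast)
}
```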

Moreover, the reliance on Java objects for data storage introduces significant memory overhead. Matei illustrated this with a user class example, where headers, pointers, and padding consume roughly two-thirds of the allocated memory, a critical concern for in-memory computing frameworks like Spark. These challenges underscored the need for a more structured approach to data processing.

Introducing Structured APIs: DataFrames and Datasets

To address these limitations, Spark introduced DataFrames and Datasets, structured APIs built atop the Spark SQL engine. These APIs impose a defined schema on data, enabling the engine to understand and optimize computations more effectively. DataFrames, dynamically typed, resemble tables in a relational database, supporting operations like filtering and aggregation through a domain-specific language (DSL). Datasets, statically typed, extend this concept by aligning closely with Scala’s type system, allowing developers to work with case classes for type safety.
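A brief sketch of the two flavors (class, column, and application names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

case class User(name: String, state: String, age: Int)

val spark = SparkSession.builder().appName("structured").master("local[*]").getOrCreate()
import spark.implicits._

// Statically typed: the compiler checks that `state` exists on User.
val users = Seq(User("Ann", "CA", 34), User("Bo", "NY", 28)).toDS()
val californians = users.filter(_.state == "CA")

// Dynamically typed: the same data viewed as a DataFrame (Dataset[Row]),
// queried through column expressions resolved at runtime.
val usersDF = users.toDF()
val adults  = usersDF.filter($"age" >= 21)
```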

Matei demonstrated how DataFrames enable declarative programming, where operations are expressed as logical plans that Spark optimizes before execution. For example, filtering users by state generates an abstract syntax tree, allowing Spark to optimize the query plan rather than executing operations eagerly. This declarative nature, inspired by data science tools like Pandas, distinguishes Spark’s DataFrames from similar APIs in R and Python, enhancing performance through lazy evaluation and optimization.
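That laziness can be made visible by inspecting the plan before running anything; a small illustrative sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plans").master("local[*]").getOrCreate()
import spark.implicits._

val users = Seq(("Ann", "CA"), ("Bo", "NY")).toDF("name", "state")

// filter and select only build a logical plan; no data moves yet.
val caUsers = users.filter($"state" === "CA").select($"name")

caUsers.explain(true)  // shows the parsed, analyzed, optimized and physical plans
caUsers.show()         // the query is executed only when an action is called
```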

Optimizing Performance with Project Tungsten

A significant focus of Spark 2.0 is Project Tungsten, which addresses the shifting bottlenecks in big data systems. Matei noted that while I/O was the primary constraint in 2010, advancements in storage (SSDs) and networking (10-40 gigabit) have shifted the focus to CPU efficiency. Tungsten employs three strategies: runtime code generation, cache locality exploitation, and off-heap memory management. By encoding data in a compact binary format, Spark reduces memory overhead compared to Java objects. Code generation, facilitated by the Catalyst optimizer, produces specialized bytecode that operates directly on binary data, improving CPU performance. These optimizations ensure Spark can leverage modern hardware trends, delivering significant performance gains.
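One visible surface of this machinery is the Encoder, which describes how a JVM class maps onto Spark’s compact binary row layout; a small sketch (the case class is hypothetical):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

case class User(name: String, age: Int)

// The encoder captures the schema that Tungsten's generated code and
// off-heap row format operate on, instead of boxed Java objects.
val userEncoder: Encoder[User] = Encoders.product[User]
println(userEncoder.schema)   // the StructType describing the binary layout
println(userEncoder.clsTag)   // the JVM class the binary rows decode back to
```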

Structured Streaming: A Unified Approach to Real-Time Processing

Spark 2.0 introduces structured streaming, a high-level API that extends the benefits of DataFrames and Datasets to streaming computations. Matei emphasized that real-world streaming applications often involve batch and interactive workloads, such as updating a database for a web application or applying a machine learning model. Structured streaming treats streams as infinite DataFrames, allowing developers to use familiar APIs to define computations. The engine then incrementally executes these plans, maintaining state and handling late data efficiently. For instance, a batch job grouping data by user ID can be adapted to streaming by changing the input source, with Spark automatically updating results as new data arrives.
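A minimal sketch (paths, the schema source, and the userId column are placeholders): the aggregation is written exactly as it would be for a static DataFrame, and only the source and sink differ:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming").master("local[*]").getOrCreate()

// File-based streams need an explicit schema; here it is borrowed from a
// static sample file (placeholder paths).
val schema = spark.read.json("data/sample.json").schema

val events = spark.readStream.schema(schema).json("data/incoming/")

// The same groupBy/count used in a batch job, now over an unbounded input.
val countsByUser = events.groupBy("userId").count()

val query = countsByUser.writeStream
  .outputMode("complete")   // keep the full, continuously updated aggregate
  .format("console")
  .start()

query.awaitTermination()
```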

This approach simplifies the development of continuous applications, enabling seamless integration of streaming, batch, and interactive processing within a single API, a capability that sets Spark apart from other streaming engines.

Future Directions and Community Engagement

Looking ahead, Matei outlined Spark’s commitment to evolving its APIs while maintaining compatibility. The structured APIs will serve as the foundation for new libraries, facilitating interoperability across languages like Python and R. Additionally, Spark’s data source API allows applications to seamlessly switch between storage systems like Hive, Cassandra, or JSON, enhancing flexibility. Matei also encouraged community participation, noting that Databricks offers a free Community Edition with tutorials to help developers explore Spark’s capabilities.
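Switching backends looks roughly like the sketch below; the paths and table names are placeholders, and the Cassandra read assumes the spark-cassandra-connector is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sources")
  .master("local[*]")
  .enableHiveSupport()   // needed for the Hive table read below
  .getOrCreate()

// The same DataFrame-based code can be pointed at different backends by
// changing only how the data is read.
val fromJson = spark.read.format("json").load("data/users.json")   // placeholder path

val fromHive = spark.sql("SELECT * FROM users")                    // assumes a Hive table `users`

val fromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")                        // spark-cassandra-connector
  .options(Map("keyspace" -> "app", "table" -> "users"))
  .load()
```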

Links: