Archive for the ‘en-US’ Category
[DotSecurity2017] Secure Software Development Lifecycle
Embedding security into software development without sacrificing delivery speed is a craft in its own right. Jim Manico, founder of Manicode Security and a former OWASP leader, made that case at dotSecurity 2017, offering practical frameworks for securing the software development lifecycle (SDLC) from inception to iteration. Hawaii-based and a veteran of secure-coding training, with a career that runs from Siena to Edgescan, Jim brings an empirical edge to his advice, turning tedious precepts into tactics teams can actually apply and arguing that early engagement is what keeps the cost of security down.
Jim walked through the SDLC's stations: analysis (rigorous requirements, a taxonomy of threats), design (architecture reviews, data flow diagrams), coding (checklists, a ledger of approved libraries), testing (static analysis, dynamic drills), and operations (monitoring and incident response). Whether a team runs agile sprints or a waterfall, the phases persist; analysis may take a month or a minute, and testing ranges from quick triage to full telemetry. Jim also jabbed at process for its own sake: heavyweight documentation is worth little without practicality, checklists beat compendiums, and triage beats drowning developers in a torrent of findings.
Requirements come first: OWASP's taxonomy of risks (broken access control, injection and the like) is a ready-made starting point, and the requirements written at this stage later feed bug bounty scoping. In design, Jim recommended threat modeling (the STRIDE categories, from spoofing to tampering) and data flow diagrams that fortify flows and pin down endpoints. For coding, the Manicode guidance is familiar: interrogate input (validation and sanitization), protect output (contextual encoding), and keep an inventory of third-party libraries (npm audit, Snyk). Testing comes in tiers: static analysis (SonarQube, Coverity, with rules pruned so findings stay relevant) and dynamic analysis (DAST and IAST). Operations closes the loop with logging and alerting on anomalies and a patching patrol for newly disclosed vulnerabilities.
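To make the coding guidance concrete, here is a minimal, illustrative sketch (not from the talk) of allow-list input validation and contextual output encoding; it assumes the OWASP Java Encoder library (org.owasp.encoder) is on the classpath, and the class and field names are invented for the example.
[java]import java.util.regex.Pattern;
import org.owasp.encoder.Encode; // OWASP Java Encoder library (assumed dependency)

public class CommentHandler {

    // Allow-list validation: accept only the characters we expect.
    private static final Pattern USERNAME = Pattern.compile("^[A-Za-z0-9_]{3,32}$");

    public static String validateUsername(String input) {
        if (input == null || !USERNAME.matcher(input).matches()) {
            throw new IllegalArgumentException("Invalid username");
        }
        return input;
    }

    // Contextual output encoding: encode for the HTML context at the point of output.
    public static String renderComment(String comment) {
        return "<p>" + Encode.forHtml(comment) + "</p>";
    }
}[/java]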
Jim's central warning: defects found late are expensive, removing them early economizes, and ruthless triage tempers the toil. His prescription for static analysis is pragmatic: treat analyzers as part of the compile, refine the rule set, and deploy the tooling through devops so developers receive curated findings rather than a deluge.
SDLC’s Stations and Security’s Scaffold
Jim mapped the milestones: analysis and requirements, design and data flow diagrams, coding checklists, tiered testing, and an operations phase of monitoring and incident response.
Tenets’ Triumph and Tools’ Temperance
OWASP's guidance and threat taxonomies on one side, static and dynamic analysis on the other. Jim's takeaway: act early, triage aggressively, and prefer checklists over compendiums.
Links:
[ScalaDaysNewYork2016] The Zen of Akka: Mastering Asynchronous Design
At Scala Days New York 2016, Konrad Malawski, a key member of the Akka team at Lightbend, delivered a profound exploration of the principles guiding the effective use of Akka, a toolkit for building concurrent and distributed systems. Konrad’s presentation, inspired by the philosophical lens of “The Tao of Programming,” offered practical insights into designing applications with Akka, emphasizing the shift from synchronous to asynchronous paradigms to achieve robust, scalable architectures.
Embracing the Messaging Paradigm
Konrad Malawski began by underscoring the centrality of messaging in Akka’s actor model. Drawing from Alan Kay’s vision of object-oriented programming, Konrad explained that actors encapsulate state and communicate solely through messages, mirroring real-world computing interactions. This approach fosters loose coupling, both spatially and temporally, allowing components to operate independently. A single actor, Konrad noted, is limited in utility, but when multiple actors collaborate—such as delegating tasks to specialized actors like a “yellow specialist”—powerful patterns like worker pools and sharding emerge. These patterns enable efficient workload distribution, aligning perfectly with the distributed nature of modern systems.
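As a rough illustration of that messaging model, the sketch below uses Akka's classic Java API, since this archive's existing snippets are Java; the Worker and MessagingDemo names are invented, and the talk itself presented the ideas in Scala.
[java]import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class MessagingDemo {

    // A specialist actor: it holds its own state and reacts only to messages.
    public static class Worker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(String.class, task ->
                            getSender().tell("done: " + task, getSelf()))
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        ActorRef worker = system.actorOf(Props.create(Worker.class), "worker");
        // Fire-and-forget messaging: no shared state, no blocking call.
        worker.tell("index the catalog", ActorRef.noSender());
    }
}[/java]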
Structuring Actor Systems for Clarity
A common pitfall for newcomers to Akka, Konrad observed, is creating unstructured systems with actors communicating chaotically. To counter this, he advocated for hierarchical actor systems using context.actorOf to spawn child actors, ensuring a clear supervisory structure. This hierarchy not only organizes actors but also enhances fault tolerance through supervision, where parent actors manage failures of their children. Konrad cautioned against actor selection—directly addressing actors by path—as it leads to brittle designs akin to “stealing a TV from a stranger’s house.” Instead, actors should be introduced through proper references, fostering maintainable and predictable interactions.
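Continuing the same hypothetical example, a parent actor can spawn and supervise its own children through its context instead of looking up strangers by path; again this is an illustrative sketch in Akka's Java API, not code from the talk.
[java]import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Props;

// A parent that spawns and supervises its child rather than addressing
// foreign actors by path (no actorSelection involved).
public class Manager extends AbstractActor {

    public static class Worker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .matchAny(msg -> getSender().tell("done", getSelf()))
                    .build();
        }
    }

    private ActorRef child;

    @Override
    public void preStart() {
        // Children created through the context sit under this actor in the
        // hierarchy, so their failures are handled by this actor's supervision.
        child = getContext().actorOf(Props.create(Worker.class), "worker-1");
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                // Hand work to the child while preserving the original sender.
                .match(String.class, task -> child.forward(task, getContext()))
                .build();
    }
}[/java]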
Balancing Power and Constraints
Konrad emphasized the philosophy of “constraints liberate, liberties constrain,” a principle echoed across Scala conferences. Akka actors, being highly flexible, can perform a wide range of tasks, but this power can overwhelm developers. He contrasted actors with more constrained abstractions like futures, which handle single values, and Akka Streams, which enforce a static data flow. These constraints enable optimizations, such as transparent backpressure in streams, which are harder to implement in the dynamic actor model. However, actors excel in distributed settings, where messaging simplifies scaling across nodes, making Akka a versatile choice for complex systems.
Community and Future Directions
Konrad highlighted the vibrant Akka community, encouraging contributions through platforms like GitHub and Gitter. He noted ongoing developments, such as Akka Typed, an experimental API that enhances type safety in actor interactions. By sharing resources like the Reactive Streams TCK and community-driven initiatives, Konrad underscored Lightbend’s commitment to evolving Akka collaboratively. His call to action was clear: engage with the community, experiment with new features, and contribute to shaping Akka’s future, ensuring it remains a cornerstone of reactive programming.
Links:
[DevoxxUS2017] 55 New Features in JDK 9: A Comprehensive Overview
At DevoxxUS2017, Simon Ritter, Deputy CTO at Azul Systems, delivered a detailed exploration of the 55 new features in JDK 9, with a particular focus on modularity through Project Jigsaw. Simon, a veteran Java evangelist, provided a whirlwind tour of the enhancements, categorizing them into features, standards, JVM internals, specialized updates, and housekeeping changes. His presentation equipped developers with the knowledge to leverage JDK 9’s advancements effectively. This post examines the key themes of Simon’s talk, highlighting how these features enhance Java’s flexibility, performance, and maintainability.
Modularity and Project Jigsaw
The cornerstone of JDK 9 is Project Jigsaw, which introduces modularity to the Java platform. Simon explained that the traditional rt.jar file, containing over 4,500 classes, has been replaced with 94 modular components in the jmods directory. This restructuring encapsulates private APIs, such as sun.misc.Unsafe, to improve security and maintainability, though it poses compatibility challenges for libraries relying on these APIs. To mitigate this, Simon highlighted options like the --add-exports and --add-opens flags, as well as a “big kill switch” (--permit-illegal-access) to disable modularity for legacy applications. The jlink tool further enhances modularity by creating custom runtimes with only the necessary modules, optimizing deployment for specific applications.
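As an illustration of the moving parts Simon described, here is a hypothetical module declaration, with related command-line invocations shown as comments; module and package names are invented, and exact flag spellings may vary between JDK 9 builds.
[java]// src/com.example.app/module-info.java  (module and package names are hypothetical)
module com.example.app {
    requires java.sql;           // depend only on the modules actually used
    exports com.example.app.api; // everything else stays encapsulated
}

// Building a trimmed runtime containing only the required modules (shell command):
//   jlink --module-path $JAVA_HOME/jmods:mods --add-modules com.example.app --output app-runtime
//
// Escape hatch for legacy code that reaches into encapsulated internals:
//   java --add-opens java.base/java.lang=ALL-UNNAMED -jar legacy.jar[/java]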
Enhanced APIs and Developer Productivity
JDK 9 introduces several API improvements to streamline development. Simon showcased factory methods for collections, allowing developers to create immutable collections with concise syntax, such as List.of() or Set.of(). The Streams API has been enhanced with methods like takeWhile, dropWhile, and ofNullable, improving expressiveness in data processing. Additionally, the introduction of jshell, an interactive REPL, enables rapid prototyping and experimentation. These enhancements reduce boilerplate code and enhance developer productivity, making Java more intuitive and efficient for modern application development.
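A small, self-contained example of these JDK 9 additions (not taken from the talk):
[java]import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

public class Jdk9ApiDemo {
    public static void main(String[] args) {
        // Immutable collections without builder boilerplate
        List<String> langs = List.of("Java", "Scala", "Kotlin");
        Set<Integer> primes = Set.of(2, 3, 5, 7);

        // takeWhile stops at the first element that fails the predicate
        Stream.of(1, 2, 3, 10, 4)
              .takeWhile(n -> n < 5)      // 1, 2, 3
              .forEach(System.out::println);

        // ofNullable yields an empty stream for null instead of throwing
        long count = Stream.ofNullable(null).count(); // 0

        System.out.println(langs + " " + primes + " " + count);
    }
}[/java]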
JVM Internals and Performance
Simon delved into JVM enhancements, including improvements to the G1 garbage collector, which is now the default in JDK 9. The G1 collector offers better performance for large heaps, addressing limitations of the Concurrent Mark Sweep collector. Other internal improvements include a new process API for accessing operating system process details and a directive file for controlling JIT compiler behavior. These changes enhance runtime efficiency and provide developers with greater control over JVM performance, ensuring Java remains competitive for high-performance applications.
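For instance, the new process API can be exercised in a few lines; this is an illustrative snippet rather than Simon's demo code:
[java]import java.time.Duration;

public class ProcessApiDemo {
    public static void main(String[] args) {
        ProcessHandle self = ProcessHandle.current();
        System.out.println("pid: " + self.pid());

        // Metadata about the running process, exposed as Optionals
        ProcessHandle.Info info = self.info();
        info.command().ifPresent(cmd -> System.out.println("command: " + cmd));
        info.totalCpuDuration()
            .map(Duration::toMillis)
            .ifPresent(ms -> System.out.println("cpu ms: " + ms));

        // Enumerate every process visible to this JVM
        System.out.println("visible processes: " + ProcessHandle.allProcesses().count());
    }
}[/java]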
Housekeeping and Deprecations
JDK 9 includes significant housekeeping changes to streamline the platform. Simon highlighted the new version string format, adopting semantic versioning (major.minor.security.patch) for clearer identification. The directory structure has been flattened, eliminating the JRE subdirectory and tools.jar, with configuration files centralized in the conf directory. Deprecated APIs, such as the applet API and certain garbage collection options, have been removed to reduce maintenance overhead. These changes simplify the JDK’s structure, improving maintainability while requiring developers to test applications for compatibility.
Standards and Specialized Features
Simon also covered updates to standards and specialized features. The HTTP/2 client, introduced as an incubator module, allows developers to test and provide feedback before it becomes standard. Other standards updates include support for Unicode 8.0 and the deprecation of SHA-1 certificates for enhanced security. Specialized features, such as the annotations pipeline and parser API, improve the handling of complex annotations and programmatic interactions with the compiler. These updates ensure Java aligns with modern standards while offering flexibility for specialized use cases.
Links:
[ScalaDaysNewYork2016] Monitoring Reactive Applications: New Approaches for a New Paradigm
Reactive applications, built on event-driven and asynchronous foundations, require innovative monitoring strategies. At Scala Days New York 2016, Duncan DeVore and Henrik Engström, both from Lightbend, explored the challenges and solutions for monitoring such systems. They discussed how traditional monitoring falls short for reactive architectures and introduced Lightbend’s approach to addressing these challenges, emphasizing adaptability and precision in observing distributed systems.
The Shift from Traditional Monitoring
Duncan and Henrik began by outlining the limitations of traditional monitoring, which relies on stack traces in synchronous systems to diagnose issues. In reactive applications, built with frameworks like Akka and Play, the asynchronous, message-driven nature disrupts this model. Stack traces lose relevance, as actors communicate without a direct call stack. The speakers categorized monitoring into business process, functional, and technical types, highlighting the need to track metrics like actor counts, message flows, and system performance in distributed environments.
The Impact of Distributed Systems
The rise of the internet and cloud computing has transformed system design, as Duncan explained. Distributed computing, pioneered by initiatives like ARPANET, and the economic advantages of cloud platforms have enabled businesses to scale rapidly. However, this shift introduces complexities, such as network partitions and variable workloads, necessitating new monitoring approaches. Henrik noted that reactive systems, designed for scalability and resilience, require tools that can handle dynamic data flows and provide insights into system behavior without relying on traditional metrics.
Challenges in Monitoring Reactive Systems
Henrik detailed the difficulties of monitoring asynchronous systems, where data flows through push or pull models. In push-based systems, monitoring tools must handle high data volumes, risking overload, while pull-based systems allow selective querying for efficiency. The speakers emphasized anomaly detection over static thresholds, as thresholds are hard to calibrate and may miss nuanced issues. Anomaly detection, exemplified by tools like Prometheus, identifies unusual patterns by correlating metrics, reducing false alerts and enhancing system understanding.
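To illustrate the difference between a static threshold and anomaly detection, here is a toy, self-contained sketch (in no way Lightbend's implementation) that flags a metric value when it strays too far from its own recent history:
[java]import java.util.ArrayDeque;
import java.util.Deque;

// Toy idea only: flag values that deviate from recent history
// instead of comparing against a fixed threshold.
public class AnomalyDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;
    private final double tolerance; // how many standard deviations we accept

    public AnomalyDetector(int size, double tolerance) {
        this.size = size;
        this.tolerance = tolerance;
    }

    public boolean isAnomalous(double value) {
        if (window.size() < size) {
            window.addLast(value);
            return false; // not enough history yet
        }
        double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double variance = window.stream()
                .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
        double stdDev = Math.sqrt(variance);
        boolean anomaly = stdDev > 0 && Math.abs(value - mean) > tolerance * stdDev;
        window.removeFirst();
        window.addLast(value);
        return anomaly;
    }
}[/java]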
Lightbend’s Monitoring Solution
Duncan and Henrik introduced Lightbend Monitoring, a subscription-based tool tailored for reactive applications. It integrates with Akka actors and Lagom circuit breakers, generating metrics and traces for backends like StatsD and Telegraf. The solution supports pull-based monitoring, allowing selective data collection to manage high data volumes. Future enhancements include support for distributed tracing, Prometheus integration, and improved Lagom compatibility, aiming to provide a comprehensive view of system health and performance.
Links:
[DotSecurity2017] Post-Quantum Cryptography
Quantum computing is both marvel and menace: once machines grow large enough, they will unravel the classical safeguards the web relies on today. Tanja Lange, a pioneering cryptographer and chair of the Coding Theory and Cryptology group at Eindhoven University of Technology, confronted that prospect at dotSecurity 2017, making the case for encryption resilient to quantum attack. With a career spent at the intersection of mathematics and machine security, Tanja dissected the vulnerabilities of contemporary ciphers, RSA's reliance on the hardness of factoring and ECC's reliance on elliptic-curve discrete logarithms, and presented lattice-based and code-based constructions as the leading post-quantum candidates. The discussion is not abstract: it charts a course for protecting secrets sown today from harvests reaped by adversaries armed with tomorrow's arithmetic.
Tanja began with cryptography's ubiquity: the browser's lock icon signals TLS, which today rests on RSA or Diffie-Hellman, whose strength in turn rests on problems presumed intractable for classical computers. Shor's algorithm ends that tranquility, factoring integers and computing discrete logarithms in polynomial time, while Grover's algorithm roughly halves the effective strength of symmetric keys, reducing AES-256 to 128-bit equivalence. The peril is also retroactive: under a "harvest now, decrypt later" strategy, state actors are already stockpiling encrypted streams to decrypt once quantum machines arrive. Tanja tallied the timeline: Google's Sycamore supremacy result in 2019, IBM's roadmap to more than 1,000 qubits by 2023, and a horizon around 2025 for machines capable of cracking 2048-bit RSA in hours.
The post-quantum pantheon rests on problems believed to resist quantum attack: lattices and learning with errors (LWE), multivariate quadratic systems, and hash-based constructions. Tanja walked through LWE, where secrets are hidden behind vectors perturbed by noise; without the secret trapdoor, decoding such noisy instances is believed intractable, while structured shortcuts would give the game away. McEliece, the cornerstone of code-based cryptography since 1978, endures: Goppa codes supply the generator matrices, encryption amounts to adding errors to a codeword, and only the holder of the secret structure can decode. She also charted the standardization sprint: NIST's call in 2016, the 2022 selection of Kyber (a lattice-based key encapsulation scheme) alongside the Dilithium signature scheme, with the third round refining the finalists' resilience.
Challenges remain: key sizes balloon (roughly 1 KB for a Kyber public key, megabytes for McEliece), and signatures sprawl, though optimizations and hybrid schemes that pair classical and post-quantum algorithms soften the transition. Tanja tempered the alarm: current cryptography does not vanish overnight, and migration proceeds piece by piece, from messaging protocols such as Signal to certificate chains. Her horizon is steady proliferation, from Chrome's 2024 deployments to IETF interoperability work, ensuring that what is encrypted now stays protected against the quantum machines to come.
Quantum’s Quandary and Classical Cracks
Tanja traced the threats: Shor's algorithm shatters RSA, Grover's algorithm gnaws at symmetric keys, and harvested ciphertext waits for a sufficient quorum of qubits. ECC fares no better: elliptic-curve discrete logarithms and Diffie-Hellman exchanges fall to the same attack.
Lattice Locks and Code Crypts
LWE hides secrets behind noise and trapdoors; McEliece relies on Goppa-code generator matrices. NIST's selections, Kyber for key exchange and Dilithium for signatures, arrive alongside hybrid deployments and ongoing work to curb key sizes.
Migration’s Mandate and Horizons
Tanja's timeline runs from protocol upgrades such as Signal's to certificate migration, Chrome's deployments, and IETF standardization, with the promise of secrets that endure and entanglement evaded.
Links:
[ScalaDaysNewYork2016] Lightbend Lagom: Crafting Microservices with Precision
Microservices have become a cornerstone of modern software architecture, yet their complexity often poses challenges. At Scala Days New York 2016, Mirco Dotta, a software engineer at Lightbend, introduced Lagom, an open-source framework designed to simplify the creation of reactive microservices. Mirco showcased how Lagom, meaning “just right” in Swedish, balances developer productivity with adherence to reactive principles, offering a seamless experience from development to production.
The Philosophy of Lagom
Mirco emphasized that Lagom prioritizes appropriately sized services over the “micro” aspect of microservices. By focusing on clear boundaries and isolation, Lagom ensures services are neither too small nor overly complex, aligning with the Swedish concept of sufficiency. Built on Play Framework and Akka, Lagom is inherently asynchronous and non-blocking, promoting scalability and resilience. Mirco highlighted its opinionated approach, which standardizes service structures to enhance consistency across teams, allowing developers to focus on domain logic rather than infrastructure.
Development Environment Efficiency
Lagom’s development environment, inspired by Play Framework, is a standout feature. Mirco demonstrated this with a sample application called Cheerer, a Twitter-like service. Using a single SBT command, runAll, developers can launch all services, including an embedded Cassandra server, service locator, and gateway, within one JVM. The environment supports hot reloading, automatically recompiling and restarting services upon code changes. This streamlined setup, consistent across different machines, frees developers from managing complex scripts, enhancing productivity and collaboration.
Service and Persistence APIs
Lagom’s service API is defined through a descriptor method, specifying endpoints and metadata for inter-service communication. Mirco showcased a “Hello World” service, illustrating how services expose endpoints that other services can call, facilitated by the service locator. For persistence, Lagom defaults to Cassandra, leveraging its scalability and resilience, but allows flexibility for other data stores. Mirco advocated for event sourcing and CQRS (Command Query Responsibility Segregation), noting their suitability for microservices. These patterns enable immutable event logs and optimized read views, simplifying data management and scalability.
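For flavor, the canonical shape of a Lagom service descriptor looks roughly like the following; this is a sketch based on Lagom's Java API of that era, with service and path names invented, not code from Mirco's demo:
[java]import akka.NotUsed;
import com.lightbend.lagom.javadsl.api.Descriptor;
import com.lightbend.lagom.javadsl.api.Service;
import com.lightbend.lagom.javadsl.api.ServiceCall;

import static com.lightbend.lagom.javadsl.api.Service.named;
import static com.lightbend.lagom.javadsl.api.Service.pathCall;

// The descriptor declares the service's name and endpoints; other services
// resolve it through the service locator rather than hard-coded URLs.
public interface HelloService extends Service {

    ServiceCall<NotUsed, String> hello(String id);

    @Override
    default Descriptor descriptor() {
        return named("hello").withCalls(
                pathCall("/api/hello/:id", this::hello)
        );
    }
}[/java]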
Production-Ready Features
Transitioning to production is seamless with Lagom, as Mirco demonstrated through its integration with SBT Native Packager, supporting formats like Docker images and RPMs. Lightbend Conductor, available for free in development, simplifies orchestration, offering features like rolling upgrades and circuit breakers for fault tolerance. Mirco highlighted ongoing work to support other orchestration tools like Kubernetes, encouraging community contributions to expand Lagom’s ecosystem. Circuit breakers and monitoring capabilities further ensure service reliability in production environments.
Links:
[ScalaDaysNewYork2016] Connecting Reactive Applications with Fast Data Using Reactive Streams
The rapid evolution of data processing demands systems that can handle real-time information efficiently. At Scala Days New York 2016, Luc Bourlier, a software engineer at Lightbend, delivered an insightful presentation on integrating reactive applications with fast data architectures using Apache Spark and Reactive Streams. Luc demonstrated how Spark Streaming, enhanced with backpressure support in Spark 1.5, enables seamless connectivity between reactive systems and real-time data processing, ensuring responsiveness under varying workloads.
Understanding Fast Data
Luc began by defining fast data as the application of big data tools and algorithms to streaming data, enabling near-instantaneous insights. Unlike traditional big data, which processes stored datasets, fast data focuses on analyzing data as it arrives. Luc illustrated this with a scenario where a business initially runs batch jobs to analyze historical data but soon requires daily, hourly, or even real-time updates to stay competitive. This shift from batch to streaming processing underscores the need for systems that can adapt to dynamic data inflows, a core principle of fast data architectures.
Spark Streaming and Backpressure
Central to Luc’s presentation was Spark Streaming, an extension of Apache Spark designed for real-time data processing. Spark Streaming processes data in mini-batches, allowing it to leverage Spark’s in-memory computation capabilities, a significant advancement over Hadoop’s disk-based MapReduce model. Luc highlighted the introduction of backpressure in Spark 1.5, a feature developed by his team at Lightbend. Backpressure dynamically adjusts the data ingestion rate based on processing capacity, preventing system overload. By analyzing the number of records processed and the time taken in each mini-batch, Spark computes an optimal ingestion rate, ensuring stability even under high data volumes.
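A minimal sketch of enabling that behavior (not Luc's demo code): the spark.streaming.backpressure.enabled setting asks Spark to derive the ingestion rate from recent batch statistics. The source and application names below are invented.
[java]import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BackpressureDemo {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("backpressure-demo")
                .setMaster("local[2]")
                // Let Spark derive the ingestion rate from recent batch timings
                .set("spark.streaming.backpressure.enabled", "true");

        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // One-second mini-batches over a text stream; print the records per batch
        ssc.socketTextStream("localhost", 9999)
           .count()
           .print();

        ssc.start();
        ssc.awaitTermination();
    }
}[/java]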
Reactive Streams Integration
To connect reactive applications with Spark Streaming, Luc introduced Reactive Streams, a set of Java interfaces designed to facilitate communication between systems with backpressure support. These interfaces allow a reactive application, such as one generating random numbers for a Pi computation demo, to feed data into Spark Streaming without overwhelming the system. Luc demonstrated this integration using a Raspberry Pi cluster, showcasing how backpressure ensures the system remains stable by throttling the data producer when processing lags. This approach maintains responsiveness, a key tenet of reactive systems, by aligning data production with consumption capabilities.
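The Reactive Streams contract itself is just a handful of small interfaces. The sketch below shows a hypothetical subscriber that only requests what it can handle, which is how backpressure propagates back to the producer; it is an illustration, not the integration code from the demo.
[java]import org.reactivestreams.Subscriber;
import org.reactivestreams.Subscription;

// A subscriber that signals demand explicitly: the producer may never push
// more elements than have been requested.
public class ThrottledSubscriber implements Subscriber<Double> {
    private Subscription subscription;

    @Override
    public void onSubscribe(Subscription s) {
        this.subscription = s;
        s.request(100); // initial demand
    }

    @Override
    public void onNext(Double element) {
        process(element);
        subscription.request(1); // ask for one more once there is capacity
    }

    @Override
    public void onError(Throwable t) { t.printStackTrace(); }

    @Override
    public void onComplete() { System.out.println("stream finished"); }

    private void process(Double element) { /* hand off downstream, e.g. to a receiver */ }
}[/java]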
Practical Demonstration and Challenges
Luc’s live demo vividly illustrated the integration process. He presented a dashboard displaying a reactive application computing Pi approximations, with Spark analyzing the generated data in real time. Initially, the system handled 1,000 elements per second efficiently, but as the rate increased to 4,000, processing delays emerged without backpressure, causing data to accumulate in memory. By enabling backpressure, Luc showed how Spark adjusted the ingestion rate, maintaining processing times around one second and preventing system failure. He noted challenges, such as the need to handle variable-sized records, but emphasized that backpressure significantly enhances system reliability.
Future Enhancements
Looking forward, Luc discussed ongoing improvements to Spark’s backpressure mechanism, including better handling of aggregated records and potential integration with Reactive Streams for enhanced pluggability. He encouraged developers to explore Reactive Streams at reactivestreams.org, noting its inclusion in Java 9’s concurrent package. These advancements aim to further streamline the connection between reactive applications and fast data systems, making real-time processing more accessible and robust.
Links:
[ScalaDaysNewYork2016] Scala’s Road Ahead: Shaping the Future of a Versatile Language
Scala, a language renowned for blending functional and object-oriented programming, stands at a pivotal juncture as outlined by its creator, Martin Odersky, in his keynote at Scala Days New York 2016. Martin’s address explored Scala’s unique identity, recent developments like Scala 2.12 and the Scala Center, and the experimental Dotty compiler, offering a vision for the language’s evolution over the next five years. This talk underscored Scala’s commitment to balancing simplicity, power, and theoretical rigor while addressing community needs.
Scala’s Recent Milestones
Martin began by reflecting on Scala’s steady growth, evidenced by increasing job postings and Google Trends for Scala tutorials. The establishment of the Scala Center marks a significant milestone, providing a hub for community collaboration with support from industry leaders like Lightbend and Goldman Sachs. Additionally, Scala 2.12, set for release in mid-2016, optimizes for Java 8, leveraging lambdas and default methods to produce more compact and faster code. This release, with 33 new features and contributions from 65 committers, reflects Scala’s vibrant community and commitment to progress.
The Scala Center: Fostering Community Collaboration
The Scala Center, as Martin described, serves as a steward for Scala, focusing on projects that benefit the entire community. By coordinating contributions and fostering industrial partnerships, it aims to streamline development and ensure Scala’s longevity. While Martin deferred detailed discussion to Heather Miller’s keynote, he emphasized the center’s role in unifying efforts to enhance Scala’s ecosystem, making it a cornerstone for future growth.
Dotty: A New Foundation for Scala
Central to Martin’s vision is Dotty, a new Scala compiler built on the Dependent Object Types (DOT) calculus. This theoretical foundation, proven sound after an eight-year effort, provides a robust basis for evaluating new language features. Dotty, with a leaner codebase of 45,000 lines compared to the current compiler’s 75,000, offers faster compilation and simplifies the language’s internals by encoding complex features like type parameters into a minimal subset. This approach enhances confidence in language evolution, allowing developers to experiment with new constructs without compromising stability.
Evolving Scala’s Libraries
Looking beyond Scala 2.12, Martin outlined plans for Scala 2.13, focusing on revamping the standard library, particularly collections. Inspired by Spark’s lazy evaluation and pair datasets, Scala aims to simplify collections while maintaining compatibility. Proposals include splitting the library into a core module, containing essentials like collections, and a platform module for additional functionalities like JSON handling. This modular approach would enable dynamic updates and broader community contributions, addressing the challenges of maintaining a monolithic library.
Addressing Language Complexity
Martin acknowledged Scala’s reputation for complexity, particularly with features like implicits, which, while powerful, can lead to unexpected behavior if misused. To mitigate this, he proposed style guidelines, such as the principle of least power, encouraging developers to use the simplest constructs necessary. Additionally, he suggested enforcing rules for implicit conversions, limiting them to packages containing the source or target types to reduce surprises. These measures aim to balance Scala’s flexibility with usability, ensuring it remains approachable.
Future Innovations: Simplifying and Strengthening Scala
Martin’s vision for Scala includes several forward-looking features. Implicit function types will reduce boilerplate by abstracting over implicit parameters, while effect systems will treat side effects like exceptions as capabilities, enhancing type safety. Nullable types, modeled as union types, address Scala’s null-related issues, aligning it with modern languages like Kotlin. Generic programming improvements, inspired by libraries like Shapeless, aim to eliminate tuple limitations, and better records will support data engines like Spark. These innovations, grounded in Dotty’s foundations, promise a more robust and intuitive Scala.
Links:
[ScalaDaysNewYork2016] Spark 2.0: Evolving Big Data Processing with Structured APIs
Apache Spark, a cornerstone in big data processing, has significantly shaped the landscape of distributed computing with its functional programming paradigm rooted in Scala. In a keynote address at Scala Days New York 2016, Matei Zaharia, the creator of Spark, elucidated the evolution of Spark’s APIs, culminating in the transformative release of Spark 2.0. This presentation highlighted how Spark has progressed from its initial vision of a unified engine to a more sophisticated platform with structured APIs like DataFrames and Datasets, enabling enhanced performance and usability for developers worldwide.
The Genesis of Spark’s Vision
Spark was conceived with two primary ambitions: to create a unified engine capable of handling diverse big data workloads and to offer a concise, language-integrated API that mirrors working with local data collections. Matei explained that unlike the earlier MapReduce model, which was groundbreaking yet limited, Spark extended its capabilities to support iterative computations, streaming, and interactive data exploration. This unification was critical, as prior to Spark, developers often juggled multiple specialized systems, each with its own complexities, making integration cumbersome. By leveraging Scala’s functional constructs, Spark introduced Resilient Distributed Datasets (RDDs), allowing developers to perform operations like map, filter, and join with ease, abstracting the complexities of distributed computing.
The success of this vision is evident in Spark’s widespread adoption. With over a thousand organizations deploying it, including on clusters as large as 8,000 nodes, Spark has become the most active open-source big data project. Its libraries for SQL, streaming, machine learning, and graph processing have been embraced, with 75% of surveyed organizations using multiple components, demonstrating the power of its unified approach.
Challenges with the Functional API
Despite its strengths, the original RDD-based API presented challenges, particularly in optimization and efficiency. Matei highlighted that the functional API, while intuitive, conceals the semantics of computations, making it difficult for the engine to optimize operations automatically. For instance, operations like groupByKey can lead to inefficient memory usage, as they materialize large intermediate datasets unnecessarily. This issue is exemplified in a word count example where groupByKey creates a sequence of values before summing them, consuming excessive memory when a simpler reduceByKey could suffice.
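The contrast can be made concrete with a word count sketch; this uses Spark's Java RDD API and invented file paths, whereas the talk's example was in Scala:
[java]import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "wordcount");
        JavaRDD<String> lines = sc.textFile("input.txt");

        JavaPairRDD<String, Integer> pairs = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1));

        // Inefficient (shown for contrast): groupByKey materializes every
        // occurrence of a word before summing them.
        pairs.groupByKey()
             .mapValues(values -> { int sum = 0; for (int v : values) sum += v; return sum; });

        // Better: reduceByKey combines values map-side, so far less is shuffled.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);
        counts.saveAsTextFile("counts");
    }
}[/java]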
Moreover, the reliance on Java objects for data storage introduces significant memory overhead. Matei illustrated this with a user class example, where headers, pointers, and padding consume roughly two-thirds of the allocated memory, a critical concern for in-memory computing frameworks like Spark. These challenges underscored the need for a more structured approach to data processing.
Introducing Structured APIs: DataFrames and Datasets
To address these limitations, Spark introduced DataFrames and Datasets, structured APIs built atop the Spark SQL engine. These APIs impose a defined schema on data, enabling the engine to understand and optimize computations more effectively. DataFrames, dynamically typed, resemble tables in a relational database, supporting operations like filtering and aggregation through a domain-specific language (DSL). Datasets, statically typed, extend this concept by aligning closely with Scala’s type system, allowing developers to work with case classes for type safety.
Matei demonstrated how DataFrames enable declarative programming, where operations are expressed as logical plans that Spark optimizes before execution. For example, filtering users by state generates an abstract syntax tree, allowing Spark to optimize the query plan rather than executing operations eagerly. This declarative nature, inspired by data science tools like Pandas, distinguishes Spark’s DataFrames from similar APIs in R and Python, enhancing performance through lazy evaluation and optimization.
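A minimal DataFrame example along those lines, written against the Java API with invented column and file names (the talk's example was in Scala):
[java]import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("users").master("local[*]").getOrCreate();

        Dataset<Row> users = spark.read().json("users.json");

        // Nothing runs yet: filter and groupBy only build a logical plan
        // that the Catalyst optimizer rewrites before execution.
        Dataset<Row> byState = users.filter(col("state").equalTo("CA"))
                                    .groupBy(col("state"))
                                    .count();
        byState.explain(); // show the optimized plan
        byState.show();
    }
}[/java]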
Optimizing Performance with Project Tungsten
A significant focus of Spark 2.0 is Project Tungsten, which addresses the shifting bottlenecks in big data systems. Matei noted that while I/O was the primary constraint in 2010, advancements in storage (SSDs) and networking (10-40 gigabit) have shifted the focus to CPU efficiency. Tungsten employs three strategies: runtime code generation, cache locality exploitation, and off-heap memory management. By encoding data in a compact binary format, Spark reduces memory overhead compared to Java objects. Code generation, facilitated by the Catalyst optimizer, produces specialized bytecode that operates directly on binary data, improving CPU performance. These optimizations ensure Spark can leverage modern hardware trends, delivering significant performance gains.
Structured Streaming: A Unified Approach to Real-Time Processing
Spark 2.0 introduces structured streaming, a high-level API that extends the benefits of DataFrames and Datasets to streaming computations. Matei emphasized that real-world streaming applications often involve batch and interactive workloads, such as updating a database for a web application or applying a machine learning model. Structured streaming treats streams as infinite DataFrames, allowing developers to use familiar APIs to define computations. The engine then incrementally executes these plans, maintaining state and handling late data efficiently. For instance, a batch job grouping data by user ID can be adapted to streaming by changing the input source, with Spark automatically updating results as new data arrives.
This approach simplifies the development of continuous applications, enabling seamless integration of streaming, batch, and interactive processing within a single API, a capability that sets Spark apart from other streaming engines.
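Sketched with the same assumptions (Java API, invented paths and column names), the batch-style group-by-user aggregation described above becomes a streaming query mostly by switching the source and sink:
[java]import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class StructuredStreamingDemo {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("events").master("local[*]").getOrCreate();

        // Streaming sources need a declared schema
        StructType schema = new StructType()
                .add("userId", DataTypes.StringType)
                .add("event", DataTypes.StringType);

        // Same DataFrame operations as a batch job; only the source differs.
        Dataset<Row> events = spark.readStream()
                .format("json")
                .schema(schema)
                .load("events/");

        Dataset<Row> perUser = events.groupBy("userId").count();

        // Spark executes the plan incrementally as new files land in events/.
        StreamingQuery query = perUser.writeStream()
                .outputMode("complete")
                .format("console")
                .start();
        query.awaitTermination();
    }
}[/java]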
Future Directions and Community Engagement
Looking ahead, Matei outlined Spark’s commitment to evolving its APIs while maintaining compatibility. The structured APIs will serve as the foundation for new libraries, facilitating interoperability across languages like Python and R. Additionally, Spark’s data source API allows applications to seamlessly switch between storage systems like Hive, Cassandra, or JSON, enhancing flexibility. Matei also encouraged community participation, noting that Databricks offers a free Community Edition with tutorials to help developers explore Spark’s capabilities.
Links:
(long tweet) When ‘filter’ does not work with Primefaces’ datatable
Abstract
Sometimes, the filter function in Primefaces' <p:dataTable/> does not work when the field on which the filtering operates is typed as an enum.
Explanation
To filter, Primefaces relies on a direct '=' comparison, which for an enum amounts to a reference check. The workaround is to force Primefaces to compare on the enum name instead.
Quick fix
In the enum class, add the following block:
[java]public String getName(){ return name(); }[/java]
Make the dataTable declaration look like this:
[xml]<p:dataTable id="castorsDT" var="castor" value="#{managedCastorListManagedBean.initiatedCastors}" widgetVar="castorsTable" filteredValue="#{managedCastorListManagedBean.filteredCastors}">[/xml]
Declare the enum-filtered column like this:
[xml]<p:column sortBy="#{castor.castorWorkflowStatus}" filterable="true" filterBy="#{castor.castorWorkflowStatus.name}" filterMatchMode="in">
<f:facet name="filter">
<p:selectCheckboxMenu label="#{messages['status']}" onchange="PF('castorsTable').filter()">
<f:selectItems value="#{transverseManagedBean.allCastorWorkflowStatuses}" var="cws" itemLabel="#{cws.name}" itemValue="#{cws.name}"/>
</p:selectCheckboxMenu>
</f:facet>
</p:column>[/xml]
Notice how the filtering attribute is declared:
[xml]filterable="true" filterBy="#{castor.castorWorkflowStatus.name}" filterMatchMode="in"[/xml]
In other terms, the comparison is forced to rely on equals() of class String, through the calls to getName() and name().