Posts Tagged ‘BigData’

[DevoxxPL2022] Accelerating Big Data: Modern Trends Enable Product Analytics • Boris Trofimov

Boris Trofimov, a big data expert from Sigma Software, delivered an insightful presentation at Devoxx Poland 2022, exploring modern trends in big data that enhance product analytics. With experience building high-load systems like the AOL data platform for Verizon Media, Boris provided a comprehensive overview of how data platforms are evolving. His talk covered architectural innovations, data governance, and the shift toward serverless and ELT (Extract, Load, Transform) paradigms, offering actionable insights for developers navigating the complexities of big data.

The Evolving Role of Data Platforms

Boris began by demystifying big data, often misconstrued as a magical solution for business success. He clarified that big data resides within data platforms, which handle ingestion, processing, and analytics. These platforms typically include data sources, ETL (Extract, Transform, Load) pipelines, data lakes, and data warehouses. Boris highlighted the growing visibility of big data beyond its traditional boundaries, with data engineers playing increasingly critical roles. He noted the rise of cross-functional teams, inspired by Martin Fowler’s ideas, where subdomains drive team composition, fostering collaboration between data and backend engineers.

The convergence of big data and backend practices was a key theme. Boris pointed to technologies like Apache Kafka and Spark, which are now shared across both domains, enabling mutual learning. He emphasized that modern data platforms must balance complexity with efficiency, requiring specialized expertise to avoid pitfalls like project failures due to inadequate practices.

Architectural Innovations: From Lambda to Delta

Boris delved into big data architectures, starting with the Lambda architecture, which separates data processing into speed (real-time) and batch layers for high availability. While effective, Lambda’s complexity increases development and maintenance costs. As an alternative, he introduced the Kappa architecture, which simplifies processing by using a single streaming layer, reducing latency but potentially sacrificing availability. Boris then highlighted the emerging Delta architecture, which leverages data lakehouses—hybrid systems combining data lakes and warehouses. Technologies like Snowflake and Databricks support Delta, minimizing data hops and enabling both batch and streaming workloads with a single storage layer.

The Delta architecture’s rise reflects the growing popularity of data lakehouses, which Boris praised for their ability to handle raw, processed, and aggregated data efficiently. By reducing technological complexity, Delta enables faster development and lower maintenance, making it a compelling choice for modern data platforms.
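To make the single-storage-layer idea concrete, here is a minimal Scala sketch assuming Spark with the open-source Delta Lake format (the talk names Databricks and Snowflake rather than any specific API); the table path, sample columns, and configuration are hypothetical illustrations, not code from the presentation.

```scala
import org.apache.spark.sql.SparkSession

object DeltaLakehouseSketch {
  def main(args: Array[String]): Unit = {
    // Requires the Delta Lake (delta-spark / delta-core) dependency on the classpath.
    val spark = SparkSession.builder()
      .appName("delta-lakehouse-sketch")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    import spark.implicits._

    val tablePath = "/tmp/lakehouse/events" // hypothetical location

    // Batch write: land events into the single Delta storage layer.
    Seq(("u1", "click"), ("u2", "view")).toDF("userId", "event")
      .write.format("delta").mode("append").save(tablePath)

    // Streaming read: the same table also feeds a continuous query,
    // so batch and streaming workloads share one copy of the data.
    spark.readStream.format("delta").load(tablePath)
      .groupBy("event").count()
      .writeStream.outputMode("complete").format("console").start()
      .awaitTermination()
  }
}
```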

Data Mesh and Governance

Boris introduced data mesh as a response to monolithic data architectures, drawing parallels with domain-driven design. Data mesh advocates for breaking down data platforms into bounded contexts, each owned by a dedicated team responsible for its pipelines and decisions. This approach avoids the pitfalls of monolithic pipelines, such as chaotic dependencies and scalability issues. Boris outlined four “temptations” to avoid: building monolithic pipelines, combining all pipelines into one application, creating chaotic pipeline networks, and mixing domains in data tables. Data mesh, he argued, promotes modularity and ownership, treating data as a product.

Data governance, or “data excellence,” was another critical focus. Boris stressed the importance of practices like data monitoring, quality validation, and retention policies. He advocated for a proactive approach, where engineers address these concerns early to ensure platform reliability and cost-efficiency. By treating data governance as a checklist, teams can mitigate risks and enhance platform maturity.
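As one illustration of treating governance as a checklist, the following is a small Scala/Spark sketch of an automated quality gate; the column names, thresholds, and dataset are placeholders of my own, not anything Boris showed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DataQualityCheck {
  // Fail fast if a key column is missing or duplicated beyond a tolerated rate.
  def validate(df: DataFrame, keyColumn: String, maxNullRate: Double = 0.01): Unit = {
    val total     = df.count().toDouble.max(1.0)
    val nullCount = df.filter(col(keyColumn).isNull).count()
    val dupCount  = total.toLong - df.dropDuplicates(keyColumn).count()

    require(nullCount / total <= maxNullRate,
      s"Null rate for $keyColumn exceeds ${maxNullRate * 100}%")
    require(dupCount == 0, s"Found $dupCount duplicate values in $keyColumn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dq-check").master("local[*]").getOrCreate()
    import spark.implicits._
    val events = Seq(("e1", "u1"), ("e2", "u2")).toDF("eventId", "userId")
    validate(events, keyColumn = "eventId") // hypothetical dataset and key
  }
}
```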

Serverless and ELT: Simplifying Big Data

Boris highlighted the shift toward serverless technologies and ELT paradigms. Serverless solutions, available across transformation, storage, and analytics tiers, reduce infrastructure management burdens, allowing faster time-to-market. He cited AWS and other cloud providers as enablers, noting that while not always cost-effective, serverless minimizes maintenance efforts. Similarly, ELT—where transformation occurs after loading data into a warehouse—leverages modern databases like Snowflake and BigQuery. Unlike traditional ETL, ELT reduces latency and complexity by using database capabilities for transformations, making it ideal for early-stage projects.
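A minimal sketch of the load-then-transform order, using Spark SQL in Scala as a stand-in for a warehouse such as Snowflake or BigQuery; the table and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object EltSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("elt-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Extract + Load: land the raw records as-is, with no upfront transformation.
    Seq(("u1", "2022-06-01", 120), ("u2", "2022-06-01", 80))
      .toDF("userId", "day", "durationSec")
      .createOrReplaceTempView("raw_sessions")

    // Transform: business logic runs inside the "warehouse" as plain SQL,
    // after loading, which is the defining trait of ELT.
    val dailyUsage = spark.sql(
      """SELECT day, COUNT(DISTINCT userId) AS activeUsers, SUM(durationSec) AS totalSec
        |FROM raw_sessions
        |GROUP BY day""".stripMargin)

    dailyUsage.show()
  }
}
```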

Boris also noted the resurgence of SQL as a domain-specific language across big data tiers, from transformation to governance. By building frameworks that express business logic in SQL, developers can accelerate feature delivery, despite SQL’s perceived limitations. He emphasized that well-designed SQL queries can be powerful, provided engineers avoid poorly structured code.

Productizing Big Data and Business Intelligence

The final trend Boris explored was the productization of big data solutions. He likened this to Intel’s microprocessor revolution, where standardized components accelerated hardware development. Companies like Absorber offer “data platform as a service,” enabling rapid construction of data pipelines through drag-and-drop interfaces. While limited for complex use cases, such solutions cater to organizations seeking quick deployment. Boris also discussed the rise of serverless business intelligence (BI) tools, which support ELT and allow cross-cloud data queries. These tools, like Mode and Tableau, enable self-service analytics, reducing the need for custom platforms in early stages.

Links:

[DevoxxPL2022] Before It’s Too Late: Finding Real-Time Holes in Data • Chayim Kirshen

Chayim Kirshen, a veteran of the startup ecosystem and client manager at Redis, captivated audiences at Devoxx Poland 2022 with a dynamic exploration of real-time data pipeline challenges. Drawing from his experience with high-stakes environments, including a 2010 stock exchange meltdown, Chayim outlined strategies to ensure data integrity and performance in large-scale systems. His talk provided actionable insights for developers, emphasizing the importance of storing raw data, parsing in real time, and leveraging technologies like Redis to address data inconsistencies.

The Perils of Unclean Data

Chayim began with a stark reality: data is rarely clean. Recounting a 2010 incident where hackers compromised a major stock exchange’s API, he highlighted the cascading effects of unreliable data on real-time markets. Data pipelines face issues like inconsistent formats (CSV, JSON, XML), changing sources (e.g., shifting API endpoints), and service reliability, with modern systems often tolerating over a thousand minutes of downtime annually. These challenges disrupt real-time processing, critical for applications like stock exchanges or ad bidding networks requiring sub-100ms responses. Chayim advocated treating data as programmable code, enabling developers to address issues systematically rather than reactively.

Building Robust Data Pipelines

To tackle these issues, Chayim proposed a structured approach to data pipeline design. Storing raw data indefinitely—whether in S3, Redis, or other storage—ensures a fallback for reprocessing. Parsing data in real time, using defined schemas, allows immediate usability while preserving raw inputs. Bulk changes, such as SQL bulk inserts or Redis pipelines, reduce network overhead, critical for high-throughput systems. Chayim emphasized scheduling regular backfills to re-import historical data, ensuring consistency despite source changes. For example, a stock exchange’s ticker symbol updates (e.g., Fitbit to Google) require ongoing reprocessing to maintain accuracy. Horizontal scaling, using disposable nodes, enhances availability and resilience, avoiding single points of failure.
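To illustrate the "store raw, write in bulk" advice, here is a small Scala sketch using the Jedis client for Redis; the key name and JSON payloads are hypothetical, and a production pipeline would of course add parsing and backfill jobs around this.

```scala
import redis.clients.jedis.Jedis

object RawIngestSketch {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("localhost", 6379)
    try {
      // Raw payloads are kept verbatim so they can always be reprocessed later.
      val rawTicks: Seq[String] = Seq(
        """{"symbol":"GOOG","bid":101.2,"ask":101.4}""",
        """{"symbol":"FIT","bid":7.1,"ask":7.2}"""
      )

      // Bulk write through a pipeline: one round trip instead of one per record,
      // keeping network overhead low for high-throughput ingestion.
      val pipeline = jedis.pipelined()
      rawTicks.foreach(tick => pipeline.rpush("raw:ticks", tick))
      pipeline.sync()
    } finally {
      jedis.close()
    }
  }
}
```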

Real-Time Enrichment and Redis Integration

Data enrichment, such as calculating stock bid-ask spreads or market cap changes, should occur post-ingestion to avoid slowing the pipeline. Chayim showcased Redis, particularly its Gears and JSON modules, for real-time data processing. Redis acts as a buffer, storing raw JSON and replicating it to traditional databases like PostgreSQL or MySQL. Using Redis Gears, developers can execute functions within the database, minimizing network costs and enabling rapid enrichment. For instance, calculating a stock’s daily percentage change can run directly in Redis, streamlining analytics. Chayim highlighted Python-based tools like Celery for scheduling backfills and enrichments, allowing asynchronous processing and failure retries without disrupting the main pipeline.
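Since Redis Gears functions execute inside the database (and are typically written in Python), the sketch below shows only the enrichment arithmetic the talk mentions, bid-ask spread and daily percentage change, as plain Scala applied after ingestion; the field names are my own assumptions.

```scala
object EnrichmentSketch {
  // Raw quote as parsed from the stored JSON; fields are hypothetical.
  final case class Quote(symbol: String, bid: Double, ask: Double, previousClose: Double)

  final case class EnrichedQuote(symbol: String, spread: Double, dailyChangePct: Double)

  // Enrichment runs after ingestion so it never slows the write path.
  def enrich(q: Quote): EnrichedQuote = {
    val mid = (q.bid + q.ask) / 2.0
    EnrichedQuote(
      symbol = q.symbol,
      spread = q.ask - q.bid,
      dailyChangePct = (mid - q.previousClose) / q.previousClose * 100.0
    )
  }

  def main(args: Array[String]): Unit = {
    val sample = Quote("GOOG", bid = 101.2, ask = 101.4, previousClose = 100.0)
    println(enrich(sample)) // spread ≈ 0.2, dailyChangePct = 1.3
  }
}
```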

Scaling and Future-Proofing

Chayim stressed horizontal scaling to distribute workloads geographically, placing data closer to users for low-latency access, as seen in ad networks. By using Redis for real-time writes and offloading to workers via Celery, developers can manage millions of daily entries, such as stock ticks, without performance bottlenecks. Scheduled backfills address data gaps, like API schema changes (e.g., integer to string conversions), by reprocessing raw data. This approach, combined with infrastructure-as-code tools like Terraform, ensures scalability and adaptability, allowing organizations to focus on business logic rather than data management overhead.

Links:

[ScalaDaysNewYork2016] Large-Scale Graph Analysis with Scala and Akka

Ben Fonarov, a Big Data specialist at Capital One, presented a compelling case study at Scala Days New York 2016 on building a large-scale graph analysis engine using Scala, Akka, and HBase. Ben detailed the architecture and implementation of Athena, a distributed time-series graph system designed to deliver integrated, real-time data to enterprise users, addressing the challenges of data overload in a banking environment.

Addressing Enterprise Data Needs

Ben Fonarov opened by outlining the motivation behind Athena: the need to provide integrated, real-time data to users at Capital One. Unlike traditional table-based thinking, Athena represents data as a graph, modeling entities like accounts and transactions to align with business concepts. Ben highlighted the challenges of data overload, with multiple data warehouses and ETL processes generating vast datasets. Athena’s visual interface allows users to define graph schemas, ensuring data is accessible in a format that matches their mental models.

Architectural Considerations

Ben described two architectural approaches to building Athena. The naive implementation used a single actor to process queries, which was insufficient for production-scale loads. The robust solution leveraged an Akka cluster, distributing query processing across nodes for scalability. A query parser translated user requests into graph traversals, while actors managed tasks and streamed results to users. This design ensured low latency and scalability, handling up to 200 billion nodes efficiently.
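A highly simplified Scala sketch of this coordinator-plus-workers shape using classic Akka actors; the message types and the in-memory traversal are hypothetical stand-ins for Athena's query parser and HBase-backed workers, not the actual implementation.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical messages: a user query, the per-node traversal tasks it fans out to,
// and the partial results streamed back as they complete.
final case class GraphQuery(startNodes: Seq[String])
final case class Traverse(nodeId: String)
final case class PartialResult(nodeId: String, neighbours: Seq[String])

class TraversalWorker extends Actor {
  def receive: Receive = {
    case Traverse(nodeId) =>
      // Placeholder lookup; a real worker would read adjacency data from HBase.
      sender() ! PartialResult(nodeId, Seq(s"$nodeId-a", s"$nodeId-b"))
  }
}

class QueryCoordinator extends Actor {
  private val worker = context.actorOf(Props[TraversalWorker](), "worker")

  def receive: Receive = {
    case GraphQuery(startNodes) =>
      // Fan the parsed query out as independent traversal tasks.
      startNodes.foreach(id => worker ! Traverse(id))
    case PartialResult(nodeId, neighbours) =>
      // Results are forwarded incrementally instead of being batched.
      println(s"$nodeId -> ${neighbours.mkString(", ")}")
  }
}

object AthenaStyleSketch extends App {
  val system = ActorSystem("graph-sketch")
  val coordinator = system.actorOf(Props[QueryCoordinator](), "coordinator")
  coordinator ! GraphQuery(Seq("account:1", "account:2"))
}
```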

Streaming and Optimization

A key feature of Athena, Ben explained, was its ability to stream results in real time, avoiding the batch processing limitations of frameworks like TinkerPop’s Gremlin. By using Akka’s actor-based concurrency, Athena processes queries incrementally, delivering results as they are computed. Ben discussed optimizations, such as limiting the number of nodes per actor to prevent bottlenecks, and plans to integrate graph algorithms like PageRank to enhance analytical capabilities.

Future Directions and Community Engagement

Ben concluded by sharing future plans for Athena, including adopting a Gremlin-like DSL for graph traversals and integrating with tools like Spark and H2O. He emphasized the importance of community feedback, inviting developers to join Capital One’s data team to contribute to Athena’s evolution. Running on AWS EC2, Athena represents a scalable solution for enterprise graph analysis, poised to transform how banks handle complex data relationships.

Links:

[ScalaDaysNewYork2016] Spark 2.0: Evolving Big Data Processing with Structured APIs

Apache Spark, a cornerstone in big data processing, has significantly shaped the landscape of distributed computing with its functional programming paradigm rooted in Scala. In a keynote address at Scala Days New York 2016, Matei Zaharia, the creator of Spark, elucidated the evolution of Spark’s APIs, culminating in the transformative release of Spark 2.0. This presentation highlighted how Spark has progressed from its initial vision of a unified engine to a more sophisticated platform with structured APIs like DataFrames and Datasets, enabling enhanced performance and usability for developers worldwide.

The Genesis of Spark’s Vision

Spark was conceived with two primary ambitions: to create a unified engine capable of handling diverse big data workloads and to offer a concise, language-integrated API that mirrors working with local data collections. Matei explained that unlike the earlier MapReduce model, which was groundbreaking yet limited, Spark extended its capabilities to support iterative computations, streaming, and interactive data exploration. This unification was critical, as prior to Spark, developers often juggled multiple specialized systems, each with its own complexities, making integration cumbersome. By leveraging Scala’s functional constructs, Spark introduced Resilient Distributed Datasets (RDDs), allowing developers to perform operations like map, filter, and join with ease, abstracting the complexities of distributed computing.
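As a flavour of that collection-like style, here is a minimal RDD sketch in Scala; the dataset and operations are illustrative rather than taken from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasicsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // RDD operations read like operations on a local collection,
    // while the engine handles partitioning and distribution.
    val lines = sc.parallelize(Seq(
      "spark unifies batch and streaming",
      "rdds look like collections"))
    val words     = lines.flatMap(_.split(" "))
    val longWords = words.filter(_.length > 4).map(_.toUpperCase)

    longWords.collect().foreach(println)
    sc.stop()
  }
}
```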

The success of this vision is evident in Spark’s widespread adoption. With over a thousand organizations deploying it, including on clusters as large as 8,000 nodes, Spark has become the most active open-source big data project. Its libraries for SQL, streaming, machine learning, and graph processing have been embraced, with 75% of surveyed organizations using multiple components, demonstrating the power of its unified approach.

Challenges with the Functional API

Despite its strengths, the original RDD-based API presented challenges, particularly in optimization and efficiency. Matei highlighted that the functional API, while intuitive, conceals the semantics of computations, making it difficult for the engine to optimize operations automatically. For instance, operations like groupByKey can lead to inefficient memory usage, as they materialize large intermediate datasets unnecessarily. This issue is exemplified in a word count example where groupByKey creates a sequence of values before summing them, consuming excessive memory when a simpler reduceByKey could suffice.
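The contrast can be sketched directly; assuming a local SparkContext, both versions below produce the same counts, but only the first materializes every value per key before summing.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountComparison {
  def main(args: Array[String]): Unit = {
    val sc    = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))
    val pairs = sc.parallelize("a b a c b a".split(" ").toSeq).map(word => (word, 1))

    // groupByKey materializes every value for a key before summing,
    // shuffling and holding large intermediate sequences in memory.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side first, so only running totals are shuffled.
    val viaReduce = pairs.reduceByKey(_ + _)

    println(viaGroup.collect().toMap)  // Map(a -> 3, b -> 2, c -> 1)
    println(viaReduce.collect().toMap) // same result, far less intermediate data
    sc.stop()
  }
}
```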

Moreover, the reliance on Java objects for data storage introduces significant memory overhead. Matei illustrated this with a user class example, where headers, pointers, and padding consume roughly two-thirds of the allocated memory, a critical concern for in-memory computing frameworks like Spark. These challenges underscored the need for a more structured approach to data processing.

Introducing Structured APIs: DataFrames and Datasets

To address these limitations, Spark introduced DataFrames and Datasets, structured APIs built atop the Spark SQL engine. These APIs impose a defined schema on data, enabling the engine to understand and optimize computations more effectively. DataFrames, dynamically typed, resemble tables in a relational database, supporting operations like filtering and aggregation through a domain-specific language (DSL). Datasets, statically typed, extend this concept by aligning closely with Scala’s type system, allowing developers to work with case classes for type safety.

Matei demonstrated how DataFrames enable declarative programming, where operations are expressed as logical plans that Spark optimizes before execution. For example, filtering users by state generates an abstract syntax tree, allowing Spark to optimize the query plan rather than executing operations eagerly. This declarative nature, inspired by data science tools like Pandas, distinguishes Spark’s DataFrames from similar APIs in R and Python, enhancing performance through lazy evaluation and optimization.
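A brief sketch of the two APIs side by side, with a hypothetical User case class standing in for the example Matei described; the filter on a DataFrame builds a logical plan for Catalyst, while the Dataset version is an ordinary, compile-time-checked Scala predicate.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical record type; the talk's example filters users by state.
final case class User(name: String, state: String, age: Int)

object StructuredApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("structured-apis").master("local[*]").getOrCreate()
    import spark.implicits._

    val users = Seq(User("Ann", "NY", 34), User("Bo", "CA", 28)).toDS()

    // DataFrame: dynamically typed; column names are checked only at runtime,
    // and the filter is optimized lazily as part of a logical plan.
    val dfNewYorkers = users.toDF().filter(col("state") === "NY")

    // Dataset: statically typed; the same predicate is a plain function over the case class.
    val dsNewYorkers = users.filter(_.state == "NY")

    dfNewYorkers.show()
    dsNewYorkers.show()
  }
}
```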

Optimizing Performance with Project Tungsten

A significant focus of Spark 2.0 is Project Tungsten, which addresses the shifting bottlenecks in big data systems. Matei noted that while I/O was the primary constraint in 2010, advancements in storage (SSDs) and networking (10-40 gigabit) have shifted the focus to CPU efficiency. Tungsten employs three strategies: runtime code generation, cache locality exploitation, and off-heap memory management. By encoding data in a compact binary format, Spark reduces memory overhead compared to Java objects. Code generation, facilitated by the Catalyst optimizer, produces specialized bytecode that operates directly on binary data, improving CPU performance. These optimizations ensure Spark can leverage modern hardware trends, delivering significant performance gains.

Structured Streaming: A Unified Approach to Real-Time Processing

Spark 2.0 introduces structured streaming, a high-level API that extends the benefits of DataFrames and Datasets to streaming computations. Matei emphasized that real-world streaming applications often involve batch and interactive workloads, such as updating a database for a web application or applying a machine learning model. Structured streaming treats streams as infinite DataFrames, allowing developers to use familiar APIs to define computations. The engine then incrementally executes these plans, maintaining state and handling late data efficiently. For instance, a batch job grouping data by user ID can be adapted to streaming by changing the input source, with Spark automatically updating results as new data arrives.
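A minimal sketch of that batch-to-streaming symmetry, with hypothetical input paths and schema; only the input source changes, the grouping logic is identical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object StreamingGroupBySketch {
  def main(args: Array[String]): Unit = {
    val spark  = SparkSession.builder().appName("structured-streaming").master("local[*]").getOrCreate()
    val schema = StructType(Seq(StructField("userId", StringType), StructField("event", StringType)))

    // Batch version: group a static table of events by user.
    val batchCounts = spark.read.schema(schema).json("/data/events") // hypothetical path
      .groupBy("userId").count()
    batchCounts.show()

    // Streaming version: the same query over an unbounded source;
    // Spark keeps the aggregate updated as new files arrive.
    val streamCounts = spark.readStream.schema(schema).json("/data/events-stream") // hypothetical path
      .groupBy("userId").count()

    streamCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```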

This approach simplifies the development of continuous applications, enabling seamless integration of streaming, batch, and interactive processing within a single API, a capability that sets Spark apart from other streaming engines.

Future Directions and Community Engagement

Looking ahead, Matei outlined Spark’s commitment to evolving its APIs while maintaining compatibility. The structured APIs will serve as the foundation for new libraries, facilitating interoperability across languages like Python and R. Additionally, Spark’s data source API allows applications to seamlessly switch between storage systems like Hive, Cassandra, or JSON, enhancing flexibility. Matei also encouraged community participation, noting that Databricks offers a free Community Edition with tutorials to help developers explore Spark’s capabilities.

Links:

[DevoxxBE2013] Building Hadoop Big Data Applications

Tom White, an Apache Hadoop committer and author of Hadoop: The Definitive Guide, explores the complexities of building big data applications with Hadoop. As an engineer at Cloudera, Tom introduces the Cloudera Development Kit (CDK), an open-source project simplifying Hadoop application development. His session navigates common pitfalls, best practices, and CDK’s role in streamlining data processing across Hadoop’s ecosystem.

Hadoop’s growth has introduced diverse components like Hive and Impala, challenging developers to choose appropriate tools. Tom demonstrates CDK’s unified abstractions, enabling seamless integration across engines, and shares practical examples of low-latency queries and fault-tolerant batch processing.

Navigating Hadoop’s Ecosystem

Tom outlines Hadoop’s complexity: HDFS, MapReduce, Hive, and Impala serve distinct purposes. He highlights pitfalls like schema mismatches across tools. CDK abstracts these, allowing a single dataset definition for Hive and Impala.

This unification, Tom shows, reduces errors and streamlines development.

Best Practices for Application Development

Tom advocates defining datasets in Java, ensuring compatibility across engines. He demonstrates CDK’s API, creating a dataset accessible by both Hive’s batch transforms and Impala’s low-latency queries.

Best practices include modular schemas and automated metadata synchronization, minimizing manual refreshes.

CDK’s Role in Simplifying Development

The CDK, Tom explains, centralizes dataset management. A live demo shows indexing data for Impala’s millisecond-range queries and Hive’s fault-tolerant ETL processes. This abstraction enhances productivity, letting developers focus on logic.

Tom notes ongoing CDK improvements, like automatic metastore refreshes, enhancing usability.

Choosing Between Hive and Impala

Tom contrasts Impala’s low-latency, non-fault-tolerant queries with Hive’s robust batch processing. For ad-hoc summaries, Impala excels; for ETL transforms, Hive’s fault tolerance shines.

He demonstrates a CDK dataset serving both, offering flexibility for diverse workloads.

Links: