Posts Tagged ‘BigData’

[DevoxxPL2022] Accelerating Big Data: Modern Trends Enable Product Analytics • Boris Trofimov

Boris Trofimov, a big data expert from Sigma Software, delivered an insightful presentation at Devoxx Poland 2022, exploring modern trends in big data that enhance product analytics. With experience building high-load systems like the AOL data platform for Verizon Media, Boris provided a comprehensive overview of how data platforms are evolving. His talk covered architectural innovations, data governance, and the shift toward serverless and ELT (Extract, Load, Transform) paradigms, offering actionable insights for developers navigating the complexities of big data.

The Evolving Role of Data Platforms

Boris began by demystifying big data, often misconstrued as a magical solution for business success. He clarified that big data resides within data platforms, which handle ingestion, processing, and analytics. These platforms typically include data sources, ETL (Extract, Transform, Load) pipelines, data lakes, and data warehouses. Boris highlighted the growing visibility of big data beyond its traditional boundaries, with data engineers playing increasingly critical roles. He noted the rise of cross-functional teams, inspired by Martin Fowler’s ideas, where subdomains drive team composition, fostering collaboration between data and backend engineers.

The convergence of big data and backend practices was a key theme. Boris pointed to technologies like Apache Kafka and Spark, which are now shared across both domains, enabling mutual learning. He emphasized that modern data platforms must balance complexity with efficiency, requiring specialized expertise to avoid pitfalls like project failures due to inadequate practices.

Architectural Innovations: From Lambda to Delta

Boris delved into big data architectures, starting with the Lambda architecture, which separates data processing into speed (real-time) and batch layers for high availability. While effective, Lambda’s complexity increases development and maintenance costs. As an alternative, he introduced the Kappa architecture, which simplifies processing by using a single streaming layer, reducing latency but potentially sacrificing availability. Boris then highlighted the emerging Delta architecture, which leverages data lakehouses—hybrid systems combining data lakes and warehouses. Technologies like Snowflake and Databricks support Delta, minimizing data hops and enabling both batch and streaming workloads with a single storage layer.
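To make the single-storage-layer idea concrete, here is a minimal sketch in PySpark with the open-source Delta Lake format, where one Delta table backs both a batch aggregation and a streaming append. The bucket path, Kafka broker and topic, and column names are illustrative assumptions, not details from the talk.

```python
# Minimal Delta-architecture sketch: batch and streaming share one Delta table.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("delta-architecture-sketch")
    # Requires the delta-spark package on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events_path = "s3://example-bucket/lakehouse/events"  # hypothetical location

# Batch workload: aggregate directly over the lakehouse table.
daily_totals = (
    spark.read.format("delta").load(events_path)
    .groupBy("event_date", "product_id")
    .agg(F.count("*").alias("events"))
)

# Streaming workload: append new events to the same table, so there is no
# extra hop between a data lake copy and a warehouse copy.
incoming = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
    .option("subscribe", "product-events")             # assumed topic
    .load()
)

query = (
    incoming.selectExpr(
        "CAST(value AS STRING) AS payload",
        "CAST(key AS STRING) AS product_id",
        "CAST(timestamp AS DATE) AS event_date",
    )
    .writeStream.format("delta")
    .option("checkpointLocation", events_path + "/_checkpoints")
    .start(events_path)
)
```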

The Delta architecture’s rise reflects the growing popularity of data lakehouses, which Boris praised for their ability to handle raw, processed, and aggregated data efficiently. By reducing technological complexity, Delta enables faster development and lower maintenance, making it a compelling choice for modern data platforms.

Data Mesh and Governance

Boris introduced data mesh as a response to monolithic data architectures, drawing parallels with domain-driven design. Data mesh advocates for breaking down data platforms into bounded contexts, each owned by a dedicated team responsible for its pipelines and decisions. This approach avoids the pitfalls of monolithic pipelines, such as chaotic dependencies and scalability issues. Boris outlined four “temptations” to avoid: building monolithic pipelines, combining all pipelines into one application, creating chaotic pipeline networks, and mixing domains in data tables. Data mesh, he argued, promotes modularity and ownership, treating data as a product.

Data governance, or “data excellence,” was another critical focus. Boris stressed the importance of practices like data monitoring, quality validation, and retention policies. He advocated for a proactive approach, where engineers address these concerns early to ensure platform reliability and cost-efficiency. By treating data governance as a checklist, teams can mitigate risks and enhance platform maturity.
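As a sketch of what such a checklist can look like in practice, the fragment below runs a few automated quality checks before a table is published. It assumes PySpark with a Delta table; the table location, column names, and thresholds are hypothetical.

```python
# Governance-as-a-checklist sketch: completeness, volume, and freshness checks.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("governance-checks").getOrCreate()
orders = spark.read.format("delta").load("s3://example-bucket/silver/orders")

latest = orders.agg(F.max("ingested_at")).first()[0]  # assumed UTC timestamp

checks = {
    # Completeness: key business columns must not be null.
    "no_null_order_ids": orders.filter(F.col("order_id").isNull()).count() == 0,
    # Volume sanity: an empty load usually means a broken upstream source.
    "non_empty_load": orders.count() > 0,
    # Freshness: the newest record should be less than 24 hours old.
    "fresh_within_24h": latest is not None
                        and latest > datetime.utcnow() - timedelta(hours=24),
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    # In a mature platform this would alert the owning team or block publication.
    raise RuntimeError(f"Data quality checks failed: {failed}")
```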

Serverless and ELT: Simplifying Big Data

Boris highlighted the shift toward serverless technologies and ELT paradigms. Serverless solutions, available across transformation, storage, and analytics tiers, reduce infrastructure management burdens, allowing faster time-to-market. He cited AWS and other cloud providers as enablers, noting that while not always cost-effective, serverless minimizes maintenance efforts. Similarly, ELT—where transformation occurs after loading data into a warehouse—leverages modern databases like Snowflake and BigQuery. Unlike traditional ETL, ELT reduces latency and complexity by using database capabilities for transformations, making it ideal for early-stage projects.
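A minimal ELT sketch, assuming BigQuery as the warehouse and the google-cloud-bigquery Python client: raw JSON files are loaded untouched into a staging table, and the transformation then runs as SQL inside the warehouse. The bucket, dataset, and table names are hypothetical.

```python
# ELT: extract and load first, transform afterwards inside the warehouse.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Load: copy raw JSON files into a staging table without reshaping them.
load_job = client.load_table_from_uri(
    "gs://example-bucket/raw/events/*.json",
    "analytics.raw_events",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish

# 2. Transform: let the warehouse engine do the heavy lifting in SQL.
client.query(
    """
    CREATE OR REPLACE TABLE analytics.daily_active_users AS
    SELECT DATE(event_time) AS day,
           COUNT(DISTINCT user_id) AS active_users
    FROM analytics.raw_events
    GROUP BY day
    """
).result()
```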

Boris also noted the resurgence of SQL as a domain-specific language across big data tiers, from transformation to governance. By building frameworks that express business logic in SQL, developers can accelerate feature delivery, despite SQL’s perceived limitations. He emphasized that well-designed SQL queries can be powerful, provided engineers avoid poorly structured code.
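As a sketch of that idea, business metrics can live as small SQL definitions while a thin Python wrapper materializes them; the metric names and tables reuse the hypothetical BigQuery setup above.

```python
# Business logic expressed as SQL; Python only orchestrates.
from google.cloud import bigquery

METRICS = {
    "daily_revenue": """
        SELECT DATE(order_time) AS day, SUM(amount) AS revenue
        FROM analytics.orders
        GROUP BY day
    """,
    "daily_conversion": """
        SELECT DATE(event_time) AS day,
               COUNTIF(event_type = 'purchase') / COUNT(*) AS conversion
        FROM analytics.raw_events
        GROUP BY day
    """,
}

def materialize(client: bigquery.Client, metric: str) -> None:
    """Render one metric as a table so BI tools can query it directly."""
    sql = f"CREATE OR REPLACE TABLE analytics.{metric} AS {METRICS[metric]}"
    client.query(sql).result()

# Example: materialize(bigquery.Client(), "daily_revenue")
```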

Productizing Big Data and Business Intelligence

The final trend Boris explored was the productization of big data solutions. He likened this to Intel’s microprocessor revolution, where standardized components accelerated hardware development. Companies like Absorber offer “data platform as a service,” enabling rapid construction of data pipelines through drag-and-drop interfaces. While limited for complex use cases, such solutions cater to organizations seeking quick deployment. Boris also discussed the rise of serverless business intelligence (BI) tools, which support ELT and allow cross-cloud data queries. These tools, like Mode and Tableau, enable self-service analytics, reducing the need for custom platforms in early stages.


[DevoxxPL2022] Before It’s Too Late: Finding Real-Time Holes in Data • Chayim Kirshen

Chayim Kirshen, a veteran of the startup ecosystem and client manager at Redis, captivated audiences at Devoxx Poland 2022 with a dynamic exploration of real-time data pipeline challenges. Drawing from his experience with high-stakes environments, including a 2010 stock exchange meltdown, Chayim outlined strategies to ensure data integrity and performance in large-scale systems. His talk provided actionable insights for developers, emphasizing the importance of storing raw data, parsing in real time, and leveraging technologies like Redis to address data inconsistencies.

The Perils of Unclean Data

Chayim began with a stark reality: data is rarely clean. Recounting a 2010 incident in which hackers compromised a major stock exchange’s API, he highlighted the cascading effects of unreliable data on real-time markets. Data pipelines contend with inconsistent formats (CSV, JSON, XML), changing sources (e.g., shifting API endpoints), and unreliable services, with even modern systems often tolerating over a thousand minutes of downtime annually. These challenges disrupt real-time processing, which is critical for applications like stock exchanges or ad-bidding networks that require sub-100ms responses. Chayim advocated treating data as programmable code, enabling developers to address issues systematically rather than reactively.

Building Robust Data Pipelines

To tackle these issues, Chayim proposed a structured approach to data pipeline design. Storing raw data indefinitely, whether in S3, Redis, or other storage, ensures there is always a fallback for reprocessing. Parsing data in real time against defined schemas makes it immediately usable while the raw inputs are preserved. Bulk changes, such as SQL bulk inserts or Redis pipelines, reduce network overhead, which is critical for high-throughput systems. Chayim emphasized scheduling regular backfills to re-import historical data, ensuring consistency despite source changes; a stock exchange’s ticker symbol updates (e.g., Fitbit to Google), for example, require ongoing reprocessing to maintain accuracy. Horizontal scaling with disposable nodes enhances availability and resilience by avoiding single points of failure.
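A minimal sketch of that shape, assuming redis-py and a hypothetical tick feed: each raw payload is kept indefinitely for replay, parsed in real time against a defined schema, and written in bulk through a Redis pipeline to cut network round trips.

```python
import json
from dataclasses import asdict, dataclass

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

@dataclass
class Tick:
    """Defined schema used for real-time parsing; raw input is kept as-is."""
    symbol: str
    price: float
    ts: int

def ingest(raw_messages: list[str]) -> None:
    pipe = r.pipeline()  # bulk writes: one round trip for the whole batch
    for raw in raw_messages:
        # 1. Store the raw payload indefinitely so it can always be replayed.
        pipe.rpush("raw:ticks", raw)
        # 2. Parse in real time against the schema for immediate usability.
        doc = json.loads(raw)
        tick = Tick(symbol=str(doc["symbol"]),
                    price=float(doc["price"]),
                    ts=int(doc["ts"]))
        pipe.set(f"tick:{tick.symbol}:{tick.ts}", json.dumps(asdict(tick)))
    pipe.execute()
```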

Real-Time Enrichment and Redis Integration

Data enrichment, such as calculating stock bid-ask spreads or market cap changes, should occur post-ingestion to avoid slowing the pipeline. Chayim showcased Redis, particularly its Gears and JSON modules, for real-time data processing. Redis acts as a buffer, storing raw JSON and replicating it to traditional databases like PostgreSQL or MySQL. Using Redis Gears, developers can execute functions within the database, minimizing network costs and enabling rapid enrichment. For instance, calculating a stock’s daily percentage change can run directly in Redis, streamlining analytics. Chayim highlighted Python-based tools like Celery for scheduling backfills and enrichments, allowing asynchronous processing and failure retries without disrupting the main pipeline.
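The fragment below sketches the enrichment side, assuming redis-py with the RedisJSON module available and Celery as the scheduler; the daily-percentage-change metric, the key layout, and the broker URL are illustrative assumptions rather than details from the talk.

```python
import redis
from celery import Celery

app = Celery("enrichment", broker="redis://localhost:6379/0")
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

@app.task(bind=True, max_retries=3)
def enrich_daily_change(self, symbol: str) -> None:
    """Compute a stock's daily percentage change after ingestion, outside the
    hot write path, retrying on failure instead of blocking the pipeline."""
    try:
        doc = r.json().get(f"stock:{symbol}")  # e.g. {"open": 121.0, "last": 123.5}
        change = (doc["last"] - doc["open"]) / doc["open"] * 100
        r.json().set(f"stock:{symbol}", "$.daily_change_pct", change)
    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)

# Periodic backfills and enrichments can be wired up with Celery beat, e.g.:
# app.conf.beat_schedule = {
#     "nightly-enrichment": {"task": "enrichment.enrich_daily_change",
#                            "schedule": 24 * 60 * 60, "args": ("GOOG",)},
# }
```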

Scaling and Future-Proofing

Chayim stressed horizontal scaling to distribute workloads geographically, placing data closer to users for low-latency access, as seen in ad networks. By using Redis for real-time writes and offloading to workers via Celery, developers can manage millions of daily entries, such as stock ticks, without performance bottlenecks. Scheduled backfills address data gaps, like API schema changes (e.g., integer to string conversions), by reprocessing raw data. This approach, combined with infrastructure-as-code tools like Terraform, ensures scalability and adaptability, allowing organizations to focus on business logic rather than data management overhead.
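As one sketch of such a backfill, reusing the hypothetical Redis key layout above, the job below replays stored raw payloads in batches and normalizes a field whose type drifted from integer to string.

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def backfill_ticks(batch_size: int = 1000) -> None:
    """Re-import historical raw data, normalizing schema drift as we go."""
    total = r.llen("raw:ticks")
    for start in range(0, total, batch_size):
        batch = r.lrange("raw:ticks", start, start + batch_size - 1)
        pipe = r.pipeline()
        for raw in batch:
            doc = json.loads(raw)
            # Older payloads carried an integer id; newer ones a string symbol.
            doc["symbol"] = str(doc["symbol"])
            pipe.set(f"tick:{doc['symbol']}:{doc['ts']}", json.dumps(doc))
        pipe.execute()
```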
