[DevoxxPL2022] Accelerating Big Data: Modern Trends Enable Product Analytics • Boris Trofimov
Boris Trofimov, a big data expert from Sigma Software, delivered an insightful presentation at Devoxx Poland 2022, exploring modern trends in big data that enhance product analytics. With experience building high-load systems like the AOL data platform for Verizon Media, Boris provided a comprehensive overview of how data platforms are evolving. His talk covered architectural innovations, data governance, and the shift toward serverless and ELT (Extract, Load, Transform) paradigms, offering actionable insights for developers navigating the complexities of big data.
The Evolving Role of Data Platforms
Boris began by demystifying big data, often misconstrued as a magical solution for business success. He clarified that big data resides within data platforms, which handle ingestion, processing, and analytics. These platforms typically include data sources, ETL (Extract, Transform, Load) pipelines, data lakes, and data warehouses. Boris highlighted the growing visibility of big data beyond its traditional boundaries, with data engineers playing increasingly critical roles. He noted the rise of cross-functional teams, inspired by Martin Fowler’s ideas, where subdomains drive team composition, fostering collaboration between data and backend engineers.
The convergence of big data and backend practices was a key theme. Boris pointed to technologies like Apache Kafka and Spark, which are now shared across both domains, enabling mutual learning. He emphasized that modern data platforms must balance complexity with efficiency, and that without specialized expertise, inadequate engineering practices can cause projects to fail.
Architectural Innovations: From Lambda to Delta
Boris delved into big data architectures, starting with the Lambda architecture, which splits processing into a batch layer and a speed (real-time) layer and merges their results when serving queries, an arrangement aimed at high availability. While effective, Lambda’s two parallel code paths increase development and maintenance costs. As an alternative, he introduced the Kappa architecture, which simplifies processing by using a single streaming layer, reducing latency but potentially sacrificing availability. Boris then highlighted the emerging Delta architecture, which leverages data lakehouses, hybrid systems that combine data lakes and warehouses. Technologies like Snowflake and Databricks support Delta, minimizing data hops and enabling both batch and streaming workloads with a single storage layer.
The Delta architecture’s rise reflects the growing popularity of data lakehouses, which Boris praised for their ability to handle raw, processed, and aggregated data efficiently. By reducing technological complexity, Delta enables faster development and lower maintenance, making it a compelling choice for modern data platforms.
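To make the contrast concrete, here is a minimal sketch in plain Python. It is not code from the talk: the event data and the function names (`batch_view`, `speed_view`, `lambda_query`, `single_layer_query`) are hypothetical, and in-memory lists stand in for real storage. It shows that a Lambda-style design has to maintain two views and merge them at query time, while a design with a single storage layer answers the same question from one table.

```python
from collections import Counter

# Hypothetical click events as (event_id, page); illustrative data only.
HISTORICAL = [(1, "home"), (2, "pricing"), (3, "home")]
RECENT = [(4, "home"), (5, "docs")]


def batch_view(events):
    """Batch layer: recompute page counts over all historical events."""
    return Counter(page for _, page in events)


def speed_view(events):
    """Speed layer: incrementally count events that arrived after the last batch run."""
    view = Counter()
    for _, page in events:
        view[page] += 1
    return view


def lambda_query(batch, speed):
    """Lambda-style serving step: merge the two views at query time."""
    return batch + speed


def single_layer_query(table):
    """Single storage layer: batch and streaming writers append to one table,
    so a query reads it directly and no merge step is needed."""
    return Counter(page for _, page in table)


lambda_result = lambda_query(batch_view(HISTORICAL), speed_view(RECENT))
single_result = single_layer_query(HISTORICAL + RECENT)
```

Both paths produce the same counts here; the difference is that the Lambda path carries two implementations of the counting logic that must be kept consistent.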
Data Mesh and Governance
Boris introduced data mesh as a response to monolithic data architectures, drawing parallels with domain-driven design. Data mesh advocates for breaking down data platforms into bounded contexts, each owned by a dedicated team responsible for its pipelines and decisions. This approach avoids the pitfalls of monolithic pipelines, such as chaotic dependencies and scalability issues. Boris outlined four “temptations” to avoid: building monolithic pipelines, combining all pipelines into one application, creating chaotic pipeline networks, and mixing domains in data tables. Data mesh, he argued, promotes modularity and ownership, treating data as a product.
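One way to picture the ownership boundaries is a small check that flags pipelines reading another domain's private tables instead of its published outputs. The sketch below is an illustration under assumptions, not something shown in the talk: the `DataProduct` structure, the table names, and the `boundary_violations` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    name: str             # bounded context, e.g. "orders"
    owner_team: str       # team accountable for the product
    published: frozenset  # tables offered to other domains
    internal: frozenset   # tables private to the owning domain
    reads: frozenset      # tables this product's pipelines consume


def boundary_violations(products):
    """Return (consumer, table, owner) for every read of another product's internal table."""
    violations = []
    for consumer in products:
        for other in products:
            if other.name == consumer.name:
                continue
            for table in sorted(consumer.reads & other.internal):
                violations.append((consumer.name, table, other.name))
    return violations


# Hypothetical products: marketing reads one published and one internal orders table.
orders = DataProduct(
    name="orders",
    owner_team="orders-team",
    published=frozenset({"orders.daily"}),
    internal=frozenset({"orders.staging"}),
    reads=frozenset(),
)
marketing = DataProduct(
    name="marketing",
    owner_team="marketing-team",
    published=frozenset({"marketing.campaigns"}),
    internal=frozenset(),
    reads=frozenset({"orders.daily", "orders.staging"}),
)

violations = boundary_violations([orders, marketing])
```

In this example the read of `orders.daily` is fine because it is a published output, while the read of `orders.staging` is reported as crossing a domain boundary.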
Data governance, or “data excellence,” was another critical focus. Boris stressed the importance of practices like data monitoring, quality validation, and retention policies. He advocated for a proactive approach, where engineers address these concerns early to ensure platform reliability and cost-efficiency. By treating data governance as a checklist, teams can mitigate risks and enhance platform maturity.
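The three practices named above can be sketched as small, independent checks. This is a minimal illustration in plain Python, assuming rows are dictionaries with a `ts` timestamp and a `user_id` field; the field names, thresholds, and function names are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta, timezone


def validate_rows(rows, required_fields):
    """Quality validation: separate rows that have all required fields from those that do not."""
    valid, rejected = [], []
    for row in rows:
        if all(row.get(name) is not None for name in required_fields):
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected


def is_fresh(rows, now, max_lag):
    """Monitoring: check that the newest row arrived within the allowed lag."""
    if not rows:
        return False
    newest = max(row["ts"] for row in rows)
    return now - newest <= max_lag


def apply_retention(rows, now, keep_for):
    """Retention: keep only rows that are not older than the retention window."""
    cutoff = now - keep_for
    return [row for row in rows if row["ts"] >= cutoff]


# Fixed reference time so the example is deterministic.
now = datetime(2022, 6, 23, 12, 0, tzinfo=timezone.utc)
rows = [
    {"user_id": "a", "ts": now - timedelta(hours=1)},
    {"user_id": "b", "ts": now - timedelta(days=40)},
    {"user_id": None, "ts": now - timedelta(hours=2)},  # fails validation
]

valid, rejected = validate_rows(rows, ["user_id", "ts"])
fresh = is_fresh(valid, now, max_lag=timedelta(hours=6))
retained = apply_retention(valid, now, keep_for=timedelta(days=30))
```

Keeping each concern in its own function mirrors the checklist idea: a team can adopt the checks one at a time and report on each separately.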
Serverless and ELT: Simplifying Big Data
Boris highlighted the shift toward serverless technologies and ELT paradigms. Serverless solutions, available across transformation, storage, and analytics tiers, reduce infrastructure management burdens, allowing faster time-to-market. He cited AWS and other cloud providers as enablers, noting that serverless is not always the cheapest option but does minimize maintenance effort. Similarly, ELT, where transformation occurs after the data is loaded into a warehouse, leverages modern databases like Snowflake and BigQuery. Unlike traditional ETL, ELT reduces latency and complexity by using the database’s own capabilities for transformations, making it ideal for early-stage projects.
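The ELT flow can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module as a stand-in for a cloud warehouse such as Snowflake or BigQuery, which is an assumption made only to keep the sketch runnable; the table names and sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse (assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in the warehouse untouched, amounts still as text.
conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, amount TEXT)")
raw_rows = [
    ("u1", "purchase", "10.50"),
    ("u1", "view", ""),
    ("u2", "purchase", "4.00"),
    ("u1", "purchase", "2.50"),
]
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_rows)

# Transform: filtering, casting, and aggregation run inside the database, in SQL.
conn.execute(
    """
    CREATE TABLE revenue_by_user AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_events
    WHERE event = 'purchase'
    GROUP BY user_id
    """
)

revenue = dict(conn.execute("SELECT user_id, revenue FROM revenue_by_user"))
```

Because the raw table is kept as loaded, a changed business rule only requires rerunning the SQL transformation, not re-ingesting the data.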
Boris also noted the resurgence of SQL as a domain-specific language across big data tiers, from transformation to governance. By building frameworks that express business logic in SQL, developers can accelerate feature delivery, despite SQL’s perceived limitations. He emphasized that well-designed SQL queries can be powerful, provided engineers avoid poorly structured code.
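A framework of this kind can be very small. The following sketch, again using `sqlite3` as a stand-in database, keeps the business logic in named SQL statements and leaves Python with nothing to do except run them. The `METRICS` registry, the `compute_metrics` helper, and the sample table are hypothetical illustrations rather than anything presented in the talk.

```python
import sqlite3

# Hypothetical metric registry: each business rule is a named SQL query
# that returns a single scalar value.
METRICS = {
    "active_users": "SELECT COUNT(DISTINCT user_id) FROM events",
    "purchases": "SELECT COUNT(*) FROM events WHERE event = 'purchase'",
}


def compute_metrics(conn, metrics):
    """Run each named query and collect its scalar result by metric name."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in metrics.items()}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "view"), ("u1", "purchase"), ("u2", "view")],
)

results = compute_metrics(conn, METRICS)
```

Adding a metric then means adding one SQL entry to the registry, with no change to the surrounding code.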
Productizing Big Data and Business Intelligence
The final trend Boris explored was the productization of big data solutions. He likened this to Intel’s microprocessor revolution, where standardized components accelerated hardware development. Companies like Absorber offer “data platform as a service,” enabling rapid construction of data pipelines through drag-and-drop interfaces. While limited for complex use cases, such solutions cater to organizations seeking quick deployment. Boris also discussed the rise of serverless business intelligence (BI) tools, which support ELT and allow cross-cloud data queries. These tools, like Mode and Tableau, enable self-service analytics, reducing the need for custom platforms in early stages.