[ScalaDaysNewYork2016] Large-Scale Graph Analysis with Scala and Akka

Ben Fonarov, a Big Data specialist at Capital One, presented a compelling case study at Scala Days New York 2016 on building a large-scale graph analysis engine using Scala, Akka, and HBase. Ben detailed the architecture and implementation of Athena, a distributed time-series graph system designed to deliver integrated, real-time data to enterprise users, addressing the challenges of data overload in a banking environment.

Addressing Enterprise Data Needs

Ben Fonarov opened by outlining the motivation behind Athena: the need to provide integrated, real-time data to users at Capital One. Rather than presenting data as tables, Athena represents it as a graph, modeling entities such as accounts and transactions so that the data aligns with business concepts. Ben highlighted the problem of data overload, with multiple data warehouses and ETL processes generating vast datasets. Athena’s visual interface lets users define graph schemas, so data is accessible in a format that matches their mental models.
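To make the idea of a time-series graph concrete, here is a minimal sketch in plain Scala. It is not Athena's schema: the `Vertex`, `Edge`, and `Graph` types, the `made` edge label, and the sample identifiers are all hypothetical, chosen only to show accounts and transactions as vertices connected by timestamped edges.

```scala
import java.time.Instant

// Hypothetical model for illustration; not Athena's actual schema.
final case class Vertex(id: String, label: String)
final case class Edge(from: String, to: String, label: String, at: Instant)

final case class Graph(vertices: Map[String, Vertex], edges: Vector[Edge]) {
  // Vertices reachable from `id` over edges with the given label,
  // restricted to edges timestamped within [start, end).
  def neighbours(id: String, label: String, start: Instant, end: Instant): Vector[Vertex] =
    edges
      .filter(e => e.from == id && e.label == label && !e.at.isBefore(start) && e.at.isBefore(end))
      .flatMap(e => vertices.get(e.to))
}

object GraphModelExample {
  val account = Vertex("acct-1", "Account")
  val txnA = Vertex("txn-1", "Transaction")
  val txnB = Vertex("txn-2", "Transaction")

  val graph = Graph(
    vertices = Seq(account, txnA, txnB).map(v => v.id -> v).toMap,
    edges = Vector(
      Edge("acct-1", "txn-1", "made", Instant.parse("2016-05-01T10:00:00Z")),
      Edge("acct-1", "txn-2", "made", Instant.parse("2016-05-09T10:00:00Z"))
    )
  )

  // Transactions made by the account during the first week of May 2016.
  val firstWeek: Vector[Vertex] = graph.neighbours(
    "acct-1", "made",
    Instant.parse("2016-05-01T00:00:00Z"),
    Instant.parse("2016-05-08T00:00:00Z")
  )
}
```

Putting the timestamp on the edge rather than the vertex is one possible design choice: it lets the same account be queried over different time windows without duplicating vertices.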

Architectural Considerations

Ben described two architectural approaches to building Athena. The naive implementation used a single actor to process queries, which could not keep up with production-scale loads. The robust solution used an Akka cluster to distribute query processing across nodes. A query parser translated user requests into graph traversals, while actors managed tasks and streamed results back to users. According to Ben, this design kept latency low while scaling to graphs of up to 200 billion nodes.
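The difference between the two approaches can be sketched without Akka. The example below is an assumption-laden stand-in, not Athena's code: it uses standard-library `Future`s in place of cluster actors, and the `Adjacency` type, the hash-based `partitionOf` rule, and the function names are hypothetical. It shows one traversal step being split by partition so that each share of the frontier is expanded in parallel.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartitionedQuery {
  // Hypothetical in-memory adjacency list: vertex id -> outgoing neighbour ids.
  type Adjacency = Map[String, Vector[String]]

  // Hypothetical partitioning rule: a vertex belongs to the partition chosen by its id hash.
  def partitionOf(id: String, partitions: Int): Int =
    Math.floorMod(id.hashCode, partitions)

  // One traversal step: group the frontier by partition, expand each group
  // in its own Future (standing in for a worker actor), then merge the results.
  def expand(graph: Adjacency, frontier: Set[String], partitions: Int): Set[String] = {
    val work: Seq[Future[Set[String]]] =
      frontier.groupBy(partitionOf(_, partitions)).values.toSeq.map { ids =>
        Future(ids.flatMap(id => graph.getOrElse(id, Vector.empty)))
      }
    Await.result(Future.sequence(work), 10.seconds).flatten.toSet
  }

  // Vertices reached after exactly `hops` steps from `start`.
  def traverse(graph: Adjacency, start: String, hops: Int, partitions: Int): Set[String] =
    (1 to hops).foldLeft(Set(start))((frontier, _) => expand(graph, frontier, partitions))
}
```

For a graph `Map("a" -> Vector("b", "c"), "b" -> Vector("d"), "c" -> Vector("d"))`, calling `PartitionedQuery.traverse(graph, "a", 2, 4)` follows two hops from `a`. Setting `partitions` to 1 reduces this to the single-worker case, which is the analogue of the naive single-actor design.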

Streaming and Optimization

A key feature of Athena, Ben explained, is its ability to stream results in real time, avoiding the batch-processing limitations he attributed to frameworks like TinkerPop’s Gremlin. Using Akka’s actor-based concurrency, Athena processes queries incrementally, delivering results as they are computed. Ben also discussed optimizations, such as limiting the number of nodes per actor to prevent bottlenecks, and plans to integrate graph algorithms like PageRank to enhance analytical capabilities.
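Incremental delivery can be illustrated with a lazy traversal. The sketch below is not how Athena streams results (the talk describes actors, which are not used here); it relies on a plain Scala `Iterator`, and the `StreamingTraversal` object and `Adjacency` type are hypothetical. Each vertex is produced only when the consumer asks for it, so the first results arrive before the traversal has finished.

```scala
import scala.collection.mutable

object StreamingTraversal {
  // Hypothetical in-memory adjacency list: vertex id -> outgoing neighbour ids.
  type Adjacency = Map[String, Vector[String]]

  // Lazily yields vertices in breadth-first order. A vertex's neighbours are
  // only expanded when that vertex is pulled from the iterator.
  def bfs(graph: Adjacency, start: String): Iterator[String] = new Iterator[String] {
    private val queue = mutable.Queue(start)
    private val seen = mutable.Set(start)

    def hasNext: Boolean = queue.nonEmpty

    def next(): String = {
      val current = queue.dequeue()
      // `seen.add` returns true only for vertices not visited before.
      graph.getOrElse(current, Vector.empty).foreach { n =>
        if (seen.add(n)) queue.enqueue(n)
      }
      current
    }
  }
}
```

A consumer can stop early, for example with `StreamingTraversal.bfs(graph, "a").take(2)`, and the rest of the graph is never visited. Calling `.grouped(n)` on the iterator yields bounded batches, which loosely mirrors the idea of capping how many nodes a single worker handles at once.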

Future Directions and Community Engagement

Ben concluded by sharing future plans for Athena, including adopting a Gremlin-like DSL for graph traversals and integrating with tools like Spark and H2O. He emphasized the importance of community feedback, inviting developers to join Capital One’s data team to contribute to Athena’s evolution. Running on AWS EC2, Athena represents a scalable solution for enterprise graph analysis, poised to transform how banks handle complex data relationships.
