[DevoxxBE2013] Building Hadoop Big Data Applications
Tom White, an Apache Hadoop committer and author of Hadoop: The Definitive Guide, explores the complexities of building big data applications with Hadoop. As an engineer at Cloudera, Tom introduces the Cloudera Development Kit (CDK), an open-source project simplifying Hadoop application development. His session navigates common pitfalls, best practices, and CDK’s role in streamlining data processing across Hadoop’s ecosystem.
Hadoop’s growth has introduced diverse components like Hive and Impala, challenging developers to choose appropriate tools. Tom demonstrates CDK’s unified abstractions, enabling seamless integration across engines, and shares practical examples of low-latency queries and fault-tolerant batch processing.
Navigating Hadoop’s Ecosystem
Tom outlines why Hadoop is hard to navigate: HDFS, MapReduce, Hive, and Impala each serve a distinct purpose, and a schema declared for one tool can drift out of step with the same data as seen by another. CDK addresses this by letting a single dataset definition serve both Hive and Impala.
Because the schema lives in one place, Tom shows, there are fewer mismatches to debug, which streamlines development.
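To make the "single dataset definition" idea concrete, here is a minimal sketch in plain Java. It does not use the CDK API: the `DatasetDefinition` class, its `field` and `toCreateTable` methods, and the `events` table are hypothetical names invented for illustration. The point is only that one definition can produce one table statement that both engines accept, so the schema cannot diverge between them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration only: this is NOT the CDK API.
// One definition of a dataset, rendered to a basic CREATE TABLE
// statement of the form that both Hive and Impala accept.
final class DatasetDefinition {
    private final String name;
    // LinkedHashMap keeps columns in the order they were declared.
    private final Map<String, String> fields = new LinkedHashMap<>();

    DatasetDefinition(String name) {
        this.name = name;
    }

    // Adds a column and returns this, so calls can be chained.
    DatasetDefinition field(String fieldName, String sqlType) {
        fields.put(fieldName, sqlType);
        return this;
    }

    // Renders the single definition as a table statement.
    String toCreateTable() {
        StringJoiner columns = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            columns.add(f.getKey() + " " + f.getValue());
        }
        return "CREATE TABLE " + name + " " + columns;
    }

    public static void main(String[] args) {
        DatasetDefinition events = new DatasetDefinition("events")
            .field("id", "BIGINT")
            .field("message", "STRING");
        // prints: CREATE TABLE events (id BIGINT, message STRING)
        System.out.println(events.toCreateTable());
    }
}
```

In a real application the CDK would own this definition; the sketch only shows why a single source of truth removes the chance of two hand-written schemas disagreeing.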
Best Practices for Application Development
Tom advocates defining datasets in Java, ensuring compatibility across engines. He demonstrates CDK’s API, creating a dataset accessible by both Hive’s batch transforms and Impala’s low-latency queries.
Best practices include modular schemas and automated metadata synchronization, minimizing manual refreshes.
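As a rough sketch of what automated metadata synchronization might look like, the class below queues the Impala statements that would otherwise be typed by hand. `INVALIDATE METADATA` and `REFRESH` are real Impala statements, but the `MetadataSync` class and its methods are hypothetical, are not part of the CDK, and stop short of actually sending anything to Impala.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not CDK code: collects the Impala statements
// needed to bring its cached metadata back in line after changes
// made outside Impala (for example by a Hive job).
final class MetadataSync {
    private final List<String> pendingStatements = new ArrayList<>();

    // Call after a table is created or altered outside Impala.
    void tableChanged(String tableName) {
        pendingStatements.add("INVALIDATE METADATA " + tableName);
    }

    // Call after new data files are added to an existing table.
    void dataAppended(String tableName) {
        pendingStatements.add("REFRESH " + tableName);
    }

    // Returns the queued statements in order and clears the queue.
    // A real implementation would execute them against Impala.
    List<String> drain() {
        List<String> out = new ArrayList<>(pendingStatements);
        pendingStatements.clear();
        return out;
    }

    public static void main(String[] args) {
        MetadataSync sync = new MetadataSync();
        sync.tableChanged("events");
        sync.dataAppended("events");
        for (String statement : sync.drain()) {
            System.out.println(statement);
        }
    }
}
```

Hooking calls like these into the code path that writes a dataset is one way to avoid relying on someone remembering to refresh manually.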
CDK’s Role in Simplifying Development
The CDK, Tom explains, centralizes dataset management. A live demo shows data being indexed for Impala’s millisecond-range queries and processed by Hive’s fault-tolerant ETL jobs. The abstraction raises productivity by letting developers concentrate on application logic rather than on per-engine setup.
Tom notes ongoing CDK improvements, like automatic metastore refreshes, enhancing usability.
Choosing Between Hive and Impala
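Tom contrasts the two engines: Impala answers queries with low latency but is not fault tolerant, so a query that fails partway must be rerun, while Hive is built for robust batch processing. Impala suits ad-hoc summaries; Hive suits ETL transforms, where fault tolerance matters more than response time.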
He demonstrates a CDK dataset serving both, offering flexibility for diverse workloads.
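The rule of thumb above is simple enough to write down as code. The sketch below is a hypothetical helper, not something from the CDK or the talk's demo: the `EngineChooser` class and its two enums are invented names that encode only the two workload types the session mentions.

```java
// Hypothetical helper, not part of CDK: encodes the rule of thumb
// that ad-hoc summaries go to Impala and ETL transforms go to Hive.
final class EngineChooser {
    enum Workload { AD_HOC_SUMMARY, ETL_TRANSFORM }
    enum Engine { IMPALA, HIVE }

    static Engine choose(Workload workload) {
        switch (workload) {
            case AD_HOC_SUMMARY:
                // Low latency matters; a failed query can simply be rerun.
                return Engine.IMPALA;
            case ETL_TRANSFORM:
                // Long-running batch work benefits from fault tolerance.
                return Engine.HIVE;
            default:
                throw new IllegalArgumentException("Unknown workload: " + workload);
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(Workload.AD_HOC_SUMMARY)); // prints IMPALA
        System.out.println(choose(Workload.ETL_TRANSFORM));  // prints HIVE
    }
}
```

Because both engines read the same dataset, switching a workload from one to the other changes only which engine runs the query, not where the data lives.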