May 27, 2009 SDF Hadoop, MapReduce, Cascading


On May 27, 2009, in Palo Alto, the SDForum SAM SIG presented “Hadoop Architecture, MapReduce Patterns, and Best Practices with Cascading” by Chris K. Wensel, the founder of Concurrent Inc.

“Abstract: A rapid introduction to Hadoop architecture, MapReduce patterns, and best practices with Cascading.

Hadoop is an open source implementation of the Google MapReduce processing model and has been widely embraced by startups and established companies like Yahoo! and Amazon. Cascading, also an open source project, is an alternative API to MapReduce that allows developers to rapidly create sophisticated applications on the Hadoop platform.

Unfortunately, the MapReduce model can be very complex to manipulate when attempting to perform tasks developers take for granted when using relational-style databases, like joins and secondary sorting of grouped values.

Further, integrating Hadoop with external systems requires a deep knowledge of its internals. But this is where Hadoop clusters offer the most value: off-loading data cleansing and data migration tasks from traditional tools and expensive, load-sensitive systems.

Cascading is an API that replaces the “Map” and “Reduce” primitives and their associated Key/Value algebra with functions, filters, and aggregators, and links them all together with a familiar columns and records model. It also provides key processing primitives familiar to developers.

In this presentation, we will present the Hadoop architecture, how MapReduce influences that architecture and is used for common tasks, and how Cascading helps developers rapidly build sophisticated data processing and orchestration applications that can be very simply tested and executed.”
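To give a feel for why the abstract calls joins awkward in plain MapReduce, here is a minimal sketch of a reduce-side join, simulated in ordinary Python rather than on Hadoop. It is not taken from the talk and it does not use the Hadoop or Cascading APIs; the function names (map_users, map_orders, shuffle, reduce_join, run_join) and the sample records are hypothetical, chosen only for illustration. The point is that each mapper has to tag its values by source so the reducer can sort them out again after the shuffle groups them by key.

```python
from collections import defaultdict
from itertools import product


def map_users(record):
    # Hypothetical mapper: emit (join key, tagged value) for a user record.
    user_id, name = record
    yield user_id, ("user", name)


def map_orders(record):
    # Hypothetical mapper: emit (join key, tagged value) for an order record.
    order_id, user_id, amount = record
    yield user_id, ("order", (order_id, amount))


def shuffle(pairs):
    # Stand-in for the framework's shuffle phase: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    # Separate the tagged values by source, then pair them up (inner join).
    names = [v for tag, v in values if tag == "user"]
    orders = [v for tag, v in values if tag == "order"]
    for name, (order_id, amount) in product(names, orders):
        yield key, name, order_id, amount


def run_join(users, orders):
    # Run both mappers, shuffle, then reduce each key group in key order.
    pairs = []
    for user in users:
        pairs.extend(map_users(user))
    for order in orders:
        pairs.extend(map_orders(order))
    groups = shuffle(pairs)
    result = []
    for key in sorted(groups):
        result.extend(reduce_join(key, groups[key]))
    return result


# Made-up sample data: (user_id, name) and (order_id, user_id, amount).
users = [(1, "ada"), (2, "bob")]
orders = [(100, 1, 9.5), (101, 1, 3.0), (102, 3, 7.25)]
joined = run_join(users, orders)
```

In a relational database the same result is a one-line JOIN; here the tagging, grouping, and pairing all have to be written by hand, which is the kind of plumbing the abstract says a columns and records model is meant to hide.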


Copyright 2009 DJ Cline All rights reserved.