Cascading provides the means to define arbitrarily large, complex, reusable, and fault-tolerant data processing workflows, and a job planner for rendering those workflows into cluster-executable jobs.
Cascading lets the developer assemble predefined workflow tasks and tools, collect those workflows into a logical 'unit of work', and schedule and execute them efficiently. These processes scale horizontally on clusters running in the local datacenter or on Amazon EC2.
The kinds of tasks and tools that can be built with Cascading range from simple log parsers to modern Natural Language Processing (NLP), from traditional Extract, Transform, and Load (ETL) to data warehousing, and from geophysical to geospatial data management.
Cascading currently relies on Hadoop to provide the storage and execution infrastructure, but the Cascading API insulates developers from the particulars of Hadoop, giving Cascading the opportunity to target different compute frameworks in the future without changes to the original processing workflow definitions.
Those familiar with Hadoop know it is an implementation of the MapReduce programming model, and any developer who has built an application with MapReduce to solve 'real world' problems knows such applications can become complex very quickly. This is further aggravated by the need to 'think' in MapReduce throughout application development.
Thinking in MapReduce is typically unnatural and tends to push the developer into constantly trying to 'optimize' the application, which results in code that is harder to read and more likely to contain bugs. Further, most real-world problems are a collection of dependent MapReduce jobs, and building and orchestrating them all by hand does not scale well.
Cascading uses a 'pipes and filters' model for defining data processes. It efficiently supports splits, joins, grouping, and sorting, and these are the only processing concepts the developer needs to think in.
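As a rough illustration of that model, here is a minimal word-count sketch assuming the classic Cascading 1.x Java API (Pipe, Each, GroupBy, Every, Hfs, FlowConnector); the input and output paths, the class name, and the word-splitting regex are placeholders, not part of the original text.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount
  {
  public static void main( String[] args )
    {
    String inputPath = args[ 0 ];  // hypothetical source text on HDFS
    String outputPath = args[ 1 ]; // hypothetical location for the word counts

    // taps bind the assembly to concrete data locations
    Tap source = new Hfs( new TextLine( new Fields( "line" ) ), inputPath );
    Tap sink = new Hfs( new TextLine(), outputPath, SinkMode.REPLACE );

    // the pipe assembly: split each line into words, group by word, count each group
    Pipe assembly = new Pipe( "wordcount" );
    assembly = new Each( assembly, new Fields( "line" ),
                         new RegexGenerator( new Fields( "word" ), "\\S+" ) );
    assembly = new GroupBy( assembly, new Fields( "word" ) );
    assembly = new Every( assembly, new Count( new Fields( "count" ) ) );

    // the planner renders the assembly into one or more MapReduce jobs
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, WordCount.class );
    Flow flow = new FlowConnector( properties ).connect( "word-count", source, sink, assembly );

    flow.complete(); // run to completion
    }
  }
```

The developer only works with the pipe assembly and the taps; the mapping onto map and reduce phases is left entirely to the planner.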
At runtime, Cascading generates the minimum necessary number of MapReduce jobs and executes them in the correct order, either locally or on a Hadoop cluster. Intermediate files are cleaned up automatically, and if target files already exist and are not stale, the jobs that produce them can optionally be skipped.
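To sketch how that ordering works, the snippet below assumes the Cascading 1.x Cascade API, where a CascadeConnector topologically orders a set of flows by their data dependencies; the two builder methods are hypothetical stand-ins for flows defined elsewhere with a FlowConnector.

```java
import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;

// 'importFlow' writes an intermediate dataset that 'reportFlow' reads;
// both are assumed to have been built elsewhere (hypothetical helpers).
Flow importFlow = buildImportFlow();
Flow reportFlow = buildReportFlow();

// The connector orders the flows by their source/sink dependencies,
// regardless of the order in which they are passed in.
Cascade cascade = new CascadeConnector().connect( reportFlow, importFlow );

// Flows whose sink data already exists and is not stale relative to their
// sources may be skipped when the cascade runs.
cascade.complete();
```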
We firmly believe applications should be built rapidly and designed to be as 'loosely coupled' as possible. Only once an application is working and has sufficient tests should it be optimized to remove any clear bottlenecks. Cascading supports this philosophy.
Cascading is also well suited to 'ad-hoc' applications and scripts that extract data from a Hadoop filesystem, import data from various remote data sources, or simply let a user poke around in various files and datasets.
Developers may also reuse existing Hadoop MapReduce jobs with Cascading, allowing them to participate alongside the MapReduce jobs Cascading plans dynamically on the cluster.
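A hedged sketch of that reuse, assuming the Cascading 1.x MapReduceFlow wrapper around a Hadoop JobConf; the job configuration details and the downstream flow are hypothetical placeholders.

```java
import org.apache.hadoop.mapred.JobConf;

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.MapReduceFlow;

// A pre-existing 'raw' MapReduce job, configured the usual Hadoop way.
JobConf jobConf = new JobConf();
jobConf.setJobName( "legacy-job" );
// ... mapper, reducer, input and output paths set as before ...

// Wrap the JobConf so the planner can treat it like any other flow.
Flow legacyFlow = new MapReduceFlow( "legacy-job", jobConf );

// It can now be scheduled alongside Cascading-planned flows in a cascade
// ('downstreamCascadingFlow' is a hypothetical flow built elsewhere).
Cascade cascade = new CascadeConnector().connect( legacyFlow, downstreamCascadingFlow );
cascade.complete();
```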
Read on to learn about some of Cascading's key features.
Also see our documentation section for the various examples and in-depth tutorials listed on the sidebar.
Cascading is sponsored in part by these vendors:
YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.