Tuesday 8 September 2015

Apache Falcon – Basic Concepts



Apache Falcon uses only three types of entities to describe all data management policies and pipelines. These entities are:

·         Cluster: Represents the “interfaces” to a Hadoop cluster
·         Feed: Defines a “dataset” (a file, a Hive table or a stream)
·         Process: Consumes feeds, invokes processing logic & produces feeds
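For instance, a cluster entity is little more than a list of the cluster’s interface endpoints plus a few working directories. The sketch below is only illustrative: every name, endpoint, version and path is a made-up placeholder, not taken from a real deployment.

<!-- Minimal cluster entity sketch; all names, endpoints, versions and paths are example values -->
<cluster name="primaryCluster" colo="primary-colo" description="Primary Hadoop cluster"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- read-only endpoint, used for example when another cluster reads data during replication -->
    <interface type="readonly"  endpoint="hftp://namenode:50070"            version="2.6.0"/>
    <!-- write endpoint for HDFS -->
    <interface type="write"     endpoint="hdfs://namenode:8020"             version="2.6.0"/>
    <!-- execution endpoint (ResourceManager) where jobs are submitted -->
    <interface type="execute"   endpoint="resourcemanager:8050"             version="2.6.0"/>
    <!-- Oozie server that runs the workflows Falcon generates -->
    <interface type="workflow"  endpoint="http://oozie:11000/oozie/"        version="4.1.0"/>
    <!-- JMS broker Falcon uses for messaging and notifications -->
    <interface type="messaging" endpoint="tcp://activemq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="temp"    path="/tmp"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>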



Using only these three types of entities, we can manage replication, archival and retention of data, and also handle job/process failures and late data arrival.
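Retention, replication and late-arrival handling are all declared on the feed itself. The feed sketch below is again only an assumption-laden example (the feed name, cluster names, dates, paths and limits are placeholders): it keeps hourly data on the source cluster for 90 days, replicates each instance to a backup cluster that retains it for 36 months, and tolerates data arriving up to 4 hours late.

<!-- Feed entity sketch; names, dates, paths and limits are example values -->
<feed name="rawEmailFeed" description="Hourly raw email data" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <!-- how long Falcon waits for data that arrives late -->
  <late-arrival cut-off="hours(4)"/>
  <clusters>
    <!-- source cluster: data lands here and is deleted after 90 days -->
    <cluster name="primaryCluster" type="source">
      <validity start="2015-09-01T00:00Z" end="2016-09-01T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- target cluster: Falcon replicates each instance here and keeps it for 36 months -->
    <cluster name="backupCluster" type="target">
      <validity start="2015-09-01T00:00Z" end="2016-09-01T00:00Z"/>
      <retention limit="months(36)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- date pattern in the path tells Falcon which directory belongs to which instance -->
    <location type="data" path="/data/raw/email/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="/none"/>
</feed>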

These Falcon entities:

  • Are simple to define using XML.
  • Are modular: clusters, feeds and processes are defined separately, linked together by name, and easily reused across multiple pipelines (a linked process sketch follows this list).
  • Can be configured for replication, late data arrival, archival and retention.
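Continuing the illustration, here is a rough sketch of a process entity that ties the pieces together: it runs on the cluster named above, consumes the rawEmailFeed, produces a hypothetical cleansedEmailFeed, and delegates the actual logic to an Oozie workflow stored on HDFS. The retry and late-process policies are how Falcon handles job failures and late data; as before, every name, path and value is a placeholder.

<!-- Process entity sketch; names, paths and policy values are example values -->
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <!-- linked to the cluster entity simply by its name -->
    <cluster name="primaryCluster">
      <validity start="2015-09-01T00:00Z" end="2016-09-01T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <!-- consumes the feed defined earlier -->
    <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <!-- produces another feed, which downstream processes can in turn consume -->
    <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
  </outputs>
  <!-- the processing logic itself is an Oozie workflow on HDFS -->
  <workflow name="emailCleanseWorkflow" version="4.1.0" engine="oozie"
            path="/apps/falcon/workflows/emailCleanse"/>
  <!-- retry policy: how Falcon reacts to job/process failures -->
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
  <!-- late-process policy: re-process the input if data arrives after the run -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/falcon/workflows/lateHandler"/>
  </late-process>
</process>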


Using Falcon, a complicated data pipeline like the one shown below can be simplified to a few Falcon entities, which the Falcon engine itself converts into multiple Oozie workflows.

In my next post, I will explain how to define a Falcon process and the prerequisites for doing so.

Reference - http://www.slideshare.net/Hadoop_Summit/driving-enterprise-data-governance-for-big-data-systems-through-apache-falcon
