Apache Falcon addresses enterprise challenges around Hadoop data replication, business continuity, and lineage tracing by providing a framework for data management and processing. Falcon manages the data lifecycle centrally, facilitates quick replication of data for business continuity and disaster recovery, and provides a foundation for audit and compliance by tracking entity lineage and collecting audit logs.
- What is Falcon?
Falcon allows an enterprise to process a single massive dataset stored in HDFS in multiple ways: with batch, streaming, and interactive applications. As the value of Hadoop data increases, so does the importance of cleaning that data, preparing it for business intelligence tools, and removing it from the cluster when it outlives its useful life.
Falcon simplifies the development and management of data processing pipelines with a higher layer of abstraction, taking complex coding out of data processing applications by providing out-of-the-box data management services.
The Falcon framework leverages other HDP components, such as Oozie, Pig, and HDFS, and enables simplified management by providing a framework to define, deploy, and manage data pipelines.
- How Falcon works
Falcon runs as a standalone server as part of a Hadoop cluster.
A user creates entity specifications and submits them to Falcon through the Command Line Interface (CLI) or REST API. Falcon transforms the entity specifications into repeated actions through a Hadoop workflow scheduler. All of the functions and workflow state management requirements are delegated to the scheduler.
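As an illustration, the CLI submission and scheduling steps might look like the commands below. This is a sketch: the file and entity names are made up for the example, only the `falcon entity` options are real.

```shell
# Submit the entity definitions to Falcon (file names are hypothetical)
falcon entity -type cluster -submit -file primary-cluster.xml
falcon entity -type feed    -submit -file raw-input-feed.xml
falcon entity -type process -submit -file cleanse-process.xml

# Schedule the feed and process; Falcon turns these into
# recurring workflow actions on the Hadoop scheduler
falcon entity -type feed    -name rawInputFeed   -schedule
falcon entity -type process -name cleanseProcess -schedule
```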
The following entities define the parts of the Falcon framework:
- Cluster: Represents the interfaces to a Hadoop cluster
- Feed: Defines a dataset, such as HDFS files or Hive tables, with its location, replication schedule, and retention policy
- Process: Consumes and processes feeds
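To make the entity types concrete, a feed is declared as an XML specification. The sketch below is illustrative only; the feed name, paths, schedule, and retention values are assumptions, not taken from the article.

```xml
<!-- Illustrative feed entity: a daily HDFS dataset with 90-day retention -->
<feed name="rawInputFeed" description="Raw input data" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Retention policy: Falcon evicts instances older than 90 days -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- ${YEAR}/${MONTH}/${DAY} are Falcon's date-partition variables -->
    <location type="data" path="/data/raw/input/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="etl-user" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

A process entity would then reference this feed as an input and delegate its execution to the workflow scheduler.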
- Falcon’s purpose: Replication
Data replication comes up in most enterprise conversations at some point. The requirements range from the simple “I need multiple copies of my data” to the far more complex “I need certain subsets of staged, intermediate, and presented data replicated among clusters in a failover scenario, and each dataset needs a different retention period.”
These problems are typically solved with custom-built applications, which are time-consuming to build, challenging to maintain, and error-prone. Falcon avoids such custom code: the processing pipeline and replication policies are instead expressed in a simple declarative language.
In the scenario given below, staged data travels through a sequence of processing steps before it is consumed by business intelligence applications. The customer needs a replica of this data in a secondary cluster. Compared to the primary cluster, the secondary cluster is smaller, so only a subset of the data is replicated.
Falcon defines the datasets and the processing workflow, and replicates the data to the secondary cluster at designated points. Falcon also orchestrates and schedules the processing and replication events. The end result is that, in case of failover, the critical staged and presented data is already stored in the secondary cluster.
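This failover scenario can be expressed declaratively as a feed with a source cluster and a target cluster, each with its own retention period. The sketch below is illustrative; the cluster names, paths, frequencies, and retention limits are assumptions made for the example.

```xml
<!-- Illustrative feed replicated from the primary to a smaller secondary cluster -->
<feed name="presentedDataFeed" description="Presented data for BI" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
    <!-- Falcon schedules replication jobs to copy each instance here -->
    <cluster name="secondaryCluster" type="target">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- The smaller failover cluster keeps a shorter history -->
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/presented/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl-user" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Declaring a second cluster of type `target` is what turns replication on: each dataset instance produced on the source is copied to the target on the feed's schedule, while the per-cluster retention limits let the smaller cluster keep a shorter history.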
Join DBA Course to learn more about other technologies and tools.
Stay connected to CRB Tech for more technical updates and information.
Reference site: Hortonworks