Posts

Showing posts from August, 2017

Hadoop Architecture

Hadoop Architecture Edge nodes are the interface between the Hadoop cluster and the outside network, which is why they are sometimes referred to as gateway nodes. Most commonly, edge nodes run client applications and cluster administration tools, and they are often used as staging areas for data being transferred into the Hadoop system. The NameNode is the master and stores only the metadata of HDFS: the directory tree of all files in the file system and the location of each file's blocks across the cluster. DataNodes are responsible for the actual data in HDFS. DataNodes and the NameNode are in constant communication via heartbeats. When a DataNode goes down, it does not affect the availability of data or the cluster; the NameNode arranges re-replication of the blocks managed by the DataNode that is no longer available. The Secondary NameNode periodically reads the file system change logs and applies them to the fsimage file, bringing it up to date. This allows the NameNode to sta…
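The heartbeat-and-re-replication idea above can be illustrated with a toy model. This is a minimal sketch, not Hadoop's actual implementation; the class and method names (`NameNode`, `mark_dead`, the block and node ids) are hypothetical, and the replication factor of 3 is simply the HDFS default:

```python
import time

REPLICATION_FACTOR = 3  # HDFS default

class NameNode:
    """Toy model of NameNode bookkeeping: which DataNodes hold each block."""

    def __init__(self):
        self.block_locations = {}  # block id -> set of live DataNode ids
        self.last_heartbeat = {}   # DataNode id -> timestamp of last heartbeat

    def heartbeat(self, datanode_id):
        # DataNodes report in periodically; a stale timestamp means the node is dead.
        self.last_heartbeat[datanode_id] = time.time()

    def add_block(self, block_id, datanode_ids):
        self.block_locations[block_id] = set(datanode_ids)

    def mark_dead(self, datanode_id):
        # Drop the dead node from every block's replica set and report the
        # blocks that fell below the target replication, so they can be copied
        # to other DataNodes.
        under_replicated = []
        for block, nodes in self.block_locations.items():
            nodes.discard(datanode_id)
            if len(nodes) < REPLICATION_FACTOR:
                under_replicated.append(block)
        return under_replicated

nn = NameNode()
nn.add_block("blk_001", ["dn1", "dn2", "dn3"])
nn.add_block("blk_002", ["dn2", "dn4", "dn5"])
print(nn.mark_dead("dn2"))  # → ['blk_001', 'blk_002']
```

Both blocks had a replica on `dn2`, so losing that node leaves each with two replicas, and the NameNode schedules new copies; the data itself stays available from the surviving replicas throughout.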

Logs & Errors Components of Talend

Logs & Errors Components The Logs & Errors component family allows you to log information about the execution of a Talend Job. Below are the components of Logs & Errors. tLogRow The tLogRow component is part of the Logs & Errors family of components. tLogRow allows you to write data that is flowing through your Job (rows) to the console. This is a useful component to aid debugging: you can simply drop it into a data flow to see a snapshot of the data. tAssert It sends a non-blocking message to tAssertCatcher. tAssertCatcher The tAssertCatcher component is part of the Logs & Errors family of components, and listens for non-blocking messages from tAssert and other die commands. tChronometerStart The tChronometerStart component is part of the Logs & Errors family of components and is used in conjunction with tChronometerStop. tChronometerStop The tChronometerStop component is part of the Logs & Errors family of components…
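What tLogRow and the tChronometer pair do can be sketched outside Talend. This is a conceptual illustration only, not Talend's generated code; the `t_log_row` helper, its separator, and the sample rows are all made up for the example:

```python
import time

def t_log_row(row, separator="|"):
    """Print one row to the console, in the spirit of tLogRow's output."""
    print(separator.join(str(v) for v in row.values()))

# tChronometerStart/tChronometerStop idea: measure how long a subjob takes.
start = time.perf_counter()

rows = [{"id": i, "name": f"user{i}"} for i in range(3)]
for row in rows:
    t_log_row(row)  # snapshot of the data flowing through the job

elapsed_ms = (time.perf_counter() - start) * 1000
print(f"subjob duration: {elapsed_ms:.1f} ms")
```

In a real Job you would place tLogRow inline in the data flow and bracket the timed section with tChronometerStart and tChronometerStop in the component palette, rather than writing code by hand.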

Talend and Informatica!!

Talend and Informatica are both ETL tools whose main task is to move data from Source to Target. Source -->> Target Both tools provide a GUI in which transformation logic is defined. In Talend, data and process flows are implemented together, seamlessly: you create a “Job” that defines the process flow using a wide variety of components, which in turn drive the data flow. In Informatica, by contrast, a mapping is validated and saved in the repository, and then physical connections to the source and target objects are assigned. Talend is up to date on Big Data technologies (e.g. Spark, Hive), whereas Informatica is lagging. Below are Informatica keywords and their Talend equivalents: Repository → Project Repository; Folder → Folder; Workflow → Job; Worklet/Reusable Session → Joblet; Session & Mapping → Components; Transformations → Components; Source and Target – Definitions & Connections → Repository Metadata. My journey with Talend has j…