Data Scientist: Big Data Fundamentals – What is Big Data? and Hadoop Fundamentals

Big data is the collection of structured and unstructured data that is managed in large amounts and is characterized by the four Vs: variety, velocity, veracity, and volume.

Variety refers to the different formats, types, and sources in which data arrives and is processed by open-source big data frameworks such as Hadoop.

Volume refers to the large amount of data, which is normally processed in batch or real-time pipelines by the different big data frameworks.

Veracity refers to the trustworthiness and quality of the data that is being generated, that is, how accurate and reliable it is as it continues to grow.

Velocity indicates the rate at which data is generated and must be considered for proper data management. It involves the time factor and is taken into account in the processing configuration of clusters and other big data core components.

Hadoop Fundamentals

There are three modes in which users can run Hadoop: standalone (local) mode, pseudo-distributed mode (single-node cluster), and fully distributed mode (multi-node cluster).
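
To make this concrete, the mode is largely determined by the fs.defaultFS property in core-site.xml: the local file system (file:///) in standalone mode, a single local HDFS instance (for example hdfs://localhost:9000) in pseudo-distributed mode, and the address of a remote NameNode in a multi-node cluster. The following Java sketch simply prints that property so you can see which mode a given installation is configured for; the localhost address mentioned in the comment is an assumption for illustration.

// Minimal sketch: inspect fs.defaultFS to see which mode the installation is
// configured for (file:/// = standalone, hdfs://localhost:9000 = assumed
// pseudo-distributed example, a remote NameNode address = multi-node cluster).
import org.apache.hadoop.conf.Configuration;

public class CheckHadoopMode {
    public static void main(String[] args) {
        Configuration conf = new Configuration(); // loads core-site.xml from the classpath
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        System.out.println("fs.defaultFS = " + defaultFs);
    }
}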

Hadoop is composed of three main node types that together manage, store, and analyze data: the client node, the master node, and the worker node. Data flows through these nodes whenever we perform any of the actions mentioned above.

Hadoop contains five daemons that are distributed across these three node types: the NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker.

Hadoop Core Components

Client Node

The client node is where a set of configurations is established to properly load data into Hadoop and to retrieve the data once it has been processed.
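
As a rough sketch of what happens on the client node, the snippet below uses the HDFS FileSystem API to upload a local file into Hadoop and to download a result once processing has finished. The NameNode address and the file paths are assumptions chosen for illustration.

// Minimal sketch of a client node's role: load data into HDFS and retrieve a result.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Load data into Hadoop: upload a local file to HDFS (assumed example paths).
        fs.copyFromLocalFile(new Path("/tmp/input.csv"), new Path("/data/input.csv"));

        // Receive the data once it has been processed: download a result file.
        fs.copyToLocalFile(new Path("/data/output/part-r-00000"), new Path("/tmp/result.txt"));

        fs.close();
    }
}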

Master Node

Master nodes contain two main components that are used to store and supervise the data: the Hadoop Distributed File System (HDFS) and MapReduce. Both components support the functions that the NameNode, Secondary NameNode, and JobTracker perform on the master node. The default ports to connect to some of these daemons are NameNode: 50070, JobTracker: 50030, and TaskTracker: 50060.

HDFS is the storage layer that Hadoop uses to store different types of data in a distributed environment.

The NameNode keeps a record of the data stored on the DataNodes and manages the file metadata, such as names, permissions, and access times. This metadata describes the blocks held across the different DataNodes.
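
The metadata served by the NameNode (path, size, replication factor, block size, modification time) can be inspected from a client with the FileSystem API, as in this sketch; the /data directory is an assumed example path.

// Minimal sketch: list the metadata that the NameNode serves for files under /data.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/data"))) { // assumed directory
            System.out.printf("%s len=%d replication=%d blockSize=%d mtime=%d%n",
                    status.getPath(), status.getLen(), status.getReplication(),
                    status.getBlockSize(), status.getModificationTime());
        }
        fs.close();
    }
}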

The Secondary NameNode periodically checkpoints the NameNode's metadata, which makes it easier to restore the NameNode's state in case of failure; it is not a live standby.

The JobTracker assigns and monitors MapReduce jobs, ensuring that the work is distributed and processed in parallel across the cluster.
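
To see how work gets split into parallel map tasks and combined by reduce tasks, here is a minimal word-count job, the canonical MapReduce example. The class names are arbitrary, and the input and output paths are taken from the command line.

// Minimal word-count job: mappers process input splits in parallel, reducers sum counts per word.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper handles one split of the input and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: all counts for the same word are grouped together and summed.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}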

Worker Node (Slave Nodes)

Worker nodes, or slave nodes, carry out the jobs assigned to them, providing both data storage and computation. Multiple slave nodes can be linked to a master node to perform a specific job. A slave node is composed of a DataNode and a TaskTracker.

The DataNode is a slave to the NameNode; it is responsible for storing the data, and jobs are performed against the data held on the node.

The TaskTracker is a slave to the JobTracker and runs and monitors the tasks executed on the slave node.

YARN stands for Yet Another Resource Negotiator and is composed of a Resource Manager and Node Managers. It ensures job execution by providing an execution environment and adequately managing cluster resources.

The Resource Manager runs on the master node and is responsible for allocating the necessary resources across the nodes managed by the Node Managers.

The Node Manager ensures that tasks are executed on the slave nodes.
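
As a small illustration of this split, the sketch below uses the YarnClient API to ask the Resource Manager for a report of the Node Managers it knows about, including the memory and vcores each one offers; it assumes a yarn-site.xml on the classpath pointing at the Resource Manager.

// Minimal sketch: query the Resource Manager for the Node Managers it manages.
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml for the Resource Manager address
        yarnClient.start();

        // Each NodeReport describes one Node Manager and the resources it offers.
        for (NodeReport node : yarnClient.getNodeReports()) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }

        yarnClient.stop();
    }
}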

Essentially, you now know what big data is, how it is processed, and how large volumes of data are managed using Hadoop. There are other big data tools and frameworks you can use, so we suggest exploring them to compare the best way to manage your data. It is not difficult, but it takes some time to find the right combination.