What is the Hadoop file system:
The Hadoop Distributed File System (HDFS) was developed on the basis of distributed file system design. Compared with other distributed file systems, HDFS is highly fault tolerant and designed to run on low-cost hardware. HDFS can hold a very large amount of data and provides easy access to it. To store such huge data, files are stored across multiple machines. These files are stored redundantly to protect the system from possible data loss in case any machine fails. HDFS also makes the data available to applications for parallel processing.
Features of Hadoop:
- It suits distributed storage and processing.
- It provides a command-line interface to interact with the Hadoop Distributed File System (see the example after this list).
- The built-in web servers of the NameNode and DataNodes let users easily check the status of the cluster.
- It provides streaming access to file system data.
- It also provides file permissions and authentication.
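As a quick illustration of the command-line interface mentioned above, the hdfs dfs utility exposes familiar file system operations; the directory and file names below are only placeholders:
$ hdfs dfs -mkdir -p /user/demo
$ hdfs dfs -put localfile.txt /user/demo/
$ hdfs dfs -ls /user/demo
$ hdfs dfs -cat /user/demo/localfile.txt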
Important components in HDFS Architecture are:
- Blocks
- Name Node
- Data Nodes
1. Blocks:
HDFS is a block-structured file system: each HDFS file is broken into blocks of a fixed size (128 MB by default) which are stored across various DataNodes in the cluster. To access a file on HDFS, multiple DataNodes may need to be contacted, and the list of DataNodes to contact is determined by the file system metadata stored on the NameNode.
HDFS’s fsck command is used to get details of the files and blocks in the file system:
$ hdfs fsck / -files -blocks
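The block size in effect is controlled by the dfs.blocksize property, and fsck can also show which DataNodes hold each block of a particular file; the path below is only a placeholder:
$ hdfs getconf -confKey dfs.blocksize
$ hdfs fsck /user/demo/localfile.txt -files -blocks -locations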
Advantages of Blocks:
- Quick seek time, since blocks are large and of fixed size.
- Ability to store files larger than any single disk, because a file is split into blocks spread over many nodes.
- Fault tolerance, since each block is replicated across multiple DataNodes.
2. Name Node:
The NameNode is the single point of contact for accessing files in the Hadoop Distributed File System. It determines the block IDs and locations needed for data access. Thus, the NameNode plays the master role in the master/slave architecture, whereas the DataNodes act as slaves. The file system metadata is also stored on the NameNode. This metadata contains file names, file permissions, and the locations of each block of every file. The metadata is small in size, so it fits into main memory and is kept in the NameNode's main memory to allow fast access.
Important components of the NameNode are as follows:
- FsImage: The FsImage is a file on the NameNode's local file system. It contains the entire HDFS file system namespace, including the mapping of blocks to files and file system properties.
- EditLog: The EditLog is a transaction log residing on the NameNode's local file system; it contains a record of every change that occurs to the file system metadata.
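Both files can be inspected offline with the image and edits viewers that ship with Hadoop; the checkpoint file names below are placeholders for whatever files exist in the NameNode's metadata directory:
$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml
$ hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml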
Only one active NameNode is allowed on a cluster at any point in time.
3. Data Nodes:
DataNodes are the slaves in the master/slave architecture, and the actual HDFS file data is stored on them in the form of fixed-size chunks of data called blocks. A DataNode serves read and write requests from clients on HDFS files. It also performs block creation, replication, and deletion.
DataNode Failure Recovery:
Each DataNode in the cluster periodically sends a heartbeat message to the NameNode, and the NameNode uses missing heartbeats to detect DataNode failures. The NameNode considers DataNodes without recent heartbeats dead and does not dispatch any new I/O requests to them. Since the data on a dead DataNode is no longer available to HDFS, its death may cause the replication factor of some blocks to fall below their specified values. The NameNode constantly tracks which blocks must be re-replicated and initiates replication whenever necessary.
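An administrator can check which DataNodes the NameNode currently considers live or dead (along with their disk usage) using the dfsadmin report:
$ hdfs dfsadmin -report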
Working of Hadoop:
Sometimes the storage grows so large that the disks are arranged in different racks connected through switches. If all replicas of a block are stored in the same rack and the switch serving that rack fails, all the replicas become unavailable, defeating the purpose of redundancy. The Hadoop Distributed File System therefore has a feature called rack awareness, through which the NameNode knows which rack each DataNode is in.
Hadoop is also self-healing: if one of the DataNodes goes down, the heartbeat messages from that DataNode to the NameNode cease. After a few minutes, the NameNode considers that DataNode dead; whatever tasks were running on it are rescheduled on other nodes, and its blocks are re-replicated elsewhere.
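Assuming rack awareness has been configured on the cluster, fsck can report the rack of every replica of a file; the path below is only a placeholder:
$ hdfs fsck /user/demo/localfile.txt -files -blocks -racks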
The following mechanisms keep the stored data healthy and well distributed; each is described below:
- Block placement
- Replication management
- Block scanner
When choosing the nodes to write data to, the NameNode follows the replica management policy. The default HDFS replica placement policy is as follows (see the example after this list):
- No DataNode contains more than one replica of any block in the system.
- No rack contains more than two replicas of the same block, provided there are sufficient racks available on the cluster.
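The target number of replicas itself is governed by the dfs.replication setting (3 by default); as a sketch, it can be overridden per file at write time or changed afterwards, with the paths below being placeholders:
$ hdfs dfs -D dfs.replication=2 -put localfile.txt /user/demo/
$ hdfs dfs -setrep -w 3 /user/demo/localfile.txt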
One main responsibility of the NameNode is to ensure that every data block has the proper number of replicas. While processing block reports from the DataNodes, the NameNode detects when a block has become under-replicated or over-replicated. When a block becomes over-replicated, the NameNode chooses a replica to remove.
When a block becomes under-replicated, it is put into a replication priority queue so that more replicas of it can be created; a block with only one remaining replica has the highest priority. A background thread periodically scans the head of the replication queue and decides where to place new replicas, following the block placement policy stated above.
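A cluster-wide count of such blocks appears in the fsck summary; the exact wording of the summary line may differ between Hadoop versions:
$ hdfs fsck / | grep -i 'Under-replicated blocks'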
The Hadoop Distributed File System block placement strategy does not take DataNode disk space utilization into account, and imbalance also occurs when new nodes are added to the cluster. The balancer is a tool that evens out disk space usage across an HDFS cluster. It is deployed as an application program run by the cluster administrator, and it iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.
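For example, the balancer can be run until every DataNode's utilization is within a chosen percentage of the cluster average (10 percent here):
$ hdfs balancer -threshold 10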
Each DataNode runs a block scanner that periodically scans its block replicas and verifies that the stored checksums match the block data. Also, if a client reads a complete block and checksum verification succeeds, it notifies the DataNode, which treats it as a verification of that replica.
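The rescan interval is typically governed by the dfs.datanode.scan.period.hours property (verify the property name against your Hadoop version); its effective value can be read with getconf:
$ hdfs getconf -confKey dfs.datanode.scan.period.hours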
- As HDFS is designed around the notion of "write once, read many times", once a file is written to HDFS it cannot be updated in place; however, delete, append, and read operations can be performed on HDFS files (see the example after this list).
- HDFS is not suitable for large numbers of small files but best suits large files, because the file system namespace maintained by the NameNode is limited by its main memory capacity: the namespace is kept in the NameNode's main memory, and a large number of files results in a big FsImage file.
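As an illustration of these two points, data can be appended to an existing file, and many small files can be packed into a Hadoop archive (HAR) to reduce pressure on the NameNode; all paths and names here are placeholders:
$ hdfs dfs -appendToFile morelines.txt /user/demo/localfile.txt
$ hadoop archive -archiveName small-files.har -p /user/demo/small-files /user/demo/archives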
Learning Hadoop has huge scope in the industry, and companies are searching for candidates who are well-versed in Hadoop concepts. If you are choosing Hadoop as your field, there are many Hadoop certification training courses in Chennai that help you gain knowledge of Hadoop technology, and Hadoop training and placement in Chennai can guide you and strengthen your path toward a bright future.