In this blog, we will discuss about Hadoop federation, Hadoop architecture vs Hadoop Federated architecture and will talk about various issues solved by hdfs federation.
So let us first see why it is gaining so much popularity. To address this question we must know the problems in the existing architecture of Hadoop which led to the creation of Hadoop federation:
1) Availability: If we have a single namenode(master) managing all datanodes slaves) and it fails then the entire cluster goes down.
2) Scalability: With the Hadoop architecture we used to have single namenode and multiple datanodes. We were able to scale datanodes both horizontally and vertically but this was not possible with namenodes which could be scaled only vertically and not horizontally.
3) Isolation: A single NameNode offers no isolation in the multi-user environment. A single sample application can overload the single NameNode and slow down the critical production applications.
4) Performance: The performance of the entire Hadoop System depends on the throughput of the NameNode. Therefore, entire performance of all the HDFS operations depends on how many tasks the NameNode can handle at a particular instant of time.
Introducing HDFS Federation
HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of HDFS cluster to new implementations and use cases.
Namenodes are federated, that is, all these NameNodes work independently and don’t require any coordination with each other.
Comparing Architecture: Hadoop vs Hadoop Federation
HDFS Architecture follows Master/Slave Topology where NameNode acts as a master daemon and is responsible for managing other slave nodes called DataNodes. Namenode stores meta-data i.e. number of blocks, their location, replicas. This meta-data is available in memory in the master for faster retrieval of data. NameNode maintains and manages the slave nodes, and assigns tasks to them.
In the above HDFS architecture, the entire cluster allows only single namespace. In that configuration, Single NameNode manages namespace. If NameNode fails, the cluster as a whole would be out of services. The cluster will be unavailable until the NameNode restarts or brought on a separate machine. This led to the question “Can we have/support multiple namenodes?” and its answer is HDFS Federation. Federation overcomes this limitation by adding support for many NameNode/ Namespaces to HDFS.
The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes.
Many namenodes (NN1, NN2…, NNn) manages many namespaces (NS1, NS2…, NSn) respectively. Each namespace has its own block pool (NS1 Has pool 1and so on). Block from pool 1 is stored on datanode 1 and so on.
In the end, I would say HDFS federation has been introduced to overcome the limitations of earlier HDFS implementation. Adding scalability at the namespace layer is the most important feature of HDFS federation architecture. But HDFS federation is also backward compatible, so the single namenode configuration will also work without any changes.
Keep Learning and Keep Sharing 🙂