Summary
The Hadoop Distributed File System (HDFS) is a core component and underpinning of the Hadoop cluster. HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component (with built-in redundancy at the software level). In this hour, we went into greater detail to understand the internal architecture of HDFS.
We looked at how data gets stored in HDFS, how a file gets divided into one or more data blocks, how blocks are stored across multiple nodes for better performance and fault tolerance, and how the replication factor is maintained. We also looked at the process of writing a file to HDFS and reading a file from HDFS, and we saw what happens behind the scenes. Then we used some commands to store and retrieve files in HDFS.
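As a quick recap, and assuming a running cluster with a local file named sample.txt (the paths and file name here are only illustrative), the basic storage and retrieval commands look like this:

hdfs dfs -mkdir -p /user/hadoop/demo            # create a directory in HDFS
hdfs dfs -put sample.txt /user/hadoop/demo/     # copy a local file into HDFS
hdfs dfs -ls /user/hadoop/demo                  # list the directory contents
hdfs dfs -cat /user/hadoop/demo/sample.txt      # print the file's contents
hdfs dfs -get /user/hadoop/demo/sample.txt .    # copy the file back to the local file system
hdfs fsck /user/hadoop/demo/sample.txt -files -blocks   # show the file's blocks and replication

The fsck report is a convenient way to confirm how a file was split into blocks and whether each block meets the configured replication factor.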
Finally, we looked at the limitations of HDFS in Hadoop 1.0 and how Hadoop 2.0 overcomes them, and we discussed in detail the major HDFS-related features in Hadoop 2.0.