Q&A
Q. What is HDFS, and what are the design goals?
A. HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
- Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
- Fault tolerance—Keeps multiple copies of data to recover from failure.
- Capability to run on commodity hardware—Designed to run on commodity hardware.
- Write once and read many times—Based on a concept of write once, read multiple times, with an assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in the fastest possible way.
- Capability to handle large data sets and streaming data access—Targeted at a small number of very large files, accessed as streams, for the storage of large data sets.
- Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and a TaskTracker (processing component). Processing is done where data exists, to avoid data movement across nodes of the cluster.
- High throughput—Designed for parallel data storage and retrieval.
- HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or application can create directories and recursively store files inside them (see the sketch that follows this list).
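To make the namespace point concrete, here is a minimal sketch of creating a directory tree and a file through the HDFS Java API (org.apache.hadoop.fs). The paths are hypothetical, and the cluster's configuration files are assumed to be on the classpath so that fs.defaultFS points at HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Create nested directories, just as you would on a local file system.
        fs.mkdirs(new Path("/user/alice/logs/2015"));

        // Store a file inside the hierarchy.
        try (FSDataOutputStream out = fs.create(new Path("/user/alice/logs/2015/events.txt"))) {
            out.writeUTF("first record");
        }
        fs.close();
    }
}
```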
Q. In terms of storage, what does a name node contain and what do data nodes contain?
A. HDFS stores and maintains file system metadata and application data separately. The name node (master of HDFS) contains the metadata related to the file system (information about each file, as well as the history of changes to the file metadata). Data nodes (slaves of HDFS) contain application data in a partitioned manner for parallel writes and reads.
The name node keeps the entire metadata, called the namespace (a hierarchy of files and directories), in physical memory for quicker response to client requests. An image of this namespace, called the fsimage, and a transactional record of changes to it, called the edit log, are written to the host OS drives for persistence. The name node responds to multiple client requests simultaneously (it is multithreaded) and provides the information a client needs to connect to data nodes to write or read data. While writing, a file is broken into chunks of 64MB (by default), called blocks. Each block is stored as a separate file on the data nodes, and, based on the replication factor of the file, multiple copies or replicas of each block are stored for fault tolerance.
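As a rough illustration of blocks and replicas, the sketch below writes a file while explicitly specifying a block size and replication factor, using the FileSystem.create overload that accepts them. The path, sizes, and payload are made up for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSettings {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/alice/big-dataset.bin"); // hypothetical path
        long blockSize = 64L * 1024 * 1024;                  // 64MB blocks (the default)
        short replication = 3;                               // three replicas of each block
        int bufferSize = 4096;                               // client-side buffer, in bytes

        // The name node records the file's metadata (name, block size, replication);
        // the data nodes store the actual blocks.
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write(new byte[8 * 1024]);                   // stands in for real data
        }
        fs.close();
    }
}
```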
Q. What is the default data block placement policy?
A. By default, three copies, or replicas, of each block are placed, per the default block placement policy that follows. The objective is a properly load-balanced, fast-access, fault-tolerant file system (the sketch after this list shows how to check where a file's replicas actually land):
- The first replica is written to the data node creating the file.
- The second replica is written to another data node within the same rack.
- The third replica is written to a data node in a different rack.
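The following sketch asks the name node for the block locations of a file (the same hypothetical file as in the earlier example) and prints the hosts holding each replica and, on a rack-aware cluster, their topology paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/alice/big-dataset.bin");  // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("block at offset " + block.getOffset()
                    + ", length " + block.getLength());
            // Host names of the data nodes holding replicas of this block.
            for (String host : block.getHosts()) {
                System.out.println("  replica on " + host);
            }
            // On a rack-aware cluster, topology paths look like /rack1/host1.
            for (String path : block.getTopologyPaths()) {
                System.out.println("  topology path " + path);
            }
        }
        fs.close();
    }
}
```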
Q. What is the replication pipeline? What is its significance?
A. Data nodes maintain a pipeline for data transfer. In this pipeline, data node 1 does not need to wait for a complete block to arrive before it can start transferring the block to data node 2 in the flow. In fact, the data transfer from the client to data node 1 for a given block happens in smaller chunks of 4KB. When data node 1 receives the first 4KB chunk from the client, it stores this chunk in its local repository and immediately starts transferring it to data node 2 in the flow. Likewise, when data node 2 receives the first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts transferring it to data node 3, and so on. In this way, all the data nodes in the flow (except the last one) receive data from the previous data node and, at the same time, transfer it to the next data node in the flow, improving write performance by avoiding a wait at each stage.
Q. What is client-side caching, and what is its significance when writing data to HDFS?
A. HDFS uses several optimization techniques. One of them is client-side caching, used by the HDFS client to improve the performance of the block write operation and to minimize network congestion. The HDFS client transparently caches the file into a temporary local file; when it accumulates enough data to fill a block, the client reaches out to the name node. At this point, the name node responds by inserting the filename into the file system hierarchy and allocating data nodes for its storage. The client then flushes the block of data from the local temporary file to the closest data node, and that data node transfers the block to other data nodes (as instructed by the name node, based on the replication factor of the file). This client-side caching avoids continuous use of the network and minimizes the risk of network congestion.
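From the application's point of view, this caching is invisible: the client simply writes to an output stream. Below is a minimal sketch, assuming a Hadoop 2.x client whose FSDataOutputStream exposes hflush() to push client-side buffered data to the data nodes without waiting for a full block; the path and records are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BufferedWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/alice/stream.log");  // hypothetical path

        try (FSDataOutputStream out = fs.create(file)) {
            // Writes land in the client-side buffer first; HDFS decides when
            // the accumulated data is shipped to the data node pipeline.
            out.write("record 1\n".getBytes("UTF-8"));
            out.write("record 2\n".getBytes("UTF-8"));

            // hflush() (Hadoop 2.x) forces the buffered data out to the data
            // nodes without waiting for a full block to accumulate.
            out.hflush();
        }
        fs.close();
    }
}
```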
Q. How can you enable rack awareness in Hadoop?
A. You can make the Hadoop cluster rack aware by using a script that enables the master node to map the network topology of the cluster, configured through the property topology.script.file.name or net.topology.script.file.name (depending on the Hadoop version) in the core-site.xml configuration file. First, you must set this property to specify the name of the script file. Then you must write the script and place it at the specified location. The script should accept a list of IP addresses or host names and return the corresponding list of rack identifiers. For example, the script would take host.foo.bar as an argument and return /rack1 as the output.
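As an alternative to a script, Hadoop can also be pointed (via the net.topology.node.switch.mapping.impl property) at a Java class that performs the same host-to-rack mapping. The sketch below is a toy implementation, assuming the DNSToSwitchMapping interface shipped with recent Hadoop 2.x releases (the exact set of required methods varies slightly between versions), with a made-up mapping rule:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Toy rule: hosts whose names start with "host1" go to /rack1, everything
// else to /rack2. A real implementation would consult an inventory system.
public class SimpleRackMapping implements DNSToSwitchMapping {

    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>(names.size());
        for (String name : names) {
            racks.add(name.startsWith("host1") ? "/rack1" : "/rack2");
        }
        return racks;
    }

    // Required by newer versions of the interface; nothing is cached here.
    @Override
    public void reloadCachedMappings() { }

    @Override
    public void reloadCachedMappings(List<String> names) { }
}
```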
Quiz
- What is the data block replication factor?
- What is block size, and how is it controlled?
- What is a checkpoint, and who performs this operation?
- How does a name node ensure that all the data nodes are functioning properly?
- How does a client ensure that the data it receives while reading is not corrupted?
- Is there a way to recover an accidentally deleted file from HDFS?
- How can you access and manage files in HDFS?
- What two issues does HDFS encounter in Hadoop 1.0?
- What is a daemon?
Answers
- An application or a job can specify the number of replicas of a file that HDFS should maintain. The number of copies or replicas of each block of a file is called the replication factor of that file. The replication factor is configurable and can be changed at the cluster level or for each file when it is created, or even later for a stored file (the sketch following these answers shows this through the Java API).
- When a client writes a file, it splits the file into multiple chunks, called blocks. This data partitioning helps in parallel data writes and reads. Block size is controlled by the dfs.blocksize configuration property in the hdfs-site.xml file and applies to files created without an explicit block size. When creating a file, the client can also specify a block size to override the cluster-wide configuration.
- The process of generating a new fsimage by merging transactional records from the edit log into the current fsimage is called a checkpoint. The secondary name node periodically performs a checkpoint by downloading the fsimage and the edit log file from the name node and then uploading the new fsimage back to the name node. The name node also performs a checkpoint upon restart (not periodically, though; only at name node start-up).
- Each data node in the cluster periodically sends heartbeat signals and a block-report to the name node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A block-report from a data node contains a list of all blocks on that specific data node.
- When writing blocks of a file, an HDFS client computes the checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS file system namespace. Later, while reading the blocks, the client references these checksums to verify that the blocks were not corrupted (corruption might happen because of faults in a storage device, network transmission faults, or bugs in the program). When the client finds that a block is corrupted, it reaches out to another data node that has a replica of that block to get another copy of it.
- By default, no, but you can change this behavior. You can enable the Trash feature of HDFS using two configuration properties, fs.trash.interval and fs.trash.checkpoint.interval, in the core-site.xml configuration file. After enabling it, a deleted file is moved to the Trash folder and stays there for the configured interval. If you recover the file from the Trash folder before that interval expires, you are fine; otherwise, the file is permanently removed.
- You can access the files and data stored in HDFS in many different ways. For example, you can use HDFS FS Shell commands, leverage the Java API available in the classes of the org.apache.hadoop.fs package, write a MapReduce job, or write Hive or Pig queries. You can even use a web browser to browse the files of an HDFS cluster (the sketch following these answers uses the Java API).
- First, the name node in Hadoop 1.0 is a single point of failure. You can configure a secondary name node, but it is not an active-passive configuration; the secondary name node cannot take over if the name node fails. Second, as the number of data nodes grows beyond 4,000, the performance of the name node degrades, setting a kind of upper limit on the number of nodes in a cluster.
- The word daemon comes from the UNIX world. It refers to a process or service that runs in the background. On a Windows platform, we generally refer to it as a service. For example, in HDFS, we have daemons such as the name node, data node, and secondary name node.
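To tie a few of these answers together, the sketch below uses the Java API to browse a hypothetical directory and then change the replication factor of an already stored file; the equivalent FS Shell commands are hadoop fs -ls and hadoop fs -setrep:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManageFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Browse the namespace: list the files under a (hypothetical) directory.
        for (FileStatus status : fs.listStatus(new Path("/user/alice/logs"))) {
            System.out.println(status.getPath() + "  " + status.getLen()
                    + " bytes, replication " + status.getReplication());
        }

        // Change the replication factor of an already stored file to 2.
        fs.setReplication(new Path("/user/alice/logs/events.txt"), (short) 2);

        fs.close();
    }
}
```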