- Managing HDFS through the HDFS Shell Commands
- Using the dfsadmin Utility to Perform HDFS Operations
- Managing HDFS Permissions and Users
- Managing HDFS Storage
- Rebalancing HDFS Data
- Reclaiming HDFS Space
- Summary
Using the dfsadmin Utility to Perform HDFS Operations
The hdfs dfsadmin command lets you administer HDFS from the command line. While the hdfs dfs commands you learned about in the previous section help you manage HDFS files and directories, the dfsadmin command is useful for performing general HDFS-specific administrative tasks. It’s a good idea to become familiar with all the options that are available for the dfsadmin utility by issuing the following command:
$ hdfs dfsadmin -help
hdfs dfsadmin performs DFS administrative commands.
Note: Administrative commands can only be run with superuser permission.
The full syntax is:

hdfs dfsadmin
        [-report [-live] [-dead] [-decommissioning]]
        [-safemode &lt;enter | leave | get | wait&gt;]
        [-saveNamespace]
        ...
$
If you issue the dfsadmin command with no options, it lists all the options that you can specify with the command. The dfsadmin -help command is highly useful, since it not only lists the command options but also shows you what each one is for and its syntax. Figure 9.6 shows a portion of the output of the dfsadmin -help command.
Figure 9.6 The dfsadmin -help command reveals useful information for each dfsadmin command.
Several dfsadmin command options are especially useful. The next few sections look at the following three (other sections of this chapter, as well as other chapters, discuss several more):
dfsadmin -report
dfsadmin -refreshNodes
dfsadmin -metasave
The dfsadmin -report Command
The dfsadmin tool helps you examine the HDFS cluster status. The dfsadmin -report command produces useful output that shows basic statistics of the cluster, including the status of the DataNodes and NameNode, the configured disk capacity, and the health of the data blocks. Here's a sample dfsadmin -report command:
$ hdfs dfsadmin -report
Configured Capacity: 2068027170816000 (1.84 PB)     #A
Present Capacity: 2068027170816000 (1.84 PB)
DFS Remaining: 562576619120381 (511.66 TB)          #A
DFS Used: 1505450551695619 (1.34 PB)                #B
DFS Used%: 72.80%                                   #B
Under replicated blocks: 1                          #C
Blocks with corrupt replicas: 0
Missing blocks: 1
Missing blocks (with replication factor 1): 9       #C
-------------------------------------------------
Live datanodes (54):                                #D
Name: 10.192.0.78:50010 (hadoop02.localhost)        #E
Hostname: hadoop02.localhost.com
Rack: /rack3                                        #E
Decommission Status : Normal                        #F
Configured Capacity: 46015524438016 (41.85 TB)      #G
DFS Used: 33107988033048 (30.11 TB)
Non DFS Used: 0 (0 B)
DFS Remaining: 12907536404968 (11.74 TB)
DFS Used%: 71.95%
DFS Remaining%: 28.05%                              #G
Configured Cache Capacity: 4294967296 (4 GB)        #H
Cache Used: 0 (0 B)
Cache Remaining: 4294967296 (4 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%                           #H
Xceivers: 71
Last contact: Fri May 01 15:15:59 CDT 2015
...
The dfsadmin -report command shows HDFS details for the entire cluster, as well as separately for each node in the cluster. Its output shows the following at the cluster and the individual DataNode levels:
A summary of the HDFS storage allocation, including information about the configured, used and remaining space
If you’ve configured centralized HDFS caching, the used and remaining percentages of cache
Missing, corrupted and under-replicated blocks
As you'll learn later in this book, the dfsadmin -report command's output helps greatly in examining how balanced the HDFS data is, and in finding out the extent of HDFS corruption (if any exists).
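To spot imbalance quickly, you can extract each DataNode's DFS Used% figure from the report output. Here's a minimal sketch; the sample text embedded below is illustrative (on a live cluster you would pipe the real hdfs dfsadmin -report output into the same awk filter):

```shell
# Illustrative fragment of dfsadmin -report output (two DataNodes).
# On a real cluster, replace the echo with:
#   hdfs dfsadmin -report -live
report_sample='Name: 10.192.0.78:50010
DFS Used%: 71.95%
Name: 10.192.0.79:50010
DFS Used%: 45.10%'

# Print each DataNode address next to its usage percentage, one per line.
# A wide spread between nodes suggests the cluster needs rebalancing.
echo "$report_sample" | awk '
  /^Name:/     { node = $2 }
  /^DFS Used%/ { print node, $3 }'
```

A large gap between the highest and lowest percentages is the kind of imbalance the balancer (discussed later in this chapter) is meant to correct.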
The dfsadmin -refreshNodes Command
The dfsadmin -refreshNodes command updates the NameNode with the list of DataNodes that are allowed to connect to the NameNode.
The NameNode reads the hostnames of the DataNodes from the files pointed to by the dfs.hosts and dfs.hosts.exclude configuration parameters in the hdfs-site.xml file. The dfs.hosts file lists all the hosts that are allowed to register with the NameNode. Any entries in the dfs.hosts.exclude file point to DataNodes that need to be decommissioned (you finalize the decommissioning after all the replicas from the node that is being decommissioned are replicated to other DataNodes).
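The two parameters might be configured in hdfs-site.xml along these lines. This is a sketch: the file paths shown are examples, not defaults, so substitute whatever locations your site uses:

```xml
<!-- Sketch of an hdfs-site.xml fragment; the /etc/hadoop/conf paths
     below are illustrative, not Hadoop defaults. -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs.include</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

After editing either file, run the dfsadmin -refreshNodes command so the NameNode re-reads them and updates its list of permitted DataNodes.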
The dfsadmin -metasave Command
The dfsadmin -metasave command provides more detailed block-level information than the dfsadmin -report command. This command gives you various block-related pieces of information, such as:
Total number of blocks
Blocks waiting for replication
Blocks that are currently being replicated
Here's how you run the dfsadmin -metasave command:
$ sudo -u hdfs hdfs dfsadmin -metasave test.txt
Created metasave file test.txt in the log directory of namenode
hadoop1.localhost.com/10.192.2.21:8020
Created metasave file test.txt in the log directory of namenode
hadoop02.localhost.com/10.192.2.22:8020
$
When you run the dfsadmin -metasave command, it creates a file in the /var/log/hadoop-hdfs directory on the server where you executed the command. The output file will contain the following information regarding the blocks:
58 files and directories, 17 blocks = 75 total
Live Datanodes: 1
Dead Datanodes: 0
Metasave: Blocks waiting for replication: 0
Mis-replicated blocks that have been postponed:
Metasave: Blocks being replicated: 0
Metasave: Blocks 0 waiting deletion from 0 datanodes.
Metasave: Number of datanodes: 1
127.0.0.1:50010 IN 247241674752(230.26 GB) 323584(316 KB) 0% 220983930880(205.81 GB) Sat May 30 18:52:49 PDT 2015
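Since the metasave file is plain text, its replication summary lines are easy to pull out with grep. A small sketch, using an embedded copy of the sample contents above (on a real cluster you would point grep at the metasave file in the NameNode's log directory instead):

```shell
# Summarize block-replication state from a metasave output file.
# On a real cluster you would run something like:
#   grep '^Metasave:' /var/log/hadoop-hdfs/test.txt
# Here the sample contents are embedded so the filter can be demonstrated.
metasave_sample='58 files and directories, 17 blocks = 75 total
Live Datanodes: 1
Dead Datanodes: 0
Metasave: Blocks waiting for replication: 0
Metasave: Blocks being replicated: 0'

# Keep only the Metasave summary lines.
echo "$metasave_sample" | grep '^Metasave:'
```

Nonzero counts on the "waiting for replication" or "being replicated" lines tell you the NameNode is still working to restore full replication.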