HDFS Commands, HDFS Permissions and HDFS Storage
- Managing HDFS through the HDFS Shell Commands
- Using the dfsadmin Utility to Perform HDFS Operations
- Managing HDFS Permissions and Users
- Managing HDFS Storage
- Rebalancing HDFS Data
- Reclaiming HDFS Space
- Summary
This chapter is about managing HDFS storage with HDFS shell commands. You’ll also learn about the dfsadmin utility, a key ally in managing HDFS. The chapter also shows how to manage HDFS file permissions and create HDFS users. As a Hadoop administrator, one of your key tasks is to manage HDFS storage. The chapter shows how to check HDFS usage and how to allocate space quotas to HDFS users. The chapter also discusses when and how to rebalance HDFS data, as well as how you can reclaim HDFS space.
Working with HDFS is one of the most common tasks for someone administering a Hadoop cluster. Although you can access HDFS in multiple ways, the command line is the most common way to administer HDFS storage.
Managing HDFS users by granting them appropriate permissions and allocating HDFS space quotas to users are some of the common user-related administrative tasks you’ll perform on a regular basis. The chapter shows how HDFS permissions work and how to grant and revoke space quotas on HDFS directories.
Besides the management of users and their HDFS space quotas, there are other aspects of HDFS that you need to manage. This chapter also shows how to perform maintenance tasks such as periodically balancing the HDFS data to distribute it evenly across the cluster, as well as how to gain additional space in HDFS when necessary.
Managing HDFS through the HDFS Shell Commands
You can access HDFS in various ways:
From the command line using simple Linux-like file system commands, as well as over HTTP through a REST API called WebHDFS
Using the HttpFS gateway to access HDFS from behind a firewall
Through Hue’s File Browser (and through Cloudera Manager or Ambari, if you’re using the Cloudera or Hortonworks Hadoop distributions)
Figure 9.1 summarizes the various ways in which you can access HDFS. Although you have multiple ways to access HDFS, it’s a good bet that you’ll often be working from the command line to manage your HDFS files and directories. You can access the HDFS file system from the command line with the hdfs dfs file system commands.
Figure 9.1 The many ways in which you can access HDFS
Using the hdfs dfs Utility to Manage HDFS
You use the hdfs dfs utility to issue HDFS commands in Hadoop. Here’s the usage of this command:
hdfs dfs [GENERIC_OPTIONS] [COMMAND_OPTIONS]
Using the hdfs dfs utility, you can run file system commands on the file system supported in Hadoop, which happens to be HDFS.
You can use two types of HDFS shell commands:
The first set of shell commands is very similar to common Linux file system commands such as ls, mkdir and so on.
The second set of HDFS shell commands is specific to HDFS, such as the command that lets you set the file replication factor.
You can access the HDFS file system from the command line, over the web, or through application code. HDFS file system commands are in many cases quite similar to familiar Linux file system commands. For example, the command hdfs dfs -cat /path/to/hdfs/file works the same as the Linux cat command: it prints the contents of a file to the screen.
Internally, HDFS uses a sophisticated algorithm for its file system reads and writes, in order to support both reliability and high throughput. For example, when you issue a simple put command that writes a file to an HDFS directory, Hadoop has to write that data quickly to three nodes (by default).
You can access the HDFS shell by typing hdfs dfs <command> at the command line. You specify actions with subcommands that are prefixed with a minus (-) sign, as in dfs -cat for displaying a file’s contents.
You may view all available HDFS commands by simply invoking the hdfs dfs command with no options, as shown here:
$ hdfs dfs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
Figure 9.2 shows all the available HDFS dfs commands.
Figure 9.2 The hdfs dfs commands.
However, it’s the hdfs dfs -help command that’s truly useful to a beginner and even to quite a few “experts”: this command clearly explains all the hdfs dfs commands. Figure 9.3 shows how the help utility explains the various file copy options that you can use with the hdfs dfs command.
Figure 9.3 How the hdfs dfs -help command helps you understand the syntax of the various options of the hdfs dfs command
In the following sections, I show you how to
List HDFS files and directories
Use the hdfs dfs -stat command
Create an HDFS directory
Remove HDFS files and directories
Change file and directory ownership
Change HDFS file permissions
Listing HDFS Files and Directories
As with regular Linux file systems, use the ls command to list HDFS files. You can specify various options with the ls command, as shown here:
$ hdfs dfs -usage ls
Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path> ...]
Here's what the options stand for:
-d: Directories are listed as plain files.
-h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
-R: Recursively list subdirectories encountered.
-t: Sort output by modification time (most recent first).
-S: Sort output by file size.
-r: Reverse the sort order.
-u: Use access time rather than modification time for display and sorting.
Listing Both Files and Directories
If the target of the ls command is a file, it shows the statistics for the file, and if it’s a directory, it lists the contents of that directory. You can use the following command to get a directory listing of the HDFS root directory:
$ hdfs dfs -ls /
Found 8 items
drwxr-xr-x   - hdfs hdfs                0 2013-12-11 09:09 /data
drwxr-xr-x   - hdfs supergroup          0 2015-05-04 13:22 /lost+found
drwxrwxrwt   - hdfs hdfs                0 2015-05-20 07:49 /tmp
drwxr-xr-x   - hdfs supergroup          0 2015-05-07 14:38 /user
...
For example, the following command shows all files within a directory ordered by filenames:
$ hdfs dfs -ls /user/hadoop/testdir1
Alternately, you can specify the HDFS URI when listing files:
$ hdfs dfs -ls hdfs://<hostname>:9000/user/hdfs/dir1/
You can also specify multiple files or directories with the ls command:
$ hdfs dfs -ls /user/hadoop/testdir1 /user/hadoop/testdir2
Listing Just Directories
You can view information that pertains just to directories by passing the -d option:
$ hdfs dfs -ls -d /user/alapati
drwxr-xr-x   - hdfs supergroup          0 2015-05-20 12:27 /user/alapati
The following two ls command examples show how to list a single file and how to list a directory by its full URI:
$ hdfs dfs -ls /user/hadoop/testdir1/test1.txt
$ hdfs dfs -ls hdfs://<hostname>:9000/user/hadoop/dir1/
Note that when you list HDFS files, each file shows its replication factor in the second column. In the following listing, the file test.txt has a replication factor of 3 (the default replication factor).
$ hdfs dfs -ls /user/alapati/
-rw-r--r--   3 hdfs supergroup         12 2016-05-24 15:44 /user/alapati/test.txt
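Because the replication factor always sits in the second column of a file line, it is easy to extract in a script. The following is a minimal sketch, assuming the eight-column layout shown in the listing (permissions, replication, owner, group, size, date, time, path) and paths that contain no spaces. The function name replication_of is hypothetical and is not part of Hadoop.

```shell
# Read `hdfs dfs -ls` output on stdin and print "<replication> <path>" for each file.
# Directory lines (which show "-" in the replication column) and the
# "Found N items" header are skipped because they do not start with "-".
replication_of() {
  awk '$1 ~ /^-/ && NF >= 8 { print $2, $8 }'
}

# Sample line copied from the listing above. On a live cluster you would run:
#   hdfs dfs -ls /user/alapati/ | replication_of
sample='-rw-r--r--   3 hdfs supergroup         12 2016-05-24 15:44 /user/alapati/test.txt'
printf '%s\n' "$sample" | replication_of
```

This kind of filter is handy when you want to spot files whose replication factor differs from the cluster default.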
Using the hdfs dfs -stat Command to Get Details about a File
Although the hdfs dfs -ls command lets you get the file information you need, there are times when you need specific bits of information from HDFS. For example, the hdfs dfs -ls command always returns the complete path of the file. When you want to see only the base name, or any other single detail of a file, you can use the hdfs dfs -stat command with a format string.
You can format the output of the hdfs dfs -stat command with the following options:
%b  Size of file in bytes
%F  Returns "regular file", "directory", or "symlink", depending on the type of inode
%g  Group name
%n  Filename
%o  HDFS block size in bytes (128MB by default)
%r  Replication factor
%u  Username of owner
%y  Formatted mtime of inode
%Y  UNIX epoch mtime of inode, in milliseconds
In the following example, I show how to confirm whether a file or directory exists.
# hdfs dfs -stat "%n" /user/alapati/messages
messages
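In a script you usually care about the command's exit status rather than the printed name, since hdfs dfs -stat returns a nonzero status when the path doesn't exist. Here is a small sketch of that idea; the function name hdfs_path_exists is hypothetical, and the block only defines the function, so it makes no calls to the cluster until you invoke it.

```shell
# Succeed (exit status 0) if the given HDFS path exists, fail otherwise.
# The output and error message of -stat are discarded; only the status is used.
hdfs_path_exists() {
  hdfs dfs -stat "%n" "$1" >/dev/null 2>&1
}

# Example usage on a node with the Hadoop client installed:
#   if hdfs_path_exists /user/alapati/messages; then
#     echo "found"
#   else
#     echo "missing"
#   fi
```

The hdfs dfs -test -e <path> command offers the same kind of check without a format string.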
If you run the hdfs dfs -stat command against a directory, it tells you that the name you specify is indeed a directory.
$ hdfs dfs -stat "%b %F %g %n %o %r %u %y %Y" /user/alapati/test2222
0 directory supergroup test2222 0 0 hdfs 2015-05-24 20:44:11 1432500251198
The following examples show how you can view different types of information with the hdfs dfs -stat command when compared to the hdfs dfs -ls command. Note that I specify all the -stat command options here.
$ hdfs dfs -ls /user/alapati/test2222/true.txt
-rw-r--r--   2 hdfs supergroup         12 2015-05-24 15:44 /user/alapati/test2222/true.txt
$ hdfs dfs -stat "%b %F %g %n %o %r %u %y %Y" /user/alapati/test2222/true.txt
12 regular file supergroup true.txt 268435456 2 hdfs 2015-05-24 20:44:11 1432500251189
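The %Y value is expressed in milliseconds, so you have to drop the last three digits before passing it to most Linux date tools. The sketch below assumes GNU date (the -d "@seconds" form) and uses a hypothetical function name, epoch_ms_to_utc; it converts the %Y value from the output above back into the same timestamp that %y reports.

```shell
# Convert an HDFS %Y value (milliseconds since the UNIX epoch) to a UTC timestamp.
# Assumes GNU date, which accepts "@<seconds>" as a date string.
epoch_ms_to_utc() {
  date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S'
}

# Value taken from the -stat output for true.txt shown above.
epoch_ms_to_utc 1432500251189   # prints 2015-05-24 20:44:11
```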
I’d be remiss if I didn’t add that you can also access HDFS through Hue’s File Browser, as shown in Figure 9.4.
Figure 9.4 Hue’s File Browser, showing how you can access HDFS from Hue
Creating an HDFS Directory
Creating an HDFS directory is similar to how you create a directory in the Linux file system. Issue the mkdir command to create an HDFS directory. This command takes path URIs as arguments to create one or more directories, as shown here:
$ hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
The directory /user/hadoop must already exist for this command to succeed.
Here’s another example that shows how to create a directory by specifying its full URI.
$ hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir
If you want to create parent directories along the path, specify the -p option with the hdfs dfs -mkdir command, just as you would do with its cousin, the Linux mkdir command.
$ hdfs dfs -mkdir -p /user/hadoop/dir1
In this command, by specifying the -p option, I create both the parent directory hadoop and its subdirectory dir1 with a single mkdir command.
Removing HDFS Files and Directories
HDFS file and directory removal commands work similarly to the analogous commands in the Linux file system. The rm command with the -R option removes a directory and everything under that directory in a recursive fashion. Here’s an example.
$ hdfs dfs -rm -R /user/alapati
15/05/05 12:59:54 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoop01-ns/user/alapati' to trash at: hdfs://hadoop01-ns/user/hdfs/.Trash/Current
I issued an rm -R command, and I can verify that the directory I want to remove is indeed gone from HDFS. However, the output of the rm -R command shows that the directory is still saved for me, in HDFS’s trash directory, in case I need it. The trash directory serves as a built-in safety mechanism that protects you against accidental file and directory removals. If you haven’t already enabled trash, please do so ASAP!
Even when you enable trash, the trash interval is sometimes set too low, so make sure that you configure the fs.trash.interval parameter in the core-site.xml file appropriately. The value is specified in minutes. For example, setting this parameter to 14400 means Hadoop will retain the deleted items in trash for a period of ten days.
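Since the trash interval is expressed in minutes, it helps to derive the value from the retention period you actually have in mind. The following sketch does the arithmetic and prints a matching property element; the function names trash_minutes and trash_property are hypothetical helpers, not Hadoop commands.

```shell
# Convert a retention period in days to the minutes expected by fs.trash.interval.
trash_minutes() {
  echo $(( $1 * 24 * 60 ))
}

# Print a property element for the given number of days, ready to paste
# into the configuration file.
trash_property() {
  printf '<property>\n  <name>fs.trash.interval</name>\n  <value>%s</value>\n</property>\n' \
    "$(trash_minutes "$1")"
}

trash_minutes 10     # prints 14400
trash_property 10

# On a node with the Hadoop client installed, you can read back the
# effective setting with:
#   hdfs getconf -confKey fs.trash.interval
```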
You can view the deleted HDFS files currently in the trash directory by issuing the following command:
$ hdfs dfs -ls /user/sam/.Trash
You can use the -rmdir option to remove an empty directory:
$ hdfs dfs -rmdir /user/alapati/testdir
If the directory you wish to remove isn’t empty, use the -rm -R option as shown earlier.
If you’ve configured HDFS trash, any files or directories that you delete are moved to the trash directory and retained there for the length of time you’ve configured for the trash directory. On some occasions, such as when a directory fills up beyond the space quota you assigned for it, you may want to permanently delete files immediately. You can do so by issuing the dfs -rm command with the -skipTrash option, which must come before the path:
$ hdfs dfs -rm -skipTrash /user/alapati/test
The -skipTrash option will bypass the HDFS trash facility and immediately delete the specified files or directories.
You can empty the trash directory with the expunge command:
$ hdfs dfs -expunge
All files in trash that are older than the configured time interval are deleted when you issue the expunge command.
Changing File and Directory Ownership and Groups
You can change the owner and group names with the -chown command, as shown here:
$ hdfs dfs -chown sam:produsers /data/customers/names.txt
You must be a superuser to modify the ownership of files and directories.
Changing HDFS file permissions, ownership and groups works much the same way as it does in Linux. Figure 9.5 shows how to issue the familiar chmod, chown and chgrp commands in HDFS.
Figure 9.5 Changing file mode, ownership and group with HDFS commands
Changing Groups
You can change just the group of a file or directory with the chgrp command, as shown here:
$ sudo -u hdfs hdfs dfs -chgrp marketing /users/sales/markets.txt
Changing HDFS File Permissions
You can use the chmod command to change the permissions of a file or directory. You can use standard Linux file permissions. Here’s the general syntax for using the chmod command:
hdfs dfs -chmod [-R] <mode> <file/dir>
You must be a superuser or the owner of a file or directory to change its permissions.
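When you want to reproduce the permissions of an existing path, you need the octal form of the permission string that hdfs dfs -ls prints. The helper below is a sketch of that conversion, written under the assumption that the string uses the standard rwx layout, with a trailing t or T marking the sticky bit. The function name perm_to_octal is hypothetical.

```shell
# Convert the permission column of an `hdfs dfs -ls` line (for example
# drwxr-xr-x) into the octal mode you would pass to `hdfs dfs -chmod`.
perm_to_octal() {
  perms=${1#?}   # drop the leading type character (d or -)
  mode=""
  for i in 0 1 2; do
    start=$(( i * 3 + 1 ))
    triplet=$(printf '%s' "$perms" | cut -c "${start}-$(( start + 2 ))")
    digit=0
    case $triplet in r??) digit=$(( digit + 4 )) ;; esac
    case $triplet in ?w?) digit=$(( digit + 2 )) ;; esac
    case $triplet in ??[xt]) digit=$(( digit + 1 )) ;; esac   # t = execute plus sticky
    mode="$mode$digit"
  done
  # A t or T in the ninth position means the sticky bit is set.
  case $perms in
    ????????[tT]*) mode="1$mode" ;;
  esac
  printf '%s\n' "$mode"
}

perm_to_octal drwxr-xr-x   # prints 755
perm_to_octal drwxrwxrwt   # prints 1777
```

For instance, the /tmp directory in the root listing shown earlier in the chapter has the permissions drwxrwxrwt, which corresponds to the mode 1777.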
With the chgrp, chmod and chown commands you can specify the -R option to make recursive changes through the directory structure you specify.
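The mkdir, chown and chmod commands are often used together, for example when you set up a home directory for a new HDFS user. The following is a rough sketch of that sequence, not a prescribed procedure: the function name make_user_home, the /user/<name> location and the 750 mode are assumptions chosen for illustration, and the commands must be run by an HDFS superuser.

```shell
# Create an HDFS home directory for a user and hand it over to that user.
# Arguments: <user> <group>. Each step runs only if the previous one succeeded.
make_user_home() {
  user=$1
  group=$2
  hdfs dfs -mkdir -p "/user/${user}" &&
  hdfs dfs -chown "${user}:${group}" "/user/${user}" &&
  hdfs dfs -chmod 750 "/user/${user}"   # assumed mode: owner full, group read-only
}

# Example usage (as the HDFS superuser):
#   make_user_home sam produsers
```

Chaining the commands with && stops the sequence at the first failure, so you don't change ownership or permissions on a directory that was never created.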
In this section, I’m using HDFS commands from the command line to view and manipulate HDFS files and directories. However, there’s an even easier way to access HDFS, and that’s through Hue, the web-based interface, which is extremely easy to use and which lets you perform HDFS operations through a GUI. Hue comes with a File Browser application that lets you list and create files and directories, download and upload files from HDFS and copy/move files. You can also use Hue’s File Browser to view the output of your MapReduce jobs, Hive queries and Pig scripts.
While the hdfs dfs utility lets you manage the HDFS files and directories, the hdfs dfsadmin utility lets you perform key HDFS administrative tasks. In the next section, you’ll learn how to work with the dfsadmin utility to manage your cluster.