HDFS Commands, HDFS Permissions and HDFS Storage

Jan 25, 2017


Managing HDFS through the HDFS Shell Commands
Using the dfsadmin Utility to Perform HDFS Operations
Managing HDFS Permissions and Users
Managing HDFS Storage
Rebalancing HDFS Data
Reclaiming HDFS Space
Summary


This chapter is about managing HDFS storage with HDFS shell commands. You’ll also learn about the dfsadmin utility, a key ally in managing HDFS. The chapter also shows how to manage HDFS file permissions and create HDFS users. As a Hadoop administrator, one of your key tasks is to manage HDFS storage. The chapter shows how to check HDFS usage and how to allocate space quotas to HDFS users. The chapter also discusses when and how to rebalance HDFS data, as well as how you can reclaim HDFS space.

This chapter is from the book 

Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS

Learn More Buy

Working with HDFS is one of the most common tasks for someone administering a Hadoop cluster. Although you can access HDFS in multiple ways, the command line is the most common way to administer HDFS storage.

Managing HDFS users by granting them appropriate permissions and allocating HDFS space quotas to users are some of the common user-related administrative tasks you’ll perform on a regular basis. The chapter shows how HDFS permissions work and how to grant and revoke space quotas on HDFS directories.

Besides the management of users and their HDFS space quotas, there are other aspects of HDFS that you need to manage. This chapter also shows how to perform maintenance tasks such as periodically balancing the HDFS data to distribute it evenly across the cluster, as well as how to gain additional space in HDFS when necessary.

Managing HDFS through the HDFS Shell Commands

You can access HDFS in various ways:

From the command line using simple Linux-like file system commands, as well as over HTTP through the WebHDFS REST API
Using the HttpFS gateway to access HDFS from behind a firewall
Through Hue’s File Browser (and Cloudera Manager or Ambari, if you’re using the Cloudera or Hortonworks Hadoop distributions)

Figure 9.1 summarizes the various ways in which you can access HDFS. Although you have multiple ways to access HDFS, it’s a good bet that you’ll often be working from the command line to manage your HDFS files and directories. You can access the HDFS file system from the command line with the hdfs dfs file system commands.

Figure 9.1 The many ways in which you can access HDFS

File Systems other than HDFS

It’s important to keep in mind that HDFS is only one of the file systems that Hadoop supports. There are several other Java implementations of file systems that work with Hadoop. These include the local file system (file), WebHDFS (webhdfs), Hadoop archive files (har), View (viewfs), S3 (s3a) and others. For each file system, Hadoop uses a different URI scheme to connect to the file system instance. For example, you list the files in the local file system by using the file URI scheme, as shown here:

$ hdfs dfs -ls file:///

This will get you a listing of files stored on the local Linux file system.

Using the `hdfs dfs` Utility to Manage HDFS

You use the hdfs dfs utility to issue HDFS commands in Hadoop. Here’s the usage of this command:

hdfs dfs [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Using the hdfs dfs utility, you can run file system commands on any file system that Hadoop supports, which in most cases is HDFS.

You can use two types of HDFS shell commands:

The first set of shell commands are very similar to common Linux file system commands such as ls, mkdir and so on.
The second set of HDFS shell commands are specific to HDFS, such as the command that lets you set the file replication factor.

You can access the HDFS file system from the command line, over the web, or through application code. HDFS file system commands are in many cases quite similar to familiar Linux file system commands. For example, the command hdfs dfs -cat /path/to/hdfs/file works the same as a Linux cat command, printing the contents of a file to the screen.

Internally HDFS uses a pretty sophisticated algorithm for its file system reads and writes, in order to support both reliability and high throughput. For example, when you issue a simple put command that writes a file to an HDFS directory, Hadoop will need to write that data fast to three nodes (by default).

You can access the HDFS shell by typing hdfs dfs <command> at the command line. You specify actions with subcommands that are prefixed with a minus (-) sign, as in hdfs dfs -cat for displaying a file’s contents.

You may view all available HDFS commands by simply invoking the hdfs dfs command with no options, as shown here:

$ hdfs dfs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]

Figure 9.2 shows all the available HDFS dfs commands.

Figure 9.2 The hdfs dfs commands.

However, it’s the hdfs dfs -help command that’s truly useful to a beginner and even to quite a few “experts”: this command explains all the hdfs dfs commands. Figure 9.3 shows how the help utility explains the various file copy options that you can use with the hdfs dfs command.

Figure 9.3 How the hdfs dfs -help command helps you understand the syntax of the various options of the hdfs dfs command

In the following sections, I show you how to

List HDFS files and directories
Use the HDFS stat command
Create an HDFS directory
Remove HDFS files and directories
Change file and directory ownership
Change HDFS file permissions

Listing HDFS Files and Directories

As with regular Linux file systems, use the ls command to list HDFS files. You can specify various options with the ls command, as shown here:

$ hdfs dfs -usage ls
Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path> ...]
$

Here’s what the options stand for:

-d: Directories are listed as plain files.
-h: Format file sizes in a human-readable fashion (e.g., 64.0m instead of 67108864).
-R: Recursively list subdirectories encountered.
-t: Sort output by modification time (most recent first).
-S: Sort output by file size.
-r: Reverse the sort order.
-u: Use access time rather than modification time for display and sorting.

The -t, -S, -r and -u options come from later Hadoop releases, so they may not be available in the release that produced the usage output shown here.

Listing Both Files and Directories

If the target of the ls command is a file, it shows the statistics for the file, and if it’s a directory, it lists the contents of that directory. You can use the following command to get a directory listing of the HDFS root directory:

$ hdfs dfs -ls /
Found 8 items
drwxr-xr-x   - hdfs   hdfs         0 2013-12-11 09:09 /data
drwxr-xr-x   - hdfs   supergroup   0 2015-05-04 13:22 /lost+found
drwxrwxrwt   - hdfs   hdfs         0 2015-05-20 07:49 /tmp
drwxr-xr-x   - hdfs   supergroup   0 2015-05-07 14:38 /user
...
#

The following command lists all files within a directory, ordered by filename:

$ hdfs dfs -ls /user/hadoop/testdir1

Alternately, you can specify the HDFS URI when listing files:

$ hdfs dfs -ls hdfs://<hostname>:9000/user/hdfs/dir1/

You can also specify multiple files or directories with the ls command:

$ hdfs dfs -ls /user/hadoop/testdir1 /user/hadoop/testdir2

Listing Just Directories

You can view information that pertains just to directories by passing the -d option:

$ hdfs dfs -ls -d /user/alapati
drwxr-xr-x   - hdfs supergroup      0 2015-05-20 12:27 /user/alapati
$

The following two ls command examples show file information, first for a single file and then for a directory specified with its full URI:

$ hdfs dfs -ls /user/hadoop/testdir1/test1.txt
$ hdfs dfs -ls hdfs://<hostname>:9000/user/hadoop/dir1/

Note that when you list HDFS files, each file shows its replication factor in the second column. In the following example, the file test.txt has a replication factor of 3 (the default replication factor).

$ hdfs dfs -ls /user/alapati/
-rw-r--r--   3 hdfs supergroup      12 2016-05-24 15:44 /user/alapati/test.txt
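
Because the replication factor is simply the second column of the listing, you can pick it out with standard text tools. The following is a minimal sketch rather than a Hadoop feature: files_below_replication is a hypothetical helper name, and it assumes that your HDFS paths contain no spaces.

```shell
# files_below_replication MIN
# Reads `hdfs dfs -ls` output on stdin and prints "<replication> <path>"
# for every regular file whose replication factor is below MIN.
# Directories (first field starts with "d") show "-" instead of a
# replication factor and are skipped, as is the "Found N items" header.
# Assumption: paths contain no spaces, so the path is the last field.
files_below_replication() {
    awk -v min="$1" '$1 ~ /^-/ && ($2 + 0) < (min + 0) { print $2, $NF }'
}
```

For example, hdfs dfs -ls -R /user/alapati | files_below_replication 3 would print any file under that directory that has fewer than three replicas.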

Using the `hdfs dfs -stat` Command to Get Details about a File

Although the hdfs dfs -ls command lets you get the file information you need, there are times when you need specific bits of information from HDFS. For example, the hdfs dfs -ls command always returns the complete path of a file. When you want to see only the base name, or any other specific detail of a file, you can use the hdfs dfs -stat command.

You can format the output of the hdfs dfs -stat command with the following options:

%b Size of file in bytes
%F Will return "regular file", "directory", or "symlink" depending on the type of inode
%g Group name
%n Filename
%o HDFS block size in bytes (128 MB by default)
%r Replication factor
%u Username of owner
%y Formatted mtime of inode
%Y UNIX Epoch mtime of inode

In the following example, I show how to confirm if a file or directory exists.

# hdfs dfs -stat "%n" /user/alapati/messages
messages

If you run the hdfs dfs -stat command against a directory, it tells you that the name you specify is indeed a directory.

$ hdfs dfs -stat "%b %F %g %n %o %r %u %y %Y" /user/alapati/test2222
0 directory supergroup test2222 0 0 hdfs 2015-05-24 20:44:11 1432500251198
$

The following examples show how the information you can view with the hdfs dfs -stat command compares with the output of the hdfs dfs -ls command. Note that I specify all the -stat command options here.

$ hdfs dfs -ls /user/alapati/test2222/true.txt
-rw-r--r--   2 hdfs supergroup         12 2015-05-24 15:44 /user/alapati/test2222/true.txt
$

$ hdfs dfs -stat "%b %F %g %n %o %r %u %y %Y" /user/alapati/test2222/true.txt
12 regular file supergroup true.txt 268435456 2 hdfs 2015-05-24 20:44:11 1432500251189
$
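
The %y and %Y options report the same modification time in two forms: a formatted UTC timestamp and milliseconds since the UNIX epoch. If you capture %Y in a script, you can turn it back into the readable form yourself. Here’s a small sketch; epoch_ms_to_utc is a hypothetical helper name, and it assumes GNU date (the -d @SECONDS syntax), which you’ll find on most Linux systems but not on BSD or macOS.

```shell
# epoch_ms_to_utc MILLIS
# Converts a %Y value (milliseconds since the UNIX epoch) into the
# "yyyy-MM-dd HH:mm:ss" UTC form that %y prints.
# Assumption: GNU date, which accepts -d "@SECONDS".
epoch_ms_to_utc() {
    # Integer division drops the millisecond part.
    date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S'
}
```

Running epoch_ms_to_utc 1432500251189 on the %Y value from the preceding output gives 2015-05-24 20:44:11, which matches the %y value in the same output.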

I’d be remiss if I didn’t add that you can also access HDFS through Hue’s File Browser, as shown in Figure 9.4.

Figure 9.4 Hue’s File Browser, showing how you can access HDFS from Hue

Creating an HDFS Directory

Creating an HDFS directory is similar to how you create a directory in the Linux file system. Issue the mkdir command to create an HDFS directory. This command takes path URIs as arguments to create one or more directories, as shown here:

$ hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

The directory /user/hadoop must already exist for this command to succeed.

Here’s another example that shows how to create a directory by specifying a directory with a URI.

$ hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir

If you want to create parent directories along the path, specify the -p option with the hdfs dfs -mkdir command, just as you would do with its cousin, the Linux mkdir command.

$ hdfs dfs -mkdir -p /user/hadoop/dir1

In this command, by specifying the -p option, I create both the parent directory hadoop and its subdirectory dir1 with a single mkdir command.

Removing HDFS Files and Directories

HDFS file and directory removal commands work similarly to the analogous commands in the Linux file system. The rm command with the -R option removes a directory and everything under that directory in a recursive fashion. Here’s an example.

$ hdfs dfs -rm -R /user/alapati
15/05/05 12:59:54 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoop01-ns/user/alapati' to trash at: hdfs://hadoop01-ns/user/hdfs/.Trash/Current
$

I issued an rm -R command, and I can verify that the directory I want to remove is indeed gone from HDFS. However, the output of the rm -R command shows that the directory is still saved for me, in HDFS’s trash directory, in case I need it. The trash directory serves as a built-in safety mechanism that protects you against accidental file and directory removals. If you haven’t already enabled trash, please do so ASAP!

Even when you enable trash, sometimes the trash interval is set too low, so make sure that you configure the fs.trash.interval parameter in the core-site.xml file appropriately. The parameter is specified in minutes, so setting it to 14,400, for example, means Hadoop will retain the deleted items in trash for a period of ten days.
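
Since fs.trash.interval is expressed in minutes, it’s worth doing the arithmetic explicitly when you pick a value. The following one-line helper is only an illustration (trash_interval_minutes is a hypothetical name, not a Hadoop command):

```shell
# trash_interval_minutes DAYS
# Prints the value to use for fs.trash.interval (which is in minutes)
# to keep deleted items in trash for the given number of days.
trash_interval_minutes() {
    echo $(( $1 * 24 * 60 ))
}
```

Here, trash_interval_minutes 10 prints 14400, and trash_interval_minutes 1 prints 1440, which is the deletion interval reported in the rm -R output shown earlier.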

You can view the deleted HDFS files currently in the trash directory by issuing the following command:

$ hdfs dfs -ls /user/sam/.Trash
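
When you need to restore something from trash, it helps to know where to look. A deleted item keeps its original absolute path underneath the Current directory of the trash of the user who deleted it. The sketch below builds that location; trash_path is a hypothetical helper, and it assumes the default trash layout under /user/<username>/.Trash and that the path isn’t inside an HDFS encryption zone.

```shell
# trash_path USER ORIGINAL_PATH
# Prints where a path deleted by USER is expected to sit in HDFS trash.
# Assumptions: default trash root /user/<USER>/.Trash, ORIGINAL_PATH is
# absolute, and the item has not yet been rolled into a checkpoint.
trash_path() {
    printf '/user/%s/.Trash/Current%s\n' "$1" "$2"
}
```

For the directory removed by the hdfs user in the earlier example, you could restore it with hdfs dfs -mv "$(trash_path hdfs /user/alapati)" /user/alapati, provided the trash interval hasn’t expired yet.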

You can use the -rmdir option to remove an empty directory:

$ hdfs dfs -rmdir /user/alapati/testdir

If the directory you wish to remove isn’t empty, use the -rm -R option as shown earlier.

If you’ve configured HDFS trash, any files or directories that you delete are moved to the trash directory and retained in there for the length of time you’ve configured for the trash directory. On some occasions, such as when a directory fills up beyond the space quota you assigned for it, you may want to permanently delete files immediately. You can do so by issuing the hdfs dfs -rm command with the -skipTrash option, which you specify before the path:

$ hdfs dfs -rm -skipTrash /user/alapati/test

The -skipTrash option will bypass the HDFS trash facility and immediately delete the specified files or directories.

You can empty the trash directory with the expunge command:

$ hdfs dfs -expunge

All files in trash that are older than the configured time interval are deleted when you issue the expunge command.

Changing File and Directory Ownership and Groups

You can change the owner and group names with the -chown command, as shown here:

$ hdfs dfs -chown sam:produsers /data/customers/names.txt

You must be a super user to modify the ownership of files and directories.
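
A common use of -chown is handing a newly created home directory over to its user. The sketch below only prints the commands instead of running them, so you can review them first. Note that print_home_dir_commands is a hypothetical helper, and the /user/<name> layout is an assumption based on the paths used in this chapter.

```shell
# print_home_dir_commands USER...
# Prints (does not run) the commands that would create /user/<name>
# for each user and then make that user the owner of the directory.
# Assumption: home directories live under /user, as in this chapter.
print_home_dir_commands() {
    for user in "$@"; do
        printf 'hdfs dfs -mkdir -p /user/%s\n' "$user"
        printf 'hdfs dfs -chown %s /user/%s\n' "$user" "$user"
    done
}
```

Once you’re happy with the output of, say, print_home_dir_commands sam nina, you can run the printed commands as the HDFS super user.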

HDFS file permissions work much like file and directory permissions in Linux. Figure 9.5 shows how to issue the familiar chmod, chown and chgrp commands in HDFS.

Figure 9.5 Changing file mode, ownership and group with HDFS commands

Changing Groups

You can change just the group of a file or directory with the chgrp command, as shown here:

$ sudo -u hdfs hdfs dfs -chgrp marketing /users/sales/markets.txt

Changing HDFS File Permissions

You can use the chmod command to change the permissions of a file or directory. You can use standard Linux file permissions. Here’s the general syntax for using the chmod command:

hdfs dfs -chmod [-R] <mode> <file/dir>

You must be a super user or the owner of a file or directory to change its permissions.

With the chgrp, chmod and chown commands you can specify the -R option to make recursive changes through the directory structure you specify.
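
The ls command shows permissions in symbolic form (such as drwxr-xr-x), while you’ll often pass an octal mode to chmod. If you want to copy the permissions of one path to another, you need to translate between the two. Here’s a rough sketch of that translation; perm_to_octal is a hypothetical helper of my own, not part of Hadoop, and it handles only the r, w and x bits plus the sticky bit (t or T).

```shell
# perm_to_octal PERMS
# Converts a permission string as printed by `hdfs dfs -ls`
# (for example drwxr-xr-x) into an octal mode for -chmod.
# Only r, w, x and the sticky bit (t/T in the last position) are handled.
perm_to_octal() {
    perms=${1#?}          # drop the leading file-type character
    sticky=0
    mode=""
    for i in 1 2 3; do
        triad=$(printf '%s' "$perms" | cut -c1-3)
        perms=${perms#???}
        digit=0
        case $triad in r??)    digit=$((digit + 4)) ;; esac
        case $triad in ?w?)    digit=$((digit + 2)) ;; esac
        case $triad in ??[xt]) digit=$((digit + 1)) ;; esac
        # A t or T in the "other" triad means the sticky bit is set.
        if [ "$i" -eq 3 ]; then
            case $triad in ??[tT]) sticky=1 ;; esac
        fi
        mode="$mode$digit"
    done
    if [ "$sticky" -eq 1 ]; then
        mode="1$mode"
    fi
    printf '%s\n' "$mode"
}
```

With the root directory listing shown earlier, perm_to_octal drwxr-xr-x prints 755 and perm_to_octal drwxrwxrwt (the /tmp directory) prints 1777, either of which you can pass as the <mode> argument of hdfs dfs -chmod.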

In this section, I’m using HDFS commands from the command line to view and manipulate HDFS files and directories. However, there’s an even easier way to access HDFS, and that’s through Hue, the web-based interface, which is extremely easy to use and which lets you perform HDFS operations through a GUI. Hue comes with a File Browser application that lets you list and create files and directories, download and upload files from HDFS and copy/move files. You can also use Hue’s File Browser to view the output of your MapReduce jobs, Hive queries and Pig scripts.

While the hdfs dfs utility lets you manage the HDFS files and directories, the hdfs dfsadmin utility lets you perform key HDFS administrative tasks. In the next section, you’ll learn how to work with the dfsadmin utility to manage your cluster.
