Running Basic Hadoop Benchmarks
Many Hadoop benchmarks can provide insight into cluster performance. The best benchmarks are always those that reflect real application performance. The two benchmarks discussed in this section, terasort and TestDFSIO, provide a good sense of how well your Hadoop installation is operating and can be compared with public data published for other Hadoop systems. The results, however, should not be taken as a single indicator for system-wide performance on all applications.
The following benchmarks are designed for full Hadoop cluster installations. These tests assume a multi-disk HDFS environment. Running these benchmarks in the Hortonworks Sandbox or in the pseudo-distributed single-node install from Chapter 2 is not recommended because all input and output (I/O) are done using a single system disk drive.
Running the Terasort Test
The terasort benchmark sorts a specified amount of randomly generated data. This benchmark provides combined testing of the HDFS and MapReduce layers of a Hadoop cluster. A full terasort benchmark run consists of the following three steps:
- Generating the input data via the teragen program.
- Running the actual terasort benchmark on the input data.
- Validating the sorted output data via the teravalidate program.
Each row of teragen data is 100 bytes long; thus the total amount of data written is 100 times the number of rows specified as part of the benchmark (i.e., to write 100GB of data, use 1 billion rows). The input and output directories must be specified in HDFS. The following sequence of commands runs the benchmark for 50GB (500,000,000 rows) of data as user hdfs. Make sure the /user/hdfs directory exists in HDFS before running the benchmarks; a quick way to create it is shown below.
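If the directory is missing, it can be created beforehand. The following is a minimal sketch, assuming you can run commands as the HDFS superuser; the owner and group names reflect the hdfs user used throughout this section:

$ hdfs dfs -mkdir -p /user/hdfs
$ hdfs dfs -chown hdfs:hdfs /user/hdfs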
Run teragen to generate rows of random data to sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 \
    /user/hdfs/TeraGen-50GB
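Before sorting, it can be worth confirming that teragen wrote the expected amount of data. This check is optional and not part of the benchmark; the -du -s -h options summarize the directory size in human-readable form:

$ hdfs dfs -du -s -h /user/hdfs/TeraGen-50GB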
Run terasort to sort the database.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort \
    /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
Run teravalidate to validate the sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate \
    /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB
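Teravalidate writes its report into the output directory. As an optional check, the report can be inspected directly; if the sort succeeded, it contains a checksum entry and no error records (the part file name below assumes the default single reducer):

$ hdfs dfs -cat /user/hdfs/TeraValid-50GB/part-r-00000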
To report results, the time for the actual sort (terasort) is measured and the benchmark rate in megabytes/second (MB/s) is calculated. For best performance, the actual terasort benchmark should be run with a replication factor of 1. In addition, the default number of terasort reducer tasks is set to 1. Increasing the number of reducers often helps benchmark performance. For example, the following command instructs terasort to use four reducer tasks (mapred.reduce.tasks is the older, deprecated property name; Hadoop 2 also accepts mapreduce.job.reduces):
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort \
    -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
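If you prefer to compute the MB/s rate directly rather than pulling timings from the job logs, a simple bash timing wrapper suffices. This is a sketch, not part of the benchmark suite; the 50GB size (51,200 MB) is hard-coded and the rate is integer-truncated:

$ SIZE_MB=51200   # 50GB of teragen data expressed in MB; adjust for other sizes
$ START=$(date +%s)
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort \
    -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
$ END=$(date +%s)
$ echo "terasort rate: $(( SIZE_MB / (END - START) )) MB/s"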
Also, do not forget to clean up the terasort data between runs (and after testing is finished). The following command will perform the cleanup for the previous example:
$ hdfs dfs -rm -r -skipTrash Tera*
Running the TestDFSIO Benchmark
Hadoop also includes an HDFS benchmark application called TestDFSIO. The TestDFSIO benchmark is a read and write test for HDFS. That is, it will write or read a number of files to and from HDFS and is designed in such a way that it will use one map task per file. The file size and number of files are specified as command-line arguments. Similar to the terasort benchmark, you should run this test as user hdfs.
Like terasort, TestDFSIO involves several steps. In the following example, 16 files of 1GB each are specified. Note that the TestDFSIO benchmark is part of hadoop-mapreduce-client-jobclient-tests.jar, which also contains other benchmarks; running the jar with no arguments will yield a list (an example appears after the cleanup step below). In addition to TestDFSIO, NNBench (load testing the NameNode) and MRBench (load testing the MapReduce framework) are commonly used Hadoop benchmarks, but TestDFSIO is perhaps the most widely reported. The steps to run TestDFSIO are as follows:
Run TestDFSIO in write mode and create data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar \
    TestDFSIO -write -nrFiles 16 -fileSize 1000
Example results are as follows (date and time prefix removed).
fs.TestDFSIO: ----- TestDFSIO ----- : write
fs.TestDFSIO:            Date & time: Thu May 14 10:39:33 EDT 2015
fs.TestDFSIO:        Number of files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0
fs.TestDFSIO:      Throughput mb/sec: 14.890106361891005
fs.TestDFSIO: Average IO rate mb/sec: 15.690713882446289
fs.TestDFSIO:  IO rate std deviation: 4.0227035201665595
fs.TestDFSIO:     Test exec time sec: 105.631
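A common back-of-the-envelope reading of these numbers (an estimate, not an official TestDFSIO metric) is to multiply the per-task throughput by the number of concurrently written files to approximate the aggregate write bandwidth seen by the cluster: here, 16 files × ~14.9 MB/s ≈ 238 MB/s.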
Run TestDFSIO in read mode.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar \
    TestDFSIO -read -nrFiles 16 -fileSize 1000
Example results are as follows (date and time prefix removed). The large standard deviation is due to task placement on a small four-node cluster.
fs.TestDFSIO: ----- TestDFSIO ----- : read
fs.TestDFSIO:            Date & time: Thu May 14 10:44:09 EDT 2015
fs.TestDFSIO:        Number of files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0
fs.TestDFSIO:      Throughput mb/sec: 32.38643494172466
fs.TestDFSIO: Average IO rate mb/sec: 58.72880554199219
fs.TestDFSIO:  IO rate std deviation: 64.60017624360337
fs.TestDFSIO:     Test exec time sec: 62.798
Clean up the TestDFSIO data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar \
    TestDFSIO -clean
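As mentioned earlier, this jar bundles several other benchmarks (including NNBench and MRBench). Running it with no arguments prints the list of valid program names:

$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar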
Running the TestDFSIO and terasort benchmarks helps you gain confidence in a Hadoop installation and can uncover potential problems. It is also instructive to view the Ambari dashboard and the YARN web GUI (as described previously) as the tests run.
Managing Hadoop MapReduce Jobs
Hadoop MapReduce jobs can be managed using the mapred job command. The most important options for this command in terms of the examples and benchmarks are -list, -kill, and -status. In particular, if you need to kill one of the examples or benchmarks, you can use the mapred job -list command to find the job-id and then use mapred job -kill <job-id> to kill the job across the cluster. MapReduce jobs can also be controlled at the application level with the yarn application command (see Chapter 10, “Basic Hadoop Administration Procedures”). The possible options for mapred job are as follows:
$ mapred job
Usage: CLI <command> <args>
    [-submit <job-file>]
    [-status <job-id>]
    [-counter <job-id> <group-name> <counter-name>]
    [-kill <job-id>]
    [-set-priority <job-id> <priority>]. Valid values for priorities
        are: VERY_HIGH HIGH NORMAL LOW VERY_LOW
    [-events <job-id> <from-event-#> <#-of-events>]
    [-history <jobHistoryFile>]
    [-list [all]]
    [-list-active-trackers]
    [-list-blacklisted-trackers]
    [-list-attempt-ids <job-id> <task-type> <task-state>]. Valid values
        for <task-type> are REDUCE MAP. Valid values for <task-state> are
        running, completed
    [-kill-task <task-attempt-id>]
    [-fail-task <task-attempt-id>]
    [-logs <job-id> <task-attempt-id>]

Generic options supported are
-conf <configuration file>                    specify an application configuration file
-D <property=value>                           use value for given property
-fs <local|namenode:port>                     specify a namenode
-jt <local|resourcemanager:port>              specify a ResourceManager
-files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
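For example, a typical sequence for stopping a runaway benchmark is shown below (the job ID is hypothetical; use the one reported by -list):

$ mapred job -list
$ mapred job -kill job_1432667013445_0001   # hypothetical job ID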