Big Data Analysis with MapReduce and Hadoop
In the evolution of data processing, we moved from flat files to relational databases and from relational databases to NoSQL databases. Essentially, as the amount of captured data increased, so did our needs, and traditional patterns no longer sufficed. The databases of old worked well with data that measured in megabytes and gigabytes, but now that companies realize “data is king,” the amount of captured data is measured in terabytes and petabytes. Even with NoSQL data stores, the question remains: How do we analyze that amount of data?
The most popular answer to this is: Hadoop. Hadoop is an open-source framework for developing and executing distributed applications that process very large amounts of data. Hadoop is meant to run on large clusters of commodity machines, which can be machines in your data center that you’re not using or even Amazon EC2 images. The danger, of course, in running on commodity machines is how to handle failure. Hadoop is architected with the assumption that hardware will fail and as such, it can gracefully handle most failures. Furthermore, its architecture allows it to scale nearly linearly, so as processing capacity demands increase, the only constraint is the amount of budget you have to add more machines to your cluster.
This article presents an overview of Hadoop’s architecture to describe how it can achieve these bold claims, and it demonstrates, from a high-level, how to build a MapReduce application.
Hadoop Architecture
At a high-level, Hadoop operates on the philosophy of pushing analysis code close to the data it is intended to analyze rather than requiring code to read data across a network. As such, Hadoop provides its own file system, aptly named Hadoop File System or HDFS. When you upload your data to the HDFS, Hadoop will partition your data across the cluster (keeping multiple copies of it in case your hardware fails), and then it can deploy your code to the machine that contains the data upon which it is intended to operate.
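To give a feel for what this looks like in practice, the sketch below uses Hadoop's Java FileSystem API to copy a local file into HDFS; the file names and paths are hypothetical, and the same operation is available from the command line (for example, hadoop fs -put). Treat it as a minimal illustration rather than production code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main( String[] args ) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath to locate the NameNode
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get( conf );

        // Copy a local file into HDFS; Hadoop splits it into blocks and
        // replicates those blocks across the DataNodes in the cluster
        Path local = new Path( "/tmp/access.log" );     // hypothetical local file
        Path remote = new Path( "/data/access.log" );   // hypothetical HDFS destination
        fs.copyFromLocalFile( local, remote );

        // Confirm that the file is now visible to the cluster
        System.out.println( "Uploaded: " + fs.exists( remote ) );
    }
}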
Like many NoSQL databases, HDFS organizes data by keys and values rather than relationally. In other words, each piece of data has a unique key and a value associated with that key. Relationships between keys, if they exist, are defined in the application, not by HDFS. And in practice, you’re going to have to think about your problem domain a bit differently in order to realize the full power of Hadoop (see the next section on MapReduce).
The components that comprise Hadoop are:
- HDFS: The Hadoop file system is a distributed file system designed to hold huge amounts of data across multiple nodes in a cluster (where huge can be defined as files that are 100+ terabytes in size!). Hadoop provides both an API and a command-line interface for interacting with HDFS.
- MapReduce Application: The next section reviews the details of MapReduce, but in short, MapReduce is a functional programming paradigm for analyzing a single record in your HDFS and then assembling the results into a consumable solution. The Mapper is responsible for the data processing step, while the Reducer receives the output from the Mappers and reduces the values that apply to the same key.
- Partitioner: The partitioner is responsible for dividing a particular analysis problem into workable chunks of data for use by the various Mappers. The HashPartitioner is a partitioner that divides work up by “rows” of data in the HDFS, but you are free to create your own custom partitioner if you need to divide your data up differently.
- Combiner: If, for some reason, you want to perform a local reduce that combines data before sending it back to Hadoop, then you’ll need to create a combiner. A combiner performs the reduce step, which groups values together with their keys, but on a single node before returning the key/value pairs to Hadoop for proper reduction.
- InputFormat: Most of the time the default readers will work fine, but if your data is not formatted in a standard way, such as “key, value” or “key [tab] value”, then you will need to create a custom InputFormat implementation.
- OutputFormat: Your MapReduce applications will read data in some InputFormat and then write data out through an OutputFormat. Standard formats, such as “key [tab] value”, are supported out of the box, but if you want to do something else, then you need to create your own OutputFormat implementation. The job driver sketch after this list shows where each of these components gets plugged into a Hadoop job.
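To make the roles of these components a little more concrete, the following sketch wires them together in a job driver using the older org.apache.hadoop.mapred API. The WordCountMap and WordCountReduce classes are hypothetical placeholders for the map and reduce implementations discussed in the next section; the Partitioner, InputFormat, and OutputFormat shown here are simply the defaults, spelled out so you can see where a custom implementation would go.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class WordCountJob {
    public static void main( String[] args ) throws Exception {
        JobConf conf = new JobConf( WordCountJob.class );
        conf.setJobName( "wordcount" );

        // The key/value types emitted by the reducer
        conf.setOutputKeyClass( Text.class );
        conf.setOutputValueClass( IntWritable.class );

        // Mapper, combiner (a local reduce), and reducer
        conf.setMapperClass( WordCountMap.class );       // hypothetical Mapper implementation
        conf.setCombinerClass( WordCountReduce.class );  // reuse the reducer as a combiner
        conf.setReducerClass( WordCountReduce.class );   // hypothetical Reducer implementation

        // Partitioner, InputFormat, and OutputFormat (these are the defaults)
        conf.setPartitionerClass( HashPartitioner.class );
        conf.setInputFormat( TextInputFormat.class );
        conf.setOutputFormat( TextOutputFormat.class );

        // Where to read the input from and write the results to in HDFS
        FileInputFormat.setInputPaths( conf, new Path( args[0] ) );
        FileOutputFormat.setOutputPath( conf, new Path( args[1] ) );

        JobClient.runJob( conf );
    }
}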
Additionally, Hadoop applications are deployed to an infrastructure that supports its high level of scalability and resilience. These components include:
- NameNode: The NameNode is the master of the HDFS that controls slave DataNode daemons; it understands where all of your data is stored, how the data is broken into blocks, what nodes those blocks are deployed to, and the overall health of the distributed filesystem. In short, it is the most important node in the entire Hadoop cluster. Each cluster has one NameNode, and the NameNode is a single point of failure in a Hadoop cluster.
- Secondary NameNode: The Secondary NameNode monitors the state of the HDFS cluster and takes “snapshots” of the data contained in the NameNode. If the NameNode fails, then the Secondary NameNode can be used in its place. Promoting it requires human intervention, however; there is no automatic failover from the NameNode to the Secondary NameNode, but having the Secondary NameNode helps ensure that data loss is minimal. Like the NameNode, each cluster has a single Secondary NameNode.
- DataNode: Each slave node in your Hadoop cluster will host a DataNode. The DataNode is responsible for performing data management: It reads its data blocks from the HDFS, manages the data on each physical node, and reports back to the NameNode with data management status.
- JobTracker: The JobTracker daemon is your liaison between your application and Hadoop itself. There is one JobTracker configured per Hadoop cluster and, when you submit your code to be executed on the Hadoop cluster, it is the JobTracker’s responsibility to build an execution plan. This execution plan includes determining the nodes that contain data to operate on, arranging nodes to correspond with data, monitoring running tasks, and relaunching tasks if they fail.
- TaskTracker: Similar to how data storage follows the master/slave architecture, code execution also follows the master/slave architecture. Each slave node will have a TaskTracker daemon that is responsible for executing the tasks sent to it by the JobTracker and communicating the status of the job (and a heartbeat) with the JobTracker.
Figure 1 tries to put all of these components together in one pretty crazy diagram.
Figure 1 Hadoop application and infrastructure interactions
Figure 1 shows the relationships between the master node and the slave nodes. The master node contains two important components: the NameNode, which manages the cluster and is in charge of all data, and the JobTracker, which manages the code to be executed and all of the TaskTracker daemons. Each slave node has both a TaskTracker daemon as well as a DataNode: the TaskTracker receives its instructions from the JobTracker and executes map and reduce processes, while the DataNode receives its data from the NameNode and manages the data contained on the slave node. And of course there is a Secondary NameNode listening to updates from the NameNode.
MapReduce
MapReduce is a functional programming paradigm that is well suited to handling parallel processing of huge data sets distributed across a large number of computers; in other words, MapReduce is the application paradigm supported by Hadoop and the infrastructure presented in this article. MapReduce, as its name implies, works in two steps:
- Map: The map step essentially solves a small problem: Hadoop’s partitioner divides the problem into small workable subsets and assigns those to map processes to solve.
- Reduce: The reducer combines the results of the mapping processes and forms the output of the MapReduce operation.
My Map definition purposely used the word “essentially” because one of the things that gives the Map step its name is its implementation: while it does solve small workable problems, it does so by mapping specific keys to specific values. For example, if we were to count the number of times each word appears in a book, our MapReduce application would output each word as a key and the number of times it is seen as the value. More specifically, the book would probably be broken up into sentences or paragraphs, and the Map step would return each word mapped either to the number of times it appears in the sentence or to “1” for each occurrence of every word, and then the reducer would combine the keys by adding their values together.
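To make that flow concrete, here is roughly what the key/value pairs would look like for a single, hypothetical input sentence, using the “1 per occurrence” approach (Hadoop groups and sorts the map output by key before handing it to the reducer):

Input sentence:  “to be or not to be”
Map output:      (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
Reduce input:    (be, [1, 1]), (not, [1]), (or, [1]), (to, [1, 1])
Reduce output:   (be, 2), (not, 1), (or, 1), (to, 2)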
Listing 1 shows a Java/pseudocode example of how the map and reduce functions might work to solve this problem.
Listing 1 - Java/Pseudocode for MapReduce
public void map( String name, String sentence, OutputCollector output ) {
    // Emit each word in the sentence with a count of 1
    for( String word : sentence.split( " " ) ) {
        output.collect( word, 1 );
    }
}

public void reduce( String word, Iterator values, OutputCollector output ) {
    // Add up all of the counts collected for this word
    int sum = 0;
    while( values.hasNext() ) {
        sum += values.next().get();
    }
    output.collect( word, sum );
}
Listing 1 does not contain code that actually works, but it does illustrate from a high level how such a task would be implemented in a handful of lines of code. Prior to submitting your job to Hadoop, you would first load your data into Hadoop. It would then distribute your data, in blocks, to the various slave nodes in its cluster. Then when you did submit your job to Hadoop, it would distribute your code to the slave nodes and have each map and reduce task process data on that slave node. Your map task would iterate over every word in the data block passed to it (assuming a sentence in this example) and output the word as the key and “1” as the value. The reduce task would then receive all instances of values mapped to a particular key; for example, it may have 1,000 values of “1” mapped to the word “apple”, which would mean that there are 1,000 apples in the text. The reduce task sums up all of the values and outputs that as its result. Then your Hadoop job would be set up to handle all of the output from the various reduce tasks.
This way of thinking is quite a bit different from how you might have approached the problem without using MapReduce, but it will become clearer in the next article on writing MapReduce applications, in which we build several working examples.
Summary
This article described what Hadoop is and presented an overview of its architecture. Hadoop is an open-source framework for developing and executing distributed applications that process very large amounts of data. It provides the infrastructure that distributes data across a multitude of machines in a cluster and that pushes analysis code to nodes closest to the data being analyzed. Your job is to write MapReduce applications that leverage this infrastructure to analyze your data.
The next article in this series, Building a MapReduce Application with Hadoop, will demonstrate how to set up a development environment and build MapReduce applications, which should give you a good feel for how this new paradigm works. And then the final installment in this series will walk you through setting up and managing a Hadoop production environment.