HBase Data Analysis with MapReduce
The first two articles in this series (see Part 1 and Part 2) introduced HBase, also known as the Hadoop database. HBase runs on top of the Hadoop Distributed File System (HDFS), which affords it huge scalability and provides two types of access: real-time direct access in a key/value store paradigm, and offline or batch access via MapReduce. The first article, “Introduction to HBase,” described how HBase manages and organizes data, demonstrated how to set up a local installation, and presented the HBase shell. The second article, “Programming HBase with Java,” presented an overview of interacting with HBase from Java and described how to insert and update data, retrieve data, delete data, and perform a table scan. This article completes the series by reviewing Hadoop’s MapReduce programming paradigm and demonstrating how to build a MapReduce application that accesses data stored in HBase tables.
Introduction to Hadoop
Before we dive into building MapReduce applications that operate on HBase data, we need to start with a quick overview of Hadoop. For a deeper look, you can read a full three-part series about Hadoop that I wrote on InformIT here.
The basic premise underlying Hadoop is to push the analysis code close to the data it is intended to analyze. Hadoop applications are functional in nature and divide work into two steps, sketched in code after the following list:
- Map: The map step solves a small piece of the overall problem: Hadoop divides the input into small, workable subsets and assigns each subset to a map process to solve.
- Reduce: The reduce step combines the results of the mapping processes and forms the output.
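In code, these two steps correspond to the map() and reduce() methods of Hadoop’s Mapper and Reducer base classes in the org.apache.hadoop.mapreduce package. The following is a minimal sketch of that contract rather than working analysis code; the class names and the Writable key/value types are placeholders chosen for illustration:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: called once per input key/value pair; derives and emits
// zero or more intermediate key/value pairs.
class ExampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}

// Reduce: called once per intermediate key, with all of the values the
// map processes emitted for that key; emits the job's final output.
class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}

The four generic type parameters on each class declare the input key/value types and the output key/value types, in that order.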
Figure 1 presents a high-level overview of the entire end-to-end process.
Figure 1. Hadoop MapReduce process
Figure 1 shows the following process, which the code sketch after the list illustrates end to end:
- Hadoop splits the analysis data into chunks using an InputFormat.
- Each chunk is sent to a map process in the form of a value mapped to a key.
- The map process derives a new key/value pair from each input pair. For example, the input might be a line number (key) and a text string containing one entry in a log file (value), and the output might be the hour of day of that entry (key) and a count of observations (value), which would be “1” because the map process is analyzing only one line.
- Hadoop performs a shuffle and sort operation that groups together all values emitted for the same key. For example, it might produce an entry whose key is the hour of day of the log file entries and whose value is the list of all counts emitted by the map processes for that hour.
- This new key/value pair is sent to the reduce process, which derives yet another key/value pair. For example, it might emit the hour of day and the total count of all observations (sum of all values in the value list).
- Finally, all key/value pairs emitted by the reduce process are written somewhere, such as to a file system or to an HBase table, by the OutputFormat.
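Putting the whole process together, a complete job for the hour-of-day example might look like the following sketch. The log format (with the hour of day at a fixed offset in each line), the class names, and the command-line input/output paths are assumptions made here for illustration, but the Job, TextInputFormat, and TextOutputFormat calls are the standard Hadoop MapReduce API:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HourOfDayCount {

    // Step 3: the input key is the position of a line in the log file and
    // the value is the line itself; the mapper emits the entry's hour of
    // day as the key and a count of 1 as the value.
    public static class HourMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text hour = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical log format: "2014-02-18 14:23:51 ...", so the
            // hour of day occupies characters 11-12 of each line.
            String line = value.toString();
            if (line.length() >= 13) {
                hour.set(line.substring(11, 13));
                context.write(hour, ONE);
            }
        }
    }

    // Step 5: after the shuffle and sort (step 4), the reducer receives
    // each hour along with the list of counts emitted for it, and emits
    // the hour and the sum of those counts.
    public static class HourReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hour-of-day count");
        job.setJarByClass(HourOfDayCount.class);

        // Steps 1 and 6: the InputFormat splits the input into chunks and
        // the OutputFormat writes the reduce output (here, to text files).
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(HourMapper.class);
        job.setReducerClass(HourReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

One subtlety: TextInputFormat actually supplies the byte offset of each line as the key rather than a line number, but because the mapper ignores the key entirely, the distinction does not matter here.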
Thinking in MapReduce can take a little getting used to, but once it clicks, it is actually quite elegant. To get you started, I implemented this specific example in Applied Big Data Analysis in the Real World with MapReduce and Hadoop. Furthermore, I highly recommend MapReduce Design Patterns by Donald Miner and Adam Shook, which you can find on Safari.