Warping Out with Hash-Table Containers in C++11
Four of the more interesting C++11 extensions to the Standard Template Library (STL) are new container types: namely, unordered (hash table) versions of the map, set, multimap, and multiset templates. These new containers provide faster lookup times than their standard ordered versions.
Just how much faster are they? Before I answer that question, let's review their purpose. These are all associative container templates, which store values by key rather than by order of insertion. Because they're templates, each is a container that can be built around almost any type of data you specify.
Map Templates
To use a map, you give a key value, and you "automagically" get back the value associated with it. In essence, a map is a rudimentary database. For example, suppose you create a map that associates a key value—employee name—with an integer containing the employee's salary. Specifically, we want a container to store string/integer pairs, in which each element can be looked up directly according to its string value.
Here are the necessary declarations:
#include <map>
#include <string>

using namespace std;

map<string, int> EmpSalaries;
The salary for each employee in the company can be stored in the container, called EmpSalaries. For example, let's associate the name JoeBloggs with a number:
EmpSalaries["JoeBloggs"] = 35000;
We'll give another employee, SueSmartie, a higher salary:
EmpSalaries["SueSmartie"] = 42000;
If no key value exists (for example, JoeBloggs isn't yet in the database), such a statement inserts the new element. If the key value already exists (JoeBloggs is in the database), the corresponding element is given the new associated value. In the earlier examples, JoeBloggs and SueSmartie would be given the new settings 35000 and 42000 if they were already in the database. If they weren't in the database, they would be inserted as new elements.
You can retrieve a value with a statement such as this:
// Print Joe's salary.
cout << EmpSalaries["JoeBloggs"];
The use of the subscript operator makes a map look suspiciously like a random-access container such as an array. But in fact it's not. Lookup operations with a true random-access container (that is, an array) are virtually cost-free. There is almost no difference in access time between the following operations, especially if you're working with a good optimizing compiler:
i = 0;
arr[n] = 0;
*p = 0;
In contrast, each map element lookup has an associated cost; therefore, lookups should be minimized whenever possible. If you need to perform a number of operations on a map element, first you should get an iterator that points to it and do operations using that iterator:
map<string, int>::iterator iter;
iter = EmpSalaries.find("SueSmartie");
No Free Lunch
Lookups are intended to be magical, in that you give a key value and then expect the STL to access the appropriate item in a single step. But of course nothing magic ever really happens inside a computer. A map is a useful data structure, but everything has a cost. Consider the ways in which overhead is involved:
- An element in a map is found by traversing an internally maintained binary tree of ordered values. The length of time taken is proportional to log base-2 of the number of elements.
- After each insertion or deletion, the binary tree is kept relatively balanced by executing a tree-balancing algorithm. One popular structure is the red-black tree, which guarantees that the farthest leaf is never more than twice as far from the root as the nearest leaf.
The STL templates do all this work so you never have to think about it, except in one respect: You must remember that cost in execution time.
It's useful to know that lookup costs increase logarithmically rather than linearly. Logarithmic growth means that as N (the number of elements) increases exponentially, lookup time increases arithmetically, as in the series 1, 2, 3, .... Therefore, if it takes time T to retrieve an item from a container of 100 elements, it takes roughly 2T to retrieve an item from a container of 10,000 elements, and 3T to retrieve an item from a container of a million elements. So if it costs 5 milliseconds to look up an item from a binary tree holding 100 elements, it costs roughly 15 milliseconds to look up an item from a tree holding a million elements.
Binary-tree storage is therefore especially efficient for situations in which you need to retrieve an item from an exceptionally large data set. If you stored all your values in a simple array and then searched the array every time you needed to look up a value, these lookups would take an unacceptably long time for, say, data sets of a million elements—because searching an array or linked list increases linearly in proportion to the number of elements.
But there's an even faster alternative: using hash tables.
How Hash Tables Save Time
It would be nice to have a more direct technique for taking a key value, such as a string, and transforming it into a data location. We'd like to translate the string into an address and then go directly to the data, rather than traversing a tree. Can't we just apply a mathematical transformation that gives us an index number, hoping that the elements end up at different addresses?
That's the idea of hash tables, but the technique—although it makes for faster lookups generally—creates new issues that have to be addressed. For example, when you create a map, there should be no limitations on input. There should be no patterns of user input that potentially degrade performance. But it's not uncommon for data-entry staff to input elements in alphabetical order—such as entering the key values Babar, Babykins, Barney, Barnum, and Barzini in a single session, and not getting to the letters C and D until later. The point is that we can't assume the user won't enter such an input pattern.
Because these names are so similar, they're likely to be placed at the same address unless we have a technique to prevent that from happening. When elements are assigned to the same location (and some of that is inevitable in any case), they're placed in the same bucket, which is a location that can hold more than one item. When a bucket holds multiple items, they're chained together in a linked list. Every time one of those items is accessed, this list must be searched in a linear fashion. Obviously you want to minimize such searches.
The bottom line is this: The more elements in the bucket, the longer the access time. Therefore, the goal is to place each element into its own bucket as much as possible.
Paradoxically, even though we're not trying to inconvenience the user, the problem of hash-table performance is similar to that of encryption. We don't want to frustrate the user by introducing chaos, but we need to resist the tendency of similar-looking keys to end up in the same bucket, no matter which pattern the data-entry person uses to input the values.
For this reason, hash-table containers are also unordered containers. The process of producing a hash number (called hashing) aims at producing pseudo-random results. The process isn't truly random, of course, because an input of x produces the same y every time. But two x values, however similar, ought to produce widely different y values. The only way to do that is to produce hash numbers that are seemingly random. There is an underlying order, but it cannot be an order that's perceived as useful by the user.
If you need to use a map, and only to look up individual items, the use of an unordered (hash-table) container can speed up your applications significantly. But beware of these potential downsides:
- Unordered containers tend to take up more space in memory, as there will always be some unused bucket positions.
- As I mentioned earlier, there is no meaningful order, except that in the multi versions elements with equivalent keys are grouped together. If two keys differ by so much as a single character, they probably won't be adjacent in the container's internal ordering.
- The key-value type must support a hashing function. The STL provides hashing functions for all primitive types as well as the string and basic_string types, and it takes care of these cases for you. But if you have a complex key-value type (for example, a key value consisting of two strings), you'll have to supply your own hashing function.
Let's assume that you don't need to iterate in a meaningful way through the container, and you only need to look up one element at a time. You have a good reason for using unordered containers. In that case, how much can you expect to improve lookup times?
I performed experiments to compare ordered-map to unordered-map performance. My methodology was to use integer keys 0, 10, 20, 30, and so on, up to but not including 99000. (99000 is easier than 100000 to enter without making a typo.) That sequence of integers generates 9,900 key values, close to 10,000.
The result was that an unordered-map (hash-table) lookup took roughly 40% less time on average than an ordered-map lookup in Microsoft Visual Studio C++. I used the November 2012 release of that compiler.
To get a time metric, I used the C library's clock() function, which returns the number of clock ticks elapsed since the program began; in the Microsoft C runtime, one tick is one millisecond (a thousandth of a second). To use the clock() function, include the <ctime> (or <time.h>) header:
#include <iostream>
#include <ctime>
#include <map>
#include <unordered_map>

using namespace std;

map<int, int> o_map;            // Ordered map
unordered_map<int, int> u_map;  // UNordered map
With these declarations in place, you can build two maps, one ordered and one unordered, consisting of the keys 0, 10, 20, and so on, up to 99000. Finally, you can use the clock() function as follows, to time 9,900 lookups in an ordered map:
int sum = 0;
int t1 = clock();
for (int i = 0; i < 9900; ++i) {
    sum += o_map[i * 10];
}
int t2 = clock();
cout << " ordered lookups took ";
cout << t2 - t1 << " millisecs" << endl;
Similarly for the unordered container:
sum = 0;
t1 = clock();
for (int i = 0; i < 9900; ++i) {
    sum += u_map[i * 10];
}
t2 = clock();
cout << "UNordered lookups took ";
cout << t2 - t1 << " millisecs" << endl;
A single running of this code may not produce typical results. It's necessary to run it a number of times. If you do, you should get results that tend to be similar to the following (if you're using the November 2012 build of Microsoft C++):
 ordered lookups took 546 millisecs
UNordered lookups took 343 millisecs
From this evidence, we can see that lookups in an unordered container are definitely faster, in this case by two-tenths of a second. In today's world, two-tenths of a second can be a long time, especially when you're performing many lookups and want them to be as fast as possible. But remember that hash-table containers lose the ability to iterate through the elements in a meaningful way, which for some applications is a serious loss in functionality.
Fortunately, the coding techniques for ordered and unordered containers are so similar that it's possible to experiment easily in your own programs. Only a few methods are supported by one type of container (map, for example) but not by its corresponding container type (unordered map). You can therefore experiment easily by switching between one container type and another, as I've done here.
Summary
In the end, it's up to you: Use an ordered map and retain the ability to iterate through the container in a meaningful way, or use an unordered map and make your individual lookups possibly 50% faster or more. I encourage you to make the changes to your code, use the clock() function, and test your code as I've done. Hash tables can make a difference.