Using Library Algorithms in C++
Many container operations apply to more than one type of container. For example, vector, string, and list allow us to insert elements by calling insert and remove elements by calling erase. These operations have the same interface for each type that supports them. For that matter, many container operations also apply to the string class.
Every containeras well as the string classprovides companion iterator types, which let us navigate through a container and examine its elements. Again, the library ensures that every iterator that supplies an operation does so through the same interface. For example, we can use the ++ operator to advance any type of iterator from one element to the next, we can use the * operator to access the element associated with any type of iterator, and so on.
In this article, we'll see how the library exploits these common interfaces to provide a collection of standard algorithms. By using these algorithms, we can avoid writing (and rewriting) the same code over and over again. More important, we can write programs that are smaller and simpler than we would write otherwisesometimes astonishingly so.
Like containers and iterators, algorithms also use consistent interface conventions. This consistency lets us learn a few of the algorithms and then apply that knowledge to others as the need arises. In this chapter, we'll use several of the library algorithms to solve problems related to processing strings and student grades. Along the way, we'll cover most of the core concepts in the algorithm library.
Unless we say otherwise, the <algorithm> header defines all the algorithms that we introduce in this chapter.
Analyzing strings
We can use this loop to concatenate two character pictures:
for (vector<string>::const_iterator it = bottom.begin(); it ! = bottom.end(); ++it) ret.push_back(*it);
This loop is equivalent to inserting a copy of the elements of bottom at the end of ret, an operation that vectors provides directly:
ret.insert(ret.end(), bottom.begin(), bottom.end());
This problem has an even more general solution: We can separate the notion of copying elements from that of inserting elements at the end of a container, as follows:
copy(bottom.begin(), bottom.end(), back_inserter(ret));
Here, copy is an example of a generic algorithm, and back_inserter is an example of an iterator adaptor.
A generic algorithm is an algorithm that is not part of any particular kind of container, but instead it takes a cue from its arguments' types about how to access the data it uses. The standard library's generic algorithms usually take iterators among their arguments, which they use to manipulate the elements of the underlying containers. So, for example, the copy algorithm takes three iterators, which we'll call begin, end, and out, and copies all the elements in the range [begin, end) to a sequence of elements starting at out and extending as far as necessary. In other words,
copy(begin, end, out);
has the same effect as
while (begin != end) *out++ = *begin++;
except that the while body changes the values of the iterators, and copy doesn't.
Before we describe iterator adaptors, we should note that this loop depends on the use of the postfix version of the increment operators. These operators differ from the prefix versions, which we have used up to now, in that begin++ returns a copy of the original value of begin, incrementing the stored value of begin as a side effect. In other words,
it = begin++;
is equivalent to
it = begin; ++begin;
The increment operators have the same precedence as *, and they are both right-associative, which means that *out++ has the same meaning as *(out++). Thus,
*out++ = *begin++;
is equivalent to the more verbose
{ *out = *begin; ++out; ++begin; }
Let's return to iterator adaptors, which are functions that yield iterators with properties that are related to their arguments in useful ways. The iterator adaptors are defined in <iterator>. The most common iterator adaptor is back_inserter, which takes a container as its argument and yields an iterator that, when used as a destination, appends values to the container. For example, back_inserter(ret) is an iterator that, when used as a destination, appends elements to ret. Therefore,
copy(bottom.begin(), bottom.end(), back_inserter(ret));
copies all of the elements of bottom and appends them to the end of ret. After this function completes, the size of ret will have increased by bottom.size().
Notice that we could not call
// errorretis not an iterator copy(bottom.begin(), bottom.end(), ret);
because copy's third parameter is required to be an iterator, and we supplied a container as the corresponding argument. Nor could we call
// errorno element at ret.end() copy(bottom.begin(), bottom.end(), ret.end());
This latter mistake is particularly insidious because the program will compile. What it does when you try to run it is another story entirely. The first thing copy will try to do is assign a value to the element at ret.end(). There's no element there, so what the implementation will do is anybody's guess.
Why is copy designed this way? Because separating the notions of copying elements and expanding a container allows programmers to choose which operations to use. For example, we might want to copy elements on top of elements that already exist in a container, without changing the container's size. As another example, we might want to use back_inserter to append elements to a container that are not merely copies of another container's elements.
Another Way to split
Another function that we can write more directly using the standard algorithms is split. The hard part of writing that function was dealing with the indices that delimited each word in the input line. We can replace the indices by iterators and use standard-library algorithms to do much of the work for us:
// true if the argument is whitespace, false otherwise bool space(char c) { return isspace(c); } // false if the argument is whitespace, true otherwise bool not_space(char c) { return !isspace(c); } vector<string> split(const string& str) { typedef string::const_iterator iter; vector<string> ret; iter i = str.begin(); while (i != str.end()) { // ignore leading blanks i = find_if(i, str.end(), not_space); // find end of next word iter j = find_if(i, str.end(), space); // copy the characters in [i, j) if (i != str.end()) ret.push_back(string(i, j)); i = j; } return ret; }
This code uses a lot of new functions, so it will take a bit of explanation. The key idea to keep in mind is that it implements the same algorithm as the original, using i and j to delimit each word in str. Once we've found a word, we copy it from str and push the copy onto the back of ret.
This time, i and j are iterators, not indices. We use typedef to abbreviate the iterator type so that we can use iter instead of the longer string::const_iterator. Although the string type does not support all of the container operations, it does support iterators. Therefore, we can use the standard-library algorithms on the characters of a string, just as we can use them on the elements of a vector.
The algorithm that we use in this example is find_if. Its first two arguments are iterators that denote a sequence; the third is a predicate, which tests its argument and returns true or false. The find_if function calls the predicate on each element in the sequence, stopping when it finds an element for which the predicate yields true.
The standard library provides an isspace function to test whether a character is a space. However, that function is overloaded so that it will work with languages such as Japanese that use other character types, such as wchar_t ( § 1.3/14). It's not easy to pass an overloaded function directly as an argument to a template function. The trouble is that the compiler doesn't know which version of the overloaded function we mean because we haven't supplied any arguments that the compiler might use to select a version. Accordingly, we'll write our own predicates, called space and not_space, that make clear which version of isspace we intend.
The first call to find_if seeks the first nonspace character that begins a word. Remember that one or more spaces might begin a line or might separate adjacent words in the input. We don't want to include these spaces in the output.
After the first call to find_if, i will denote the first nonspace, if any, in str. We use i in the next call to find_if, which looks for the first space in [i, str.end()). If find_if fails to find a value that satisfies the predicate, it returns its second argument, which, in this case, is str.end(). Therefore, j will be initialized to denote the blank that separates the next word in str from the rest of the line, or, if we are on the last word in the line, j will be equal to str.end().
At this point, i and j delimit a word in str. All that's left is to use these iterators to copy the data from str into ret. In the earlier version of split, we used string::substr to create the copy. However, that version of split operated on indices, not iterators, and there isn't a version of substr that operates on iterators. Instead, we construct a new string directly from the iterators that we have. We do so by using an expression, string(i, j), that is somewhat similar to the definition of spaces. Our present example constructs a string that is a copy of the characters in the range [i, j). We push this new string onto the back of ret.
It is worth pointing out that this version of the program omits the tests of the index i against str.size(). Nor are there the obvious equivalent tests of the iterator against str.end(). The reason is that the library algorithms are written to handle gracefully calls that pass an empty range. For example, at some point the first call to find_if will set i to the value returned by str.end(), but there is no need to check i before passing it to the second call to find_if. The reason is that find_if will look in the empty range [i, str.end()) and will return str.end() to indicate that there is no match.
Palindromes
Another character-manipulation problem that we can use the library to solve succinctly is determining whether a word is a palindrome. Palindromes are words that are spelled the same way front to back and back to front. For example, civic, eye, level, madam, and rotor are all palindromes.
Here is a compact solution that uses the library:
bool is_palindrome(const string& s) { return equal(s.begin(), s.end(), s.rbegin()); }
The return statement in this function's body calls the equal function and the rbegin member function, both of which we have not yet seen.
Like begin, rbegin returns an iterator, but this time it is an iterator that starts with the last element in the container and marches backward through the container.
The equal function compares two sequences to determine whether they contain equal values. As usual, the first two iterators passed to equal specify the first sequence. The third argument is the starting point for the second sequence. The equal function assumes that the second sequence is the same size as the first, so it does not need an ending iterator. Because we pass s.rbegin() as the starting point for the second sequence, the effect of this call is to compare values from the back of s to values in the front. The equal function will compare the first character in s with the last. Then it will compare the second to the next-to-last, and so on. This behavior is precisely what we want.
Finding URLs
As the last of our examples of character manipulation, let's write a function that finds Web addresses, called uniform resource locators (URLs), that are embedded in a string. We might use such a function by creating a single string that holds the entire contents of a document. The function would then scan the document and find all the URLs in it.
A URL is a sequence of characters of the form:
protocol-name://resource-name
where protocol-name contains only letters, and resource-name may consist of letters, digits, and certain punctuation characters. Our function will take a string argument and will look for instances of :// in that string. Each time we find such an instance, we'll look for the protocol-name that precedes it and the resource-name that follows it.
Because we want our function to find all the URLs in its input, we'll want it to return a vector<string>, with one element for each URL. The function executes by moving the iterator b through the string, looking for the characters :// that might be a part of a URL. If we find these characters, it looks backward to find the protocol-name, and it looks forward to find the resource-name:
vector<string> find_urls(const string& s) { vector<string> ret; typedef string::const_iterator iter; iter b = s.begin(), e = s.end(); // look through the entire input while (b != e) { // look for one or more letters followed by :// b = url_beg(b, e); // if we found it if (b != e) { // get the rest of the URL iter after = url_end(b, e); // remember the URL ret.push_back(string(b, after)); // advance band check for more URLs on this line b = after; } } return ret; }
We start by declaring ret, which is the vector into which we will put the URLs as we find them, and by obtaining iterators that delimit the string. We will have to write the url_beg and url_end functions, which will find the beginning and end of any URL in the input. The url_beg function will be responsible for identifying whether a valid URL is present and, if so, for returning an iterator that refers to the first character of the protocol-name. If it does not identify a URL in the input, then it will return its second argument (e, in this case), to indicate failure.
If url_beg finds a URL, the next task is to find the end of the URL by calling url_end. That function will search from the given position until it reaches either the end of the input or a character that cannot be part of a URL. It will return an iterator positioned one past the last character in the URL.
Thus, after the calls to url_beg and url_end, the iterator b denotes the beginning of a URL, and the iterator after denotes the position one past the last character in the URL:
e text http ://http://www.acceleratedcpp.com more text 107 b = url_beg(b, e) after = url_end(b, e)
We construct a new string from the characters in this range and push that string onto the back of ret.
All that remains is to increment the value of b and to look for the next URL. Because URLs cannot overlap one another, we set b to (one past) the end of the URL that we just found and continue the while loop until we've looked at all the input. Once that loop exits, we return the vector that contains the URLs to our caller.
Now we have to think about url_beg and url_end. The url_end function is simpler, so we'll start there:
string::const_iterator url_end(string::const_iterator b, string::const_iterator e) { return find_if(b, e, not_url_char); }
This function just forwards its work to the library find_if function. The predicate that we pass to find_if is one that we will write, named not_url_char. It will return true when passed a character that cannot be in a URL:
bool not_url_char(char c) { // characters, in addition to alphanumerics, that can appear in a URL static const string url_ch = "~;/?:@=&$-_.+!*'(),"; // see whether c can appear in a URL and return the negative return !(isalnum(c) || find(url_ch.begin(), url_ch.end(), c) != url_ch.end()); }
Despite being small, this function uses a fair bit of new material. First is the use of the static storage class specifier. Local variables that are declared to be static are preserved across invocations of the function. Thus, we will construct and initialize the string url_ch only on the first call to not_url_char. Subsequent calls will use the object that the first call constructed. Because url_ch is a const string, its value will not change once we have initialized it.
The not_url_char function also uses the isalnum function, which the <cctype> header defines. This function tests whether its argument is an alphanumeric character (a letter or a digit).
Finally, find is another algorithm that we haven't used yet. It is similar to find_if, except that, instead of calling a predicate, it looks for the specific value given as its third argument. As with find_if, if the value that we want is present, the function returns an iterator denoting the first occurrence of the value in the given sequence. If the value is not found, then find returns its second argument.
With this information in hand, we can now understand the not_url_char function. Because we negate the value of the entire expression before we return it, not_url_char will yield false if c is a letter, a digit, or any of the characters in url_ch. If c is any other value, the function returns true.
Now the hard part begins: implementing url_beg. This function is messy because it must deal with the possibility that the input might contain :// in a context that cannot be a valid URL. In practice, we'd probably have a list of acceptable protocol-names and look only for those. For simplicity, though, we'll limit ourselves to being sure that one or more letters precede the :// separator and at least one character follows it:
string::const_iterator url_beg(string::const_iterator b, string::const_iterator e) { static const string sep = "://"; typedef string::const_iterator iter; // i marks where the separator was found iter i = b; while ((i = search(i, e, sep.begin(), sep.end())) != e) { // make sure the separator isn't at the beginning or end of the line if (i != b && i + sep.size() != e) { // beg marks the beginning of the protocol-name iter beg = i; while (beg != b && isalpha(beg[-1])) --beg; // is there at least one appropriate character before and after the separator? if (beg != i && !not_url_ char(i[sep.size()])) return beg; } // the separator we found wasn't part of a URL; advance i past this separator i += sep.size(); } return e; }
The easy part is writing the function header. We know that we'll be passed two iterators denoting the range in which to look and that we'll return an iterator that denotes the beginning of the first URL in that range, if one exists. We also declare and initialize a local string, which will hold the characters that make up the separator that identifies a potential URL. Like url_ch in the not_url_char function, this string is static and const. Thus, we will not be able to change the string, and its value will be created only on the first invocation of url_beg.
The function executes by placing two iterators into the string delimited by band e:
e 109 b text http://www.acceleratedcpp.com more text beg i
The iterator i will denote the beginning of the URL separator, if any, and beg will indicate the beginning of the protocol-name, if any.
The function first looks for the separator by calling search, a library function that we haven't used before. This function takes two pairs of iterators: The first pair denotes the sequence in which we are looking, and the second pair denotes the sequence that we want to locate. As with other library functions, if search fails, it returns the second iterator. Therefore, after the call to search, either i denotes (one past) the end of the input string, or it denotes a : that is followed by //.
If we found a separator, the next task is to get the letters (if any) that make up the protocol-name. We first check whether the separator is at the beginning or the end of the input. If the separator is in either of those places, we know that we don't have a URL because a URL has at least one character on each side of its separator. Otherwise, we need to try to position the iterator beg. The inner while loop moves beg backward through the input until it hits either a nonalphabetic character or the beginning of the string. It uses two new ideas: The first is the notion that if a container supports indexing, so do its iterators. In other words, beg[-1] is the character at the position immediately before the one that beg denotes. We can think of beg[-1] as an abbreviation for *(beg - 1). The second new idea is the isalpha function, defined in <cctype>, which tests whether its argument is a letter.
If we were able to advance the iterator over as much as a single character, we can assume that we've found a protocol-name. Before returning beg, we still have to check that there's at least one valid character following the separator. This test is more complicated. We know that there is at least one more character in the input because we're inside the body of an if that compares the value of i + sep.size() with e. We can access the first such character as i[sep.size()], which is an abbreviation for *(i + sep.size()). We test whether that character can appear in a URL by passing the character to not_url_char. This function returns true if the character is not valid, so we negate the return to check whether the character is valid.
If the separator is not part of a URL, then the function advances i past the separator and keeps looking.
This code uses the decrement operator, which we have not previously used. It works like the increment operator, but it decrements its operand instead. As with the increment operator, it comes in prefix and postfix versions. The prefix version, which we use here, decrements its operand and returns the new value.