17.7 Using Metaprograms to Unroll Loops
One of the first practical applications of metaprogramming was the unrolling of loops for numeric computations, which is shown here as a complete example.
Numeric applications often have to process n-dimensional arrays or mathematical vectors. One typical operation is the computation of the so-called dot product. The dot product of two mathematical vectors a and b is the sum of the products of corresponding elements in both vectors. For example, if each vector has three elements, the result is
a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
A mathematical library typically provides a function to compute such a dot product. Consider the following straightforward implementation:
// meta/loop1.hpp
#ifndef LOOP1_HPP
#define LOOP1_HPP

template <typename T>
inline T dot_product (int dim, T* a, T* b)
{
    T result = 0;
    for (int i=0; i<dim; ++i) {
        result += a[i]*b[i];
    }
    return result;
}

#endif  // LOOP1_HPP
When we call this function as follows
// meta/loop1.cpp
#include <iostream>
#include "loop1.hpp"

int main()
{
    int a[3] = { 1, 2, 3 };
    int b[3] = { 5, 6, 7 };
    std::cout << "dot_product(3,a,b) = " << dot_product(3,a,b) << '\n';
    std::cout << "dot_product(3,a,a) = " << dot_product(3,a,a) << '\n';
}
we get the following result:
dot_product(3,a,b) = 38
dot_product(3,a,a) = 14
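As a quick check against the definition of the dot product, the two values work out as follows:

    dot_product(3,a,b) = 1*5 + 2*6 + 3*7 = 5 + 12 + 21 = 38
    dot_product(3,a,a) = 1*1 + 2*2 + 3*3 = 1 +  4 +  9 = 14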
This is correct, but it takes too long for serious high-performance applications. Even declaring the function inline is often not sufficient to attain optimal performance.
The problem is that compilers usually optimize loops for many iterations, which is counterproductive in this case. Simply expanding the loop to
a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
would be a lot better.
Of course, this performance difference doesn't matter if we compute only a few dot products from time to time. But if we use this library component to perform millions of dot product computations, the difference becomes significant.
Of course, we could write the computation directly instead of calling dot_product(), or we could provide special functions for dot product computations with only a few dimensions, but this is tedious. Template metaprogramming solves this issue for us: We "program" the unrolling of the loops. Here is the metaprogram:
// meta/loop2.hpp
#ifndef LOOP2_HPP
#define LOOP2_HPP

// primary template
template <int DIM, typename T>
class DotProduct {
  public:
    static T result (T* a, T* b) {
        return *a * *b  +  DotProduct<DIM-1,T>::result(a+1,b+1);
    }
};

// partial specialization as end criterion
template <typename T>
class DotProduct<1,T> {
  public:
    static T result (T* a, T* b) {
        return *a * *b;
    }
};

// convenience function
template <int DIM, typename T>
inline T dot_product (T* a, T* b)
{
    return DotProduct<DIM,T>::result(a,b);
}

#endif  // LOOP2_HPP
Now, by changing your application program only slightly, you can get the same result:
// meta/loop2.cpp
#include <iostream>
#include "loop2.hpp"

int main()
{
    int a[3] = { 1, 2, 3 };
    int b[3] = { 5, 6, 7 };
    std::cout << "dot_product<3>(a,b) = " << dot_product<3>(a,b) << '\n';
    std::cout << "dot_product<3>(a,a) = " << dot_product<3>(a,a) << '\n';
}
Instead of writing
dot_product(3,a,b)
we write
dot_product<3>(a,b)
This expression instantiates a convenience function template that translates the call into
DotProduct<3,int>::result(a,b)
And this is the start of the metaprogram.
Inside the metaprogram the result is the product of the first elements of a and b plus the result of the dot product of the remaining dimensions of the vectors starting with their next elements:
template <int DIM, typename T>
class DotProduct {
  public:
    static T result (T* a, T* b) {
        return *a * *b  +  DotProduct<DIM-1,T>::result(a+1,b+1);
    }
};
The end criterion is the case of a one-dimensional vector:
template <typename T>
class DotProduct<1,T> {
  public:
    static T result (T* a, T* b) {
        return *a * *b;
    }
};
Thus, for
dot_product<3>(a,b)
the instantiation process computes the following:
DotProduct<3,int>::result(a,b)
  = *a * *b
    + DotProduct<2,int>::result(a+1,b+1)
  = *a * *b
    + *(a+1) * *(b+1)
    + DotProduct<1,int>::result(a+2,b+2)
  = *a * *b
    + *(a+1) * *(b+1)
    + *(a+2) * *(b+2)
Note that this way of programming requires that the number of dimensions is known at compile time, which is often (but not always) the case.
Libraries, such as Blitz++ (see [Blitz++]), the MTL library (see [MTL]), and POOMA (see [POOMA]), use these kinds of metaprograms to provide fast routines for numeric linear algebra. Such metaprograms often do a better job than optimizers because they can integrate higher-level knowledge into the computations.2 The industrial-strength implementation of such libraries involves many more details than the template-related issues we present here. Indeed, reckless unrolling does not always lead to optimal running times. However, these additional engineering considerations fall outside the scope of our text.