But It Was Faster!
One common trap in optimizing code is to rely on micro-benchmarks that test a particular approach in isolation, which leads people to make very heavy use of things like function inlining and C++ templates. If you inline a function, you avoid the overhead of the function call, which makes things faster. If you use a C++ template, you get a compile-time expansion, so you avoid the cost of a runtime lookup. However, both techniques come at the expense of increased code size. There's a growing belief that code size doesn't matter, but this idea is misguided. The Core 2 Duo has a 32KB level-1 instruction cache, and typically around 4MB of level-2 cache that's shared by instructions and data. The more time your code spends running out of the L1 cache, the faster it will run. The more of the L2 cache your code occupies, the less is left for your data, and the slower the program will run.
For code that's used frequently, it's fairly common for the cost of inlining it all over the place to be much greater than the cost of doing a function call. This difference in cost is one reason a C++ program can turn out slower than something written in a more dynamic language, despite all of your profiling predicting the opposite result. The C++ program, with its repeated template instantiations, causes a lot more cache misses because of its larger code size.
In effect, this is just an extension of the "Don't repeat yourself" rule; it applies to compiled code almost as much as to source code. You typically can't fit your code in L1 cache, but if you can fit the code that's used 90% of the time in L2 cache, then you'll see very good performance (that is, until you start getting cache misses for data access).
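One way to apply the rule to compiled code is to keep the real work in a single non-template function and make the template a thin forwarding layer, so each new element type adds only a few instructions rather than a full copy of the loop. The following is a minimal sketch of that idea, not a recommendation for any particular codebase. The names find_bytes and find_value are hypothetical, and the bytewise comparison assumes element types with no padding, for which comparing bytes gives the same answer as ==.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Non-template core (hypothetical name). It is compiled once and shared by
// every element type. It returns the index of the first element whose bytes
// match those at `key`, or `count` if nothing matches.
std::size_t find_bytes(const void* base, std::size_t count,
                       std::size_t elem_size, const void* key) {
    assert(elem_size > 0);
    const unsigned char* p = static_cast<const unsigned char*>(base);
    for (std::size_t i = 0; i < count; ++i) {
        if (std::memcmp(p + i * elem_size, key, elem_size) == 0) {
            return i;
        }
    }
    return count;
}

// Thin template wrapper (hypothetical name). The only per-type code is this
// forwarding call, so instantiating it for many types costs little space.
// Assumption: T has no padding and bytewise equality matches operator==.
template <typename T>
std::size_t find_value(const T* data, std::size_t count, const T& key) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "find_value compares raw bytes");
    return find_bytes(data, count, sizeof(T), &key);
}
```

The trade-off is the one described above: the shared loop pays for a runtime element size and a memcmp call on every iteration, which a micro-benchmark will probably report as slower, but only one copy of the loop competes for cache space.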
This effect is very hard to measure, because it depends a lot on the state of the system. When running benchmarks, it's common to avoid running anything else. Unfortunately, that's not how most code will end up being used; it will run concurrently with a lot of other programs, all wanting a slice of that cache. Of the 4MB L2 cache, your program and its data may have only 128KB available, and that's when you really start to notice speed problems from code bloat.
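A benchmark can get a little closer to that situation by deliberately disturbing the cache before each timed run. The sketch below is one rough way to do so, under several assumptions: the 64-byte cache line, the helper names pollute_cache and time_cold, and the size of the junk buffer are all illustrative choices. Walking a large buffer mostly puts pressure on the shared level-2 cache; it does not directly flush the level-1 instruction cache, and it is only a crude stand-in for real programs competing for space.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Touch one byte in every 64-byte stride of `junk`, so that a buffer larger
// than the cache pushes out most of what was cached before. The stride is an
// assumed cache-line size, not a measured one.
unsigned long long pollute_cache(std::vector<unsigned char>& junk) {
    unsigned long long sum = 0;
    for (std::size_t i = 0; i < junk.size(); i += 64) {
        junk[i] = static_cast<unsigned char>(junk[i] + 1);
        sum += junk[i];
    }
    return sum;  // returned so the compiler cannot discard the loop
}

// Run `work` `runs` times, disturbing the cache before each run, and return
// the average time per run in nanoseconds. Only `work` itself is timed.
double time_cold(const std::function<void()>& work, int runs,
                 std::size_t junk_bytes) {
    assert(runs > 0);
    std::vector<unsigned char> junk(junk_bytes, 1);
    volatile unsigned long long sink = 0;
    auto total = std::chrono::steady_clock::duration::zero();
    for (int i = 0; i < runs; ++i) {
        sink = sink + pollute_cache(junk);
        auto start = std::chrono::steady_clock::now();
        work();
        total += std::chrono::steady_clock::now() - start;
    }
    return std::chrono::duration<double, std::nano>(total).count() / runs;
}
```

Timing the same function with and without the disturbance, using a junk buffer somewhat larger than the L2 cache, gives a rough sense of how much a heavily inlined version depends on having the cache to itself. The numbers will vary from machine to machine and from run to run, so they are best read as a comparison rather than as absolute costs.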