Hardware Sparse Arrays
The core of any dynamic language implementation is an efficient sparse array for mapping selector (method name) and receiver pairs to method implementations. It is hard to overstate the importance of this design. My Objective-C runtime library initially used a fairly naïve sparse array implementation: a dynamic tree structure. Tweaking it to use a static landing pad, rather than a pair of conditionals, in the case where no method was found resulted in around a 50% speed improvement in loops containing a lot of message sends (method calls). Since almost every operation in a dynamic language is a message send, this change matters a great deal.
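The landing-pad trick can be sketched in C as a two-level sparse array. The names and sizes here are invented for illustration, and the real runtime's version is more elaborate; the point is that every empty slot holds a pointer to a shared "not found" function rather than NULL, so the lookup fast path contains no conditionals at all.

```c
#include <stddef.h>

typedef void (*IMP)(void);

#define PAGE_BITS 4
#define PAGE_SIZE (1 << PAGE_BITS)
#define TOP_SIZE  16

/* Static landing pad: in a real runtime this would start the
 * slow-path lookup or forwarding machinery. */
static void slot_not_found(void) { }

/* One shared page whose every slot is the landing pad. */
static IMP empty_page[PAGE_SIZE];

typedef struct {
    IMP *pages[TOP_SIZE]; /* each entry is a real page or empty_page */
} SparseArray;

static void sparse_init(SparseArray *s) {
    for (size_t i = 0; i < PAGE_SIZE; i++) empty_page[i] = slot_not_found;
    for (size_t i = 0; i < TOP_SIZE; i++) s->pages[i] = empty_page;
}

/* Lookup: two indexed loads and no branches. A miss simply returns
 * the landing pad, which the caller jumps to like any other method. */
static IMP sparse_lookup(const SparseArray *s, unsigned selector) {
    return s->pages[(selector >> PAGE_BITS) % TOP_SIZE]
                   [selector % PAGE_SIZE];
}

/* Insertion allocates a real page (passed in by the caller here, to
 * keep the sketch allocation-free) on first write to that page. */
static void sparse_insert(SparseArray *s, unsigned selector,
                          IMP imp, IMP *fresh_page) {
    unsigned top = (selector >> PAGE_BITS) % TOP_SIZE;
    if (s->pages[top] == empty_page) {
        for (size_t i = 0; i < PAGE_SIZE; i++)
            fresh_page[i] = slot_not_found;
        s->pages[top] = fresh_page;
    }
    s->pages[top][selector % PAGE_SIZE] = imp;
}
```

Because a failed lookup yields slot_not_found rather than NULL, dispatch is an unconditional pair of loads followed by a jump, which is exactly what made the change worth around 50% in send-heavy loops.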
Sending a message is typically a factor of two or three more expensive than a direct C function call, so a lot of tricks are employed to avoid this overhead. A common one is polymorphic inline caching, where the mapping from class and selector (method name) to method address is stored at the call site for a small number of classes. This technique is difficult to implement correctly, because a dynamic language can update these mappings at any time. I get around this problem by storing a version number as well; the runtime triggers invalidations automatically on method updates by bumping the version, so stale entries are never used. The disadvantage is that the cache requires a few bytes at every call site, which can drive up data size considerably.
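A minimal sketch of such a versioned cache, assuming a single global version counter that is bumped on any method update (all names here — PIC, method_table_version, slow_lookup — are hypothetical, not the runtime's real structures):

```c
#include <stddef.h>

typedef long (*IMP)(long);

#define PIC_SIZE 4

/* Bumping this one counter cheaply invalidates every cache entry in
 * the program at once. */
static unsigned long method_table_version = 1;

typedef struct {
    void *cls;             /* receiver's class when the entry was filled */
    unsigned long version; /* version observed at fill time */
    IMP imp;               /* cached method address */
} PICEntry;

typedef struct {
    PICEntry entries[PIC_SIZE]; /* one of these lives at each call site */
    unsigned next;              /* round-robin replacement index */
} PIC;

/* Stand-in for the full class-and-selector lookup. */
static IMP slow_lookup(void *cls);

static IMP pic_dispatch(PIC *pic, void *cls) {
    for (unsigned i = 0; i < PIC_SIZE; i++) {
        PICEntry *e = &pic->entries[i];
        /* A hit requires both the class and the version to match. */
        if (e->cls == cls && e->version == method_table_version)
            return e->imp;
    }
    /* Miss, or a stale entry: do the slow lookup and refill one slot. */
    IMP imp = slow_lookup(cls);
    PICEntry *e = &pic->entries[pic->next++ % PIC_SIZE];
    e->cls = cls;
    e->version = method_table_version;
    e->imp = imp;
    return imp;
}
```

The per-call-site cost the text mentions is visible here: each send site carries PIC_SIZE entries of class pointer, version, and method address.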
Most modern CPUs already have a hardware implementation of an associative array in the form of a translation look-aside buffer (TLB) for handling virtual address translation. This is a mapping from virtual page addresses to physical page addresses, and is very efficient.
The most flexible implementation would probably be something like a 16- or 32-entry map combined with an INVOKE instruction, which would take a map identifier and a key as arguments and jump to the result of the lookup (or issue a user-mode trap if no value was found). This would allow very fast jumps to commonly used methods.
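Since no such instruction exists, its behaviour can only be modelled in software. This sketch assumes a 16-entry fully associative map and a user-mode trap handler for misses; all of the names are invented:

```c
#include <stddef.h>

#define MAP_ENTRIES 16

typedef void (*Target)(void);

typedef struct {
    int valid[MAP_ENTRIES];
    unsigned long keys[MAP_ENTRIES]; /* e.g. class/selector pairs */
    Target targets[MAP_ENTRIES];     /* method entry points */
    Target trap;                     /* user-mode miss handler */
} AssocMap;

/* Hardware would compare all sixteen keys in parallel, as a TLB
 * does; this model scans them one by one. */
static Target map_invoke(const AssocMap *m, unsigned long key) {
    for (int i = 0; i < MAP_ENTRIES; i++)
        if (m->valid[i] && m->keys[i] == key)
            return m->targets[i];
    return m->trap; /* INVOKE falls through to the trap handler */
}

/* The trap handler would do a full lookup and refill a slot. */
static void map_fill(AssocMap *m, int slot, unsigned long key, Target t) {
    m->keys[slot] = key;
    m->targets[slot] = t;
    m->valid[slot] = 1;
}
```

In hardware, the whole lookup-and-jump would be a single instruction, making a hit on a hot method nearly as cheap as a direct call.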