Floating-Point Enhancements
Floating-point computation is one of the areas where ARM has traditionally been weak. Early ARM chips included no floating-point hardware at all; later ones supported only single-precision floating point.
ARMv8 improves this support dramatically. The floating-point register set now has 32 registers, each 128 bits wide, so each one can store four single-precision or two double-precision values. The architecture now fully supports the IEEE 754 standard for floating point, including all of the strange rounding modes and not-a-number values (for example, the result of dividing zero by zero) that the specification requires.
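As a rough illustration of the register layout and the special values mentioned above, the following Python sketch counts how many single- or double-precision values fit in 128 bits and produces an infinity and a not-a-number value. The names REGISTER_BITS and lanes are invented for this example, and Python floats only model the IEEE 754 behavior; this is not ARMv8 code.

```python
import math
import struct

REGISTER_BITS = 128  # width of one ARMv8 floating-point register, per the text

def lanes(fmt):
    """Number of values of struct format `fmt` that fit in one register."""
    return REGISTER_BITS // (8 * struct.calcsize(fmt))

single_lanes = lanes("<f")  # 32-bit IEEE 754 single precision
double_lanes = lanes("<d")  # 64-bit IEEE 754 double precision

# Four singles packed back to back occupy exactly one register's worth of bytes.
packed = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

# IEEE 754 special values. Python raises an exception on float division by
# zero instead of returning them, so they are produced another way here.
infinity = math.inf                  # IEEE 754 result of nonzero / zero
not_a_number = math.inf - math.inf   # an invalid operation yields NaN

print(single_lanes, double_lanes, len(packed) * 8)
print(math.isinf(infinity), math.isnan(not_a_number))
print(not_a_number == not_a_number)  # NaN never compares equal, even to itself
```

The last line is the easiest way to recognize a NaN in practice: it is the only floating-point value that is not equal to itself.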
To put this improvement into perspective, SSE (found on modern x86 chips) provides only sixteen 128-bit registers. AVX, announced in 2008, extends them to 256 bits. This means that ARMv8 and x86 chips have the same-sized vector register files (32 × 128 bits and 16 × 256 bits are both 4,096 bits), but with different layouts.
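The size comparison is simple arithmetic, sketched here in Python. The register counts and widths are the ones quoted in the text; the dictionary and function names are made up for the example.

```python
# (register count, register width in bits), as quoted in the text
register_files = {
    "SSE": (16, 128),
    "AVX": (16, 256),
    "ARMv8": (32, 128),
}

def total_bits(count, width):
    """Total capacity of a register file in bits."""
    return count * width

totals = {name: total_bits(*shape) for name, shape in register_files.items()}

for name, bits in totals.items():
    print(f"{name}: {bits} bits ({bits // 8} bytes)")
```

AVX and ARMv8 come out equal at 4,096 bits, twice the capacity of SSE; what differs is how that space is divided into registers.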
Which of these is more useful in practice? In theory, the x86 approach should give better throughput for the same number of instructions, because it lets you operate on twice as much data at a time, but the cost is limited flexibility. There's a reason that we don't have 1024-bit vector coprocessors in desktop CPUs: As the size of the vector increases, the number of problems that can make use of it decreases. 128 bits was popular because it's very useful for 3D graphics. A four-component color or vertex stored as single-precision values fills one of those registers exactly.
If your code only makes use of 128-bit vectors, half of the register space in AVX is wasted. A lot of code uses SSE for purely scalar operations, because the source code is not amenable to vectorization. For code like this, AVX looks like a bank of 16 floating-point registers and NEON like a bank of 32 floating-point registers. This makes NEON much easier for compilers to target, because the register allocator has to do a lot less work to find a spare register.
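To put numbers on the wasted space, here is a small Python sketch. It assumes every register is in use and that scalar code keeps one 64-bit double-precision value per register; the utilization helper is hypothetical and exists only for this illustration.

```python
# (register count, register width in bits), as quoted in the text
REGISTER_FILES = {"AVX": (16, 256), "NEON": (32, 128)}

def utilization(count, width, bits_used_per_register):
    """Fraction of the register file holding live data when all registers are in use."""
    used = min(bits_used_per_register, width)
    return (count * used) / (count * width)

for name, (count, width) in REGISTER_FILES.items():
    vec128 = utilization(count, width, 128)  # code that only uses 128-bit vectors
    scalar = utilization(count, width, 64)   # one scalar double per register
    print(f"{name}: {count} registers, "
          f"128-bit vectors fill {vec128:.0%}, scalar doubles fill {scalar:.0%}")
```

Under these assumptions NEON keeps twice as many values live as AVX in both cases, which is the extra headroom the register allocator gets.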
Cryptographic Support
A modern server has to do a lot of encryption and decryption. Most network connections should be encrypted, and it's increasingly common to encrypt the contents of the disk as a precaution against theft. Being able to implement common encryption algorithms efficiently is important, but for the best performance and power usage it's even better to have them implemented in hardware.
Recent AMD and Intel chips provide custom instructions for implementing AES encryption. This design reduces the CPU cost of AES encryption and decryption to around a tenth of its cost in a pure software implementation.
ARMv8 goes one step further, providing SHA-1 and SHA-256 instructions. They keep a running hash in one of the 128-bit SIMD registers and allow 128 bits of input data to be processed in a single instruction.
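The running-hash idea can be sketched in software with Python's standard hashlib module. This is only an analogy: the hash object plays the role of the state held in a SIMD register, and the 16-byte step is chosen to mirror the 128 bits of input mentioned above, not taken from the instruction definitions. The function name sha256_incremental is invented for the example.

```python
import hashlib

STEP_BYTES = 16  # 128 bits of input per step, mirroring the figure in the text

def sha256_incremental(data, step=STEP_BYTES):
    """Hash `data` a piece at a time, carrying the running state between steps."""
    running = hashlib.sha256()  # holds the running hash state
    for offset in range(0, len(data), step):
        running.update(data[offset:offset + step])
    return running.hexdigest()

message = b"ARMv8 adds SHA-1 and SHA-256 instructions." * 4

# Feeding the input in small pieces gives the same digest as hashing it at once.
print(sha256_incremental(message) == hashlib.sha256(message).hexdigest())
```

Because the state is carried from one step to the next, the input never has to be held in one piece, which is the property that makes a per-instruction hardware update practical.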