Debugging Managed Heap Corruptions
A heap corruption is best defined as a bug that violates the integrity of the heap and causes strange behaviors to occur in an application. The symptoms of a heap corruption are vast and can range from subtle, random behaviors to a flat-out crash that stops an application in its tracks. For example, consider an application that has an object whose state controls the frequency with which work items are pulled from a queue. If a thread inadvertently changes the frequency by corrupting the memory of the object, work items may be pulled off much more quickly than the system can handle or, conversely, may not be pulled off at all, causing processing delays. In a situation like this, tracking down the culprit can be difficult because the behavior is exhibited after the corruption has already taken place. In fact, when working with heap corruptions, the best-case scenario is a crash that happens as close to the source of the corruption as possible, eliminating the need for a lot of painful backtracking to work out how the heap ended up being corrupted in the first place.
Due to the subtle nature of heap corruption symptoms, it is also one of the trickiest categories of bugs to debug. To begin with, what causes a heap corruption to occur? Generally speaking, there are probably as many different causes for heap corruptions as there are symptoms, but one very common cause is that of not properly managing the memory that the application owns. Problems such as reuse after free, dangling pointers, buffer overruns, and so on can all be possible heap corruption culprits. The good news is that the CLR eliminates many of these problems by effectively managing the memory on the application's behalf. For example, reuse after free is no longer possible because an object isn't collected while rooted, buffer overruns are trapped and surfaced as an exception, and dangling pointers are not easily achieved. Although the CLR very effectively eliminates a lot of the heap corruption culprits, it does so only when the code runs within the confines of the managed execution environment. Often, it is necessary for a managed code application to call into native code and pass data to the native API. The second that the code transitions into the native world, the data that reside on the managed heap and are passed to the native code are no longer under the protection of the CLR and can cause all sorts of problems unless carefully managed before making the transition. For example, buffer overruns are no longer trapped and the compacting nature of the GC can cause pointers to become stale. The managed to native code interaction is one of the biggest heap corruption culprits in the managed world.
In this part of the chapter, we will look at an example of an application that suffers from a heap corruption. Listing 5-7 illustrates the application's source code.
Listing 5-7. Example of an application that suffers from a heap corruption
using System;
using System.Text;
using System.Runtime.InteropServices;

namespace Advanced.NET.Debugging.Chapter5
{
    class Heap
    {
        static void Main(string[] args)
        {
            Heap h = new Heap();
            h.Run();
        }

        public void Run()
        {
            byte[] b = new byte[50];
            for (int i = 0; i < 50; i++)
                b[i] = 15;

            Console.WriteLine("Press any key to invoke native method");
            Console.ReadKey();
            InitBuffer(b, 50);
            Console.WriteLine("Press any key to exit");
            Console.ReadKey();
        }

        [DllImport("05Native.dll")]
        static extern void InitBuffer(byte[] buffer, int size);
    }
}
The source code and binary for Listing 5-7 can be found in the following folders:
- Source code: C:\ADND\Chapter5\Heap
- Binary: C:\ADNDBin\05Heap.exe and C:\ADNDBin\05Native.dll
Note that to better illustrate the debug session, the native source code is not shown. The application in Listing 5-7 allocates a byte array (50 elements) and calls into a native API to initialize the memory, passing in the byte array as well as the size of the array. If we run the application under the debugger, we can very quickly see that an access violation occurs:
...
...
...
Press any key to invoke native method
ModLoad: 71190000 711ab000   C:\ADNDBin\05Native.dll
ModLoad: 63f70000 64093000   C:\Windows\WinSxS\x86_microsoft.vc90.debugcrt_1fc8b3b9a1e18e3b_9.0.21022.8_none_96748342450f6aa2\MSVCR90D.dll
(1b00.26e4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=77767574 ebx=00000001 ecx=01c659a4 edx=01c66ad8 esi=01c66868 edi=00000017
eip=7936ab16 esp=0031edac ebp=00000017 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206
*** WARNING: Unable to verify checksum for C:\Windows\assembly\NativeImages_v2.0.50727_32\mscorlib\5b3e3b0551bcaa722c27dbb089c431e4\mscorlib.ni.dll
mscorlib_ni+0x2aab16:
7936ab16 ff90a4000000 call dword ptr [eax+0A4h] ds:0023:77767618=????????
0:000> !ClrStack
OS Thread Id: 0x26e4 (0)
ESP      EIP
0031edac 7936ab16 System.IO.StreamWriter.Flush(Boolean, Boolean)
0031edcc 7936b287 System.IO.StreamWriter.Write(Char[], Int32, Int32)
0031edec 7936b121 System.IO.TextWriter.WriteLine(System.String)
0031ee04 7936b036 System.IO.TextWriter+SyncTextWriter.WriteLine(System.String)
0031ee10 793e9d86 System.Console.WriteLine(System.String)
0031ee1c 00810171 Advanced.NET.Debugging.Chapter5.Heap.Run()
0031ee48 008100a7 Advanced.NET.Debugging.Chapter5.Heap.Main(System.String[])
0031f068 79e7c74b [GCFrame: 0031f068]
What is interesting about the access violation is the stack trace of the offending thread. It looks like the access violation occurred while making our second call to the Console.WriteLine method (right after our call to the native InitBuffer API). Even if we assume that a heap corruption is taking place, why is it failing in some seemingly random place in the code base? Again, it is important to remember that a heap corruption rarely breaks at the point of the corruption; rather, it breaks at some seemingly random place later in the execution flow. This would certainly qualify as random because we certainly do not expect a call to Console.WriteLine to ever fail with an access violation. Armed with the knowledge that an access violation has occurred and that the access violation occurred in a rather strange part of the execution flow, we can now theorize that we have a possible heap corruption on our hands. The big question is, how do we verify our theory? Remember our earlier definition of a heap corruption: a violation of the integrity of the heap. If we can walk all objects on the heap, and verify the validity of each object, we can say for sure whether the integrity has been violated. Although it's possible to walk the entire managed heap by hand, it is a time-consuming process to say the least. Fortunately, the SOS VerifyHeap command automates this process for us. The VerifyHeap command walks the entire managed heap, validating each object along the way, and reports the results of the validation. If we run the command in our debug session, we can see the following:
0:000> !VerifyHeap
-verify will only produce output if there are errors in the heap
object 01c65968: does not have valid MT
curr_object : 01c65968
Last good object: 01c65928
----------------
object 02c61010: bad member 01c65968 at 02c61084
object 02c61010: bad member 01c65984 at 02c6109c
object 02c61010: bad member 01c659fc at 02c61444
object 02c61010: bad member 01c659e4 at 02c61448
object 02c61010: bad member 01c659f0 at 02c6144c
object 02c61010: bad member 01c659c8 at 02c6158c
curr_object : 02c61010
Last good object: 02c61000
----------------
In the preceding output, we can see that there seems to be a number of problems with our managed heap. More specifically, the first error encountered seems to be with the object located at address 0x01c65968 not having a valid MT (method table). We can easily verify this by hand by dumping out the contents of that address using the dd command:
0:000> dd 01c65968 l1
01c65968  3b3a3938
0:000> dd 3b3a3938 l1
3b3a3938  ????????
The method table of the object located at address 0x01c65968 seems to be 0x3b3a3938, which furthermore is shown to be an invalid address. At this point, we know we are working with a corrupted heap starting with an object at address 0x01c65968, but what we don't know yet is how it got corrupted. A useful technique in situations like this is to investigate objects surrounding the corrupted memory area. For example, what does the previous object look like? The output of VerifyHeap shows the address of the last good object to be 0x01c65928. If we dump out the contents of that object, we can see the following:
0:000> !do 01c65928
Name: System.Byte[]
MethodTable: 7912dae8
EEClass: 7912dba0
Size: 62(0x3e) bytes
Array: Rank 1, Number of elements 50, Type Byte
Element Type: System.Byte
Fields:
None
0:000> !objsize 01c65928
sizeof(01c65928) = 64 ( 0x40) bytes (System.Byte[])
The object in question appears to be a byte array with 50 elements, which looks very similar to the byte array that we created in our application. Furthermore, because the do command is able to display the details of the object, the object's metadata seems to be structurally intact. Note that the objsize command was used to get the total size of the object, including its members (64 bytes). The next interesting piece of information to look at is the contents of the array itself. We can use the dd command to display the entire object in raw memory form:
0:000> dd 01c65928
01c65928  7912dae8 00000032 03020100 07060504
01c65938  0b0a0908 0f0e0d0c 13121110 17161514
01c65948  1b1a1918 1f1e1d1c 23222120 27262524
01c65958  2b2a2928 2f2e2d2c 33323130 37363534
01c65968  3b3a3938 3f3e3d3c 43424140 47464544
01c65978  4b4a4948 4f4e4d4c 53525150 57565554
01c65988  5b5a5958 5f5e5d5c 63626160 67666564
01c65998  6b6a6968 6f6e6d6c 73727170 77767574
In the output, we can see that the 64 bytes that the object occupies begin with the method table indicating the type of the array, followed by the number of elements in the array (0x32, or 50), followed by the array contents. The next object begins at address 0x01c65968 (0x01c65928, the starting address of the object, plus 0x40, its total size). If we look at the contents of the last good object (0x01c65928), we can see that the array contains incrementing values (0x00, 0x01, 0x02, and so on). Furthermore, when the end of the last good object is reached, we still see the progression of incrementing values spilling over into what is considered the next object on the heap (0x01c65968). This observation yields a very important clue as to what may be happening. If a write to the object at address 0x01c65928 were allowed to continue past the end of the object boundary, it would corrupt the next object on the heap. Figure 5-12 illustrates the scenario.
Figure 5-12 Managed heap corruption
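To make the address arithmetic concrete, the following C sketch reproduces two calculations from the dump: where the object after the byte array is expected to begin, and which dword a little-endian dd dump shows for a run of sequential byte values. The function names are ours, and the 4-byte object alignment is an assumption about the 32-bit CLR that is consistent with do reporting 62 bytes while objsize reports 64.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: objects on the 32-bit managed heap are aligned to 4 bytes. */
#define OBJECT_ALIGNMENT 4u

/* Round a raw object size up to the next multiple of the alignment. */
uint32_t align_object_size(uint32_t raw_size)
{
    return (raw_size + (OBJECT_ALIGNMENT - 1u)) & ~(OBJECT_ALIGNMENT - 1u);
}

/* Address at which the object following the one at `start` should begin. */
uint32_t next_object_address(uint32_t start, uint32_t raw_size)
{
    assert(raw_size > 0u);
    return start + align_object_size(raw_size);
}

/* The dword a little-endian dd dump shows when four consecutive byte
   values, beginning with first_byte, sit at ascending addresses. */
uint32_t dword_from_sequential_bytes(uint8_t first_byte)
{
    uint32_t value = 0;
    for (unsigned i = 0; i < 4; i++) {
        value |= (uint32_t)(uint8_t)(first_byte + i) << (8u * i);
    }
    return value;
}
```

With the values from our debug session, next_object_address(0x01c65928, 62) evaluates to 0x01c65968, and dword_from_sequential_bytes(0x38) evaluates to 0x3b3a3938, the bogus method table reported by VerifyHeap. Because the array data starts 8 bytes into the object, the value 0x38 written at array index 56 is exactly what lands on the first byte of the neighboring object.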
At this point, we have a pretty good understanding of the data shown to us in the debugger. By code reviewing the parts of the application that manipulate our byte array, we can see that when we pass the byte array to the native InitBuffer API the function does not respect the boundaries of the object and writes past the end of the object, causing the subsequent object on the heap to become corrupted (as output by the VerifyHeap command).
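Because the native source is not shown, the following C sketch is only a guess at the kind of bug that would produce the pattern we saw, not the actual contents of 05Native.dll. It models two adjacent objects inside an ordinary byte buffer (so the overrun stays within memory we own) and compares a hypothetical initializer that ignores its size argument with one that honors it. All names, the write count of 120, and the neighbor's method table value are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEADER_BYTES      8u     /* method table + element count */
#define OBJECT_BYTES      0x40u  /* total size of the 50-element array object */
#define ARRAY_LENGTH      50
#define BUGGY_WRITE_COUNT 120    /* assumed; the real count is unknown */
#define PLACEHOLDER_MT    0x7912dae8u /* stand-in value for both objects */

typedef void (*init_fn)(uint8_t *buffer, int size);

/* Hypothetical buggy initializer: the caller's size is never consulted. */
void init_buffer_buggy(uint8_t *buffer, int size)
{
    assert(buffer != NULL);
    (void)size;
    for (int i = 0; i < BUGGY_WRITE_COUNT; i++) {
        buffer[i] = (uint8_t)i;
    }
}

/* Same initializer, but bounded by the size the caller passed in. */
void init_buffer_checked(uint8_t *buffer, int size)
{
    assert(buffer != NULL && size >= 0);
    for (int i = 0; i < size; i++) {
        buffer[i] = (uint8_t)i;
    }
}

uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

void write_u32_le(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Lay out two adjacent objects, run the initializer on the first one's
   array data, and return the second object's method table afterwards. */
uint32_t next_method_table_after(init_fn init)
{
    uint8_t heap[256];
    memset(heap, 0, sizeof heap);
    write_u32_le(heap, PLACEHOLDER_MT);                /* first object */
    write_u32_le(heap + 4, ARRAY_LENGTH);
    write_u32_le(heap + OBJECT_BYTES, PLACEHOLDER_MT); /* neighbor */
    init(heap + HEADER_BYTES, ARRAY_LENGTH);
    return read_u32_le(heap + OBJECT_BYTES);
}
```

In this model, next_method_table_after(init_buffer_checked) leaves the neighbor's method table at its original value, whereas next_method_table_after(init_buffer_buggy) returns 0x3b3a3938, the same invalid method table that VerifyHeap reported for the object at 0x01c65968.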
There is one additional piece of information that was displayed by the VerifyHeap command earlier:
object 02c61010: bad member 01c65968 at 02c61084
object 02c61010: bad member 01c65984 at 02c6109c
object 02c61010: bad member 01c659fc at 02c61444
object 02c61010: bad member 01c659e4 at 02c61448
object 02c61010: bad member 01c659f0 at 02c6144c
object 02c61010: bad member 01c659c8 at 02c6158c
curr_object : 02c61010
Last good object: 02c61000
VerifyHeap is telling us that there exists an object located at address 0x02c61010 that contains a member that references the corrupted object starting at address 0x01c65968. As a matter of fact, there are multiple lines stating that the same object is referencing a number of different members of the corrupted object at various addresses (0x01c65968, 0x01c65984, 0x01c659fc, etc). In essence, VerifyHeap not only tells us which object is corrupted, but any other object on any of the heaps that references the corrupt object will also be displayed.
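The general idea of walking a heap and validating each object can be pictured with a toy model. The C sketch below is not how SOS implements VerifyHeap; it assumes a simplified heap in which every object starts with a 4-byte method table value and every known type has a fixed size. The walker stops at the first object whose method table is unknown and reports it together with the last object that validated. The types, names, and layout are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy type descriptor: a method table value and a fixed object size. */
typedef struct {
    uint32_t method_table;
    uint32_t object_bytes;
} type_info;

typedef struct {
    int    corrupt;          /* nonzero if a bad object was found */
    size_t bad_offset;       /* offset of the first bad object */
    size_t last_good_offset; /* offset of the last object that validated */
} verify_result;

/* Look up a method table value; returns NULL if it is not a known type. */
const type_info *find_type(const type_info *types, size_t count, uint32_t mt)
{
    for (size_t i = 0; i < count; i++) {
        if (types[i].method_table == mt) {
            return &types[i];
        }
    }
    return NULL;
}

/* Walk the toy heap object by object, validating each method table. */
verify_result verify_heap(const uint8_t *heap, size_t heap_bytes,
                          const type_info *types, size_t type_count)
{
    verify_result result = {0, 0, 0};
    size_t offset = 0;

    while (offset + sizeof(uint32_t) <= heap_bytes) {
        uint32_t mt;
        memcpy(&mt, heap + offset, sizeof mt);

        const type_info *type = find_type(types, type_count, mt);
        if (type == NULL || type->object_bytes == 0 ||
            type->object_bytes > heap_bytes - offset) {
            result.corrupt = 1;
            result.bad_offset = offset;
            return result;
        }
        result.last_good_offset = offset;
        offset += type->object_bytes; /* step to the next object */
    }
    return result;
}
```

If a 64-byte array object at offset 0 is followed by an object whose first four bytes have been overwritten with 0x3b3a3938, the walker reports a bad object at offset 0x40 and a last good object at offset 0, which mirrors the "does not have valid MT" and "Last good object" lines in the VerifyHeap output.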
The sample application we used to demonstrate how the managed heap can become corrupted was based on using the interoperability services to invoke native code. Depending on how the heap is corrupted by the native code, as well as the timing of garbage collections, there may not be any signs of a heap corruption being present until much later after the native code has already done the damage, making it difficult to backtrack to the source of the problem. To aid in this troubleshooting process, an MDA was added called the gcUnmanagedToManaged MDA. Essentially, the MDA aims at reducing the time gap between when the corruption actually occurs in native code and when the next GC occurs. The way this is accomplished is by forcing a garbage collection when the interoperability call transitions back from unmanaged to managed code, thereby pinpointing the problem much earlier in the process. Let's enable the MDA (please see Chapter 1, "Introduction to the Tools" on how to enable MDAs) and rerun our sample application under the debugger to see if we can trap the heap corruption earlier:
...
...
...
Press any key to invoke native method
ModLoad: 71190000 711ab000   C:\ADNDBin\05Native.dll
ModLoad: 63f70000 64093000   C:\Windows\WinSxS\x86_microsoft.vc90.debugcrt_1fc8b3b9a1e18e3b_9.0.21022.8_none_96748342450f6aa2\MSVCR90D.dll
(19d8.258c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=3b3a3938 ebx=02d81010 ecx=00960184 edx=01d8598c esi=00020000 edi=00001000
eip=79f66846 esp=0025ec54 ebp=0025ec74 iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010202
mscorwks!WKS::gc_heap::mark_object_simple+0x16c:
79f66846 0fb708 movzx ecx,word ptr [eax] ds:0023:3b3a3938=????
0:000> k
ChildEBP RetAddr
0025ec74 79f66932 mscorwks!WKS::gc_heap::mark_object_simple+0x16c
0025ec88 79fbc552 mscorwks!WKS::GCHeap::Promote+0x8d
0025eca0 79fbc3c9 mscorwks!PinObject+0x10
0025ecc4 79fc37b9 mscorwks!ScanConsecutiveHandlesWithoutUserData+0x26
0025ece4 79fba942 mscorwks!BlockScanBlocksWithoutUserData+0x26
0025ed08 79fba917 mscorwks!SegmentScanByTypeMap+0x55
0025ed60 79fba807 mscorwks!TableScanHandles+0x65
0025edc8 79fbb9a2 mscorwks!HndScanHandlesForGC+0x10d
0025ee0c 79fbaaf8 mscorwks!Ref_TracePinningRoots+0x6c
0025ee30 79f669f6 mscorwks!CNameSpace::GcScanHandles+0x60
0025ee70 79f65d57 mscorwks!WKS::gc_heap::mark_phase+0xae
0025ee94 79f6614c mscorwks!WKS::gc_heap::gc1+0x62
0025eea8 79f65f5d mscorwks!WKS::gc_heap::garbage_collect+0x261
0025eed4 79f6dfa1 mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x1a9
0025eee4 79f6df4b mscorwks!WKS::GCHeap::GarbageCollectTry+0x2d
0025ef04 7a0aea3d mscorwks!WKS::GCHeap::GarbageCollect+0x67
0025ef8c 7a12addd mscorwks!MdaGcUnmanagedToManaged::TriggerGC+0xa7
0025f020 79e7c74b mscorwks!FireMdaGcUnmanagedToManaged+0x3b
0025f030 79e7c6cc mscorwks!CallDescrWorker+0x33
0025f0b0 79e7c8e1 mscorwks!CallDescrWorkerWithHandler+0xa3
0:000> !ClrStack
OS Thread Id: 0x258c (0)
ESP      EIP
0025efdc 79f66846 [NDirectMethodFrameStandalone: 0025efdc] Advanced.NET.Debugging.Chapter5.Heap.InitBuffer(Byte[], Int32)
0025efec 00a80165 Advanced.NET.Debugging.Chapter5.Heap.Run()
0025f018 00a800a7 Advanced.NET.Debugging.Chapter5.Heap.Main(System.String[])
0025f240 79e7c74b [GCFrame: 0025f240]
We can see here that the native stack trace that caused the access violation looks a lot different than our earlier stack trace. It now looks like we are hitting the problem during a garbage collection. Where in our managed code flow did the garbage collection occur? If we look at the managed code stack trace, we can see that we now get the access violation during our call to the native InitBuffer API.
If you ever suspect that a heap corruption might be taking place due to a native API invocation, enabling the gcUnmanagedToManaged MDA can save a ton of debugging time.