Windows Hardware Error Architecture Visited!

Anyone who debugs Windows and analyzes plenty of the dump files it generates will be familiar with WHEA, the Windows Hardware Error Architecture, also known by its bug check code, 0x124.
 
This is the most common bug check I encounter when dealing with hardware problems on a machine. In this post, I will try to explain the Windows Hardware Error Architecture from scratch.
 
Bug Check Name – WHEA_UNCORRECTABLE_ERROR 
Bug Check Code – 0x124 
 
So, let's get started with a bit of history, shall we? WHEA was introduced in Windows Vista. Before Windows Vista, the operating system supported several unrelated mechanisms for reporting hardware errors, and these mechanisms provided little support for error recovery. For uncorrected errors, the operating system simply generated a bug check and then recorded some of the available error information in the system event log after the system was restarted.
 
So what was the problem, given that Windows already had mechanisms for hardware error reporting? As noted above, the dump file that Windows writes during a bug check contained very little information about the hardware error, which made it hard to debug. Debugging a modern operating system is already alien to most people, so you can imagine what it was like without WHEA. When the dump file revealed nothing, the only approach left was the caveman one: swapping and replacing each part until the crashes stopped. This wastes a lot of time, and money as well when the hardware involved is costly.

The reporting was also inconsistent. Suppose two device drivers are running on the system. Without WHEA, one driver could report that it is unable to process IRPs (these will be discussed in an upcoming post), while the other, failing in the same way, could report some entirely different error.

The introduction of WHEA enabled Windows to coordinate with the different hardware components and provide a common error reporting mechanism, so that identifying the problem, if there is one, becomes easier.

Below, I detail the process of debugging a crash dump with bug check 0x124.

Bug check 0x124 generally occurs when there is a problem with the processor (this is what I have mostly seen), with other hardware (which I have seen only once) or with some low-level driver (thanks to [Usasma](http://www.sysnative.com/forums/members/usasma.html) & [Jared](http://www.sysnative.com/forums/members/jared.html)). The problem can also arise if you are overclocking, in other words pushing your hardware components beyond their rated operating conditions. Overheating can also cause this error.

Now, let's get started with the analysis!

On opening a dump file with bug check code 0x124, we get the following information from WinDbg (my favourite debugger):
 
*Use !analyze -v to get detailed debugging information. *
*BugCheck 124, {0, ffffe000ef74a028, be200000, 5110a} *
*Probably caused by : GenuineIntel *
*Followup: MachineOwner *
 
The first argument stored in the dump file is 0, which means that a fatal Machine Check Exception (MCE) occurred. An MCE is raised by the processor itself (for example, Intel and AMD x86/x64 processors) when it detects a hardware error that it cannot correct. The second parameter is the address of the WHEA_ERROR_RECORD structure in which WHEA stores the information about the error. The third and fourth parameters are the high-order and low-order 32 bits of the MCi_STATUS value of the Machine Check Architecture (MCA) bank that reported the error; here they combine to 0xbe2000000005110a. MCA is the processor's mechanism for reporting hardware errors, or rather announcing them so that WHEA can catch them. WinDbg is also kind enough to let us know that we are dealing with a machine that has an Intel processor (see the "Probably caused by" line).

If we run !errrec followed by the second parameter, we get the output below:

*0: kd> !errrec ffffe000ef74a028*
*===============================================================================*
*Common Platform Error Record @ ffffe000ef74a028*
*-------------------------------------------------------------------------------*
*Record Id     : 01cfd62d6c522bc4*
*Severity      : Fatal (1)*
*Length        : 928*
*Creator       : Microsoft*
*Notify Type   : Machine Check Exception*
*Timestamp     : 9/22/2014 19:09:02 (UTC)*
*Flags         : 0x00000000*

*===============================================================================*
*Section 0     : Processor Generic*
*-------------------------------------------------------------------------------*
*Descriptor    @ ffffe000ef74a0a8*
*Section       @ ffffe000ef74a180*
*Offset        : 344*
*Length        : 192*
*Flags         : 0x00000001 Primary*
*Severity      : Fatal*

*Proc. Type    : x86/x64*
*Instr. Set    : x64*
*Error Type    : Cache error*
*Operation     : Generic*
*Flags         : 0x00*
*Level         : 2*
*CPU Version   : 0x00000000000206a7*
*Processor ID  : 0x0000000000000000*

*===============================================================================*
*Section 1     : x86/x64 Processor Specific*
*-------------------------------------------------------------------------------*
*Descriptor    @ ffffe000ef74a0f0*
*Section       @ ffffe000ef74a240*
*Offset        : 536*
*Length        : 128*
*Flags         : 0x00000000*
*Severity      : Fatal*

*Local APIC Id : 0x0000000000000000*
*CPU Id        : a7 06 02 00 00 08 10 00 - bf e3 9a 1f ff fb eb bf*
*00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00*
*00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00*

*Proc. Info 0  @ ffffe000ef74a240*

*===============================================================================*
*Section 2     : x86/x64 MCA*
*-------------------------------------------------------------------------------*
*Descriptor    @ ffffe000ef74a138*
*Section       @ ffffe000ef74a2c0*
*Offset        : 664*
*Length        : 264*
*Flags         : 0x00000000*
*Severity      : Fatal*

*Error         : GCACHEL2_ERR_ERR (Proc 0 Bank 5)*
*Status        : 0xbe2000000005110a*
*Address       : 0x000000015c0e4640*
*Misc.         : 0x000000d080020086*
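
The Status value in Section 2 is the third and fourth bug check parameters joined into one 64-bit number. The snippet below is a minimal Python sketch of that join, plus a decode of the top flag bits. The bit positions are taken from the IA32_MCi_STATUS layout in the Intel Software Developer's Manual; the function names `combine_status` and `decode_status` are made up for this illustration and are not part of WinDbg or any library.

```python
# Flag bits of the 64-bit MCi_STATUS value (positions per the Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # the register holds valid error information
    62: "OVER",   # an earlier error was overwritten by this one
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # error reporting was enabled for this bank
    59: "MISCV",  # the bank's MISC register holds extra information
    58: "ADDRV",  # the bank's ADDR register holds an address
    57: "PCC",    # the processor context may be corrupted
}


def combine_status(high32: int, low32: int) -> int:
    """Join bug check parameters 3 (high) and 4 (low) into one 64-bit value."""
    return ((high32 & 0xFFFFFFFF) << 32) | (low32 & 0xFFFFFFFF)


def decode_status(status: int) -> dict:
    """Split a status value into its set flags and its two 16-bit error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


# Parameters 3 and 4 from the BugCheck line in this post.
status = combine_status(0xBE200000, 0x5110A)
print(hex(status))  # 0xbe2000000005110a
print(decode_status(status))
```

For this dump the set flags are VAL, UC, EN, MISCV, ADDRV and PCC, which fits a fatal, uncorrected error, and the low 16 bits (0x110a) are the MCA error code that the debugger turns into the name on the Error line.
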
As we can see, the Error line in Section 2 (x86/x64 MCA) reads GCACHEL2_ERR_ERR (Proc 0 Bank 5), which means that there has been an error in the L2 cache. Here, Bank refers to an MCA bank: a set of machine check registers (control, status, address and misc) that the processor dedicates to one hardware unit, rather than a division of the cache memory itself. In this case the error was logged in bank 5 of processor 0. So now we know what is at fault. If you are debugging these errors, though, I would suggest collecting more dump files and comparing their error codes. If the error codes are similar, use Prime95 to stress test the CPU. If you don't know what to do next or have a question, do not hesitate to drop me a mail using the contact form!
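
To make the error name less of a black box, here is a rough Python sketch of how a name like GCACHEL2_ERR_ERR can be derived from the low 16 bits of the Status value, and how the names from several dumps could be tallied. It assumes the "cache hierarchy" compound error code layout from the Intel Software Developer's Manual (000F 0001 RRRR TTLL). The helpers `cache_error_name` and `tally` are hypothetical names for this illustration, and this is not how WinDbg itself is implemented.

```python
from collections import Counter

# Field encodings for cache hierarchy error codes (per the Intel SDM).
TRANSACTION_TYPE = {0b00: "I", 0b01: "D", 0b10: "G"}  # instruction, data, generic
CACHE_LEVEL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}
REQUEST = {
    0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
    5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP",
}


def cache_error_name(status: int):
    """Return a name such as 'GCACHEL2_ERR_ERR', or None if the MCA error
    code in the status value is not a cache hierarchy error."""
    code = status & 0xFFFF
    # Ignore bit 12 (the F bit); bits 15:8 must otherwise read 0000 0001.
    if (code & 0xEF00) != 0x0100:
        return None
    request = (code >> 4) & 0xF
    transaction = (code >> 2) & 0x3
    level = code & 0x3
    if request not in REQUEST or transaction not in TRANSACTION_TYPE:
        return None
    return f"{TRANSACTION_TYPE[transaction]}CACHE{CACHE_LEVEL[level]}_{REQUEST[request]}_ERR"


def tally(statuses):
    """Count how often each error name appears across several dumps."""
    return Counter(cache_error_name(s) or "other" for s in statuses)


# Status value from the dump in this post.
print(cache_error_name(0xBE2000000005110A))  # GCACHEL2_ERR_ERR

# Hypothetical: pretend three separate dumps all reported the same status.
print(tally([0xBE2000000005110A] * 3))
```

If the tally is dominated by a single name, the dumps agree with each other, and a CPU stress test is a reasonable next step.
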

Pranav V Jituri


India
