The little virtual machine that is crashing Hyper-V on AMD

by The Server-Side Technology Staff, March 22, 2020

At VaiSulWeb, virtualization is so pervasive that since about 2008 or 2009 we have had basically no physical machines other than virtualization hosts and storage. That is more than 10 years now, and counting. Microsoft Hyper-V has served us very well, and over time our reliance on it has increased with happy results. It has been solid and consistent and has allowed us to scale up more and more, adding new technologies and solutions and virtualizing roles and workloads that once seemed difficult to virtualize.

Like many other providers, we recently started integrating AMD CPUs into our infrastructure, given the significant advantages they can bring to the datacenter, especially for generic or mixed workloads. Only a subset of our infrastructure has been migrated to the new AMD servers so far, but the Windows Server and Hyper-V platform made things easier, and we carefully started to fill those hosts by migrating workloads. The results have been very pleasing.

Performance improvements and security

As planned, we immediately started benefiting from performance improvements. Not only were CPU-bound tasks faster, but I/O performance was also terrific.

One of our goals was also to improve security by using AMD technologies that had shown better resilience to security issues than Intel chipsets and CPUs. We were also prepared to face a few limitations: for example, Hyper-V isolated containers are not supported on non-Intel CPUs and thus cannot be used.

We started migrating our workloads to the new servers, and Hyper-V features such as live migration and shared-nothing live migration made the process easy and straightforward. Within a few days about 80% of the target workloads had been migrated with minimal or no disruption. So far so good.

Unexpected crashes

Then something odd happened: one of the hosts started to crash. That is something we had rarely faced in our more than 10 years of experience with Hyper-V, but this host was crashing up to 2-3 times per day. It was usually back online in about 2 minutes with all of its VMs restarted, but the virtual machines it hosted obviously crashed with it, causing downtime. This was very surprising, since similar machines (basically identical, as they used the same components) showed no issues even after running for weeks in production, and even longer in our labs.

That machine had been running for days without issues, then started crashing 2-3 times per day with no traceable pattern. Sometimes it ran for 10 hours or more without issues; sometimes it crashed twice in 15 minutes. Weird. And scary.

We decided to halt our migration to make sure we had not missed an incompatibility between Windows Server 2019 and those AMD servers. Yet the other servers were not having any issues, and when we traced the history back, this specific machine had also run for days without problems before it started crashing so often.

Diagnosing the issues

The first thing you might want to do in such cases is to make sure that all drivers are up to date; after all, most crashes are due to faulty drivers. However, we really needed to understand what was causing the issue and, most of all, why we weren’t experiencing it on other similar or identical machines. We tried to reproduce the crashes on other servers: maybe they were related to a specific CPU load, to network traffic growing past a certain limit, or to the machine generating certain I/O levels. We weren’t able to detect any recurring pattern, and the other machines had no trouble at all. Still, we were pretty sure the issue was related to the AMD CPU, and our crash dumps seemed to confirm it:

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000007, BOOT Error
Arg2: ffffe283dfd13d58, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

fffff8021a6b2f08: Unable to get Flags value from nt!KdVersionBlock

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.Sec
    Value: 3

    Key  : Analysis.DebugAnalysisProvider.CPP
    Value: Create: 8007007e on DARTHVADER

    Key  : Analysis.DebugData
    Value: CreateObject

    Key  : Analysis.DebugModel
    Value: CreateObject

    Key  : Analysis.Elapsed.Sec
    Value: 3

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 66

    Key  : Analysis.System
    Value: CreateObject


ADDITIONAL_XML: 1

BUGCHECK_CODE:  124

BUGCHECK_P1: 7

BUGCHECK_P2: ffffe283dfd13d58

BUGCHECK_P3: 0

BUGCHECK_P4: 0

CUSTOMER_CRASH_COUNT:  1

PROCESS_NAME:  System

STACK_TEXT:  
fffffb0f`d8dd45a0 fffff802`1a82fff9 : ffffe283`dfe41080 ffffe283`dfd13d30 fffff802`1a6b75c0 fffff802`00000000 : nt!WheapCreateLiveTriageDump+0x7b
fffffb0f`d8dd4ad0 fffff802`1a5cee78 : ffffe283`dfd13d30 fffff802`1a2e7191 00000000`00000000 00000000`00000000 : nt!WheapCreateTriageDumpFromPreviousSession+0x2d
fffffb0f`d8dd4b00 fffff802`1a5cfb8b : fffff802`1a6b7560 fffff802`1a6b75c0 ffffe283`dffbf680 ffffee84`3b602ed0 : nt!WheapProcessWorkQueueItem+0x48
fffffb0f`d8dd4b40 fffff802`1a39d1fa : ffffee84`3b603150 ffffe283`dfe41080 fffff802`1a867700 ffffe283`00000000 : nt!WheapWorkQueueWorkerRoutine+0x2b
fffffb0f`d8dd4b70 fffff802`1a30a9c5 : ffffe283`dfe41080 ffffe283`df062040 ffffe283`dfe41080 ffffa98f`b9d70268 : nt!ExpWorkerThread+0x16a
fffffb0f`d8dd4c10 fffff802`1a46fdfc : ffff9900`07f80180 ffffe283`dfe41080 fffff802`1a30a970 00000000`00000000 : nt!PspSystemThreadStartup+0x55
fffffb0f`d8dd4c60 00000000`00000000 : fffffb0f`d8dd5000 fffffb0f`d8dcf000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x1c


MODULE_NAME: AuthenticAMD

IMAGE_NAME:  AuthenticAMD.sys

STACK_COMMAND:  .thread ; .cxr ; kb

FAILURE_BUCKET_ID:  0x124_7_AuthenticAMD_IMAGE_AuthenticAMD.sys

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {2c3a2bbf-cef7-9e2b-6149-e5d72c9a4da4}

Followup:     MachineOwner
---------

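The MODULE_NAME, IMAGE_NAME and FAILURE_BUCKET_ID lines all blame the AuthenticAMD module (image AuthenticAMD.sys). More specifically, our stop error was 0x124 with parameter 1 (the error source type) set to 0x7. That is reported as a BOOT error, which at first looks wrong for a host that was already running; the nt!WheapCreateTriageDumpFromPreviousSession frame in the stack suggests that the error record was carried over from the previous session and processed at the next boot.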
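When several hosts produce dumps, the handful of fields that matter in an analysis like the one above can be pulled out mechanically and compared. The following is a minimal Python sketch, not part of our actual tooling: the names `read_field` and `summarize_bugcheck` are made up for illustration, and the error source table is deliberately partial. Only the 0x7 entry is confirmed by the dump above; the other entries are taken from Microsoft's description of bug check 0x124 and should be verified against it.

```python
import re

# Partial table of parameter 1 values for bug check 0x124.
# Only 0x7 is confirmed by the dump in this article; verify the rest
# against Microsoft's bug check 0x124 documentation before relying on them.
WHEA_ERROR_SOURCES = {
    0x0: "Machine check exception",
    0x3: "Non-maskable interrupt",
    0x4: "Uncorrectable PCI Express error",
    0x7: "BOOT error",
}

# A few lines copied from the analysis output shown above.
SAMPLE_ANALYSIS = """\
BUGCHECK_CODE:  124

BUGCHECK_P1: 7

MODULE_NAME: AuthenticAMD

IMAGE_NAME:  AuthenticAMD.sys

FAILURE_BUCKET_ID:  0x124_7_AuthenticAMD_IMAGE_AuthenticAMD.sys
"""


def read_field(text, name):
    """Return the first token after 'NAME:' on its own line, or None."""
    match = re.search(rf"^{re.escape(name)}:[ \t]*(\S+)", text, re.MULTILINE)
    return match.group(1) if match else None


def summarize_bugcheck(text):
    """Collect the bug check code, error source and blamed image."""
    code = read_field(text, "BUGCHECK_CODE")
    p1 = read_field(text, "BUGCHECK_P1")
    source = None
    # Parameter 1 is only an error source type for bug check 0x124.
    if code is not None and p1 is not None and int(code, 16) == 0x124:
        source = WHEA_ERROR_SOURCES.get(int(p1, 16), "not in this partial table")
    return {
        "bugcheck": hex(int(code, 16)) if code is not None else None,
        "error_source": source,
        "image": read_field(text, "IMAGE_NAME"),
        "bucket": read_field(text, "FAILURE_BUCKET_ID"),
    }


summary = summarize_bugcheck(SAMPLE_ANALYSIS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Comparing the `bucket` value across dumps is a quick way to tell whether two hosts are hitting the same failure or merely similar-looking ones.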

As we were not able to reproduce the crash on other servers, we suspected that the machine was faulty. Still, we followed the standard rule of upgrading drivers and firmware, starting with the network card, which we suspected could be the source of the issue. Other similar or identical machines were running fine with the old drivers, but we performed that step anyway.

We weren’t surprised to discover that, while the issue was somewhat mitigated (the server was crashing less frequently), it had not been solved. We had migrated the most important machines out, but we still had services and workloads running on that host, so we decided to decommission it and replace it with an identical one. Before doing that, we ran some pretty standard hardware tests: memory, CPU, I/O and so forth. None of them reported any issue, which was a bit surprising.

We went ahead, decommissioned the machine and brought a new one in. We started moving the VMs that we had previously migrated to the defective machine, along with some other VMs that were running on it, to the new host. The new host ran fine, with no crashes at all, even once most of the intended machines had been migrated onto it. At this point we were pretty sure it was a hardware problem, though we had been unable to spot it.

Back to the drawing board

We were pretty sure the issue had been solved and proceeded to migrate the last few virtual machines to the new host when, all of a sudden, that host started crashing. Worse, it was crashing because of the very same problem and reporting the very same errors in its minidumps. Yet no other host experienced the problem.

Ironically, the “faulty” machine was no longer crashing, which of course rang all the bells pointing to a problem with one of the VMs. We activated canary machines on the formerly faulty server and tried to saturate its CPU, network and I/O to check whether it had really stopped crashing. It had.

Instead, the replacement machine was now crashing 2-3 times per day. Same old story, but the chances that two machines were both faulty were tiny, and neither of them had reported errors or crashes during the hardware tests, so we now had a hint about what to look for.

At first we looked at the biggest and busiest virtual machines: those transferring lots of data over the network or to disk, or those showing high CPU load. It turned out we were wrong, because the crashes continued. By then we already suspected that the problem was not a virtual machine per se but running that specific virtual machine on an AMD CPU, so we brought in a new Intel server and started to move some of those machines onto it, carefully, one by one.
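The one-by-one approach can be written down as a simple loop. This is only a sketch of the procedure, with the two actions passed in as callbacks: `move_to_control_host` and `suspect_host_stayed_up` are hypothetical placeholders for whatever tooling performs the live migration and watches the AMD host, and the VM names in the simulation are invented.

```python
from typing import Callable, Iterable, Optional


def find_culprit_vm(
    candidates: Iterable[str],
    move_to_control_host: Callable[[str], None],
    suspect_host_stayed_up: Callable[[], bool],
) -> Optional[str]:
    """Move VMs off the crashing host one at a time.

    Returns the first VM whose removal is followed by a crash-free
    observation window on the suspect host, or None if the host keeps
    crashing after every candidate has been moved.
    """
    for vm in candidates:
        move_to_control_host(vm)
        # The callback is expected to block for the whole observation
        # window and report whether the suspect host survived it.
        if suspect_host_stayed_up():
            return vm
    return None


# Simulation with invented VM names; no real hosts are touched.
vms_on_amd_host = {"db-large", "web-busy", "mail", "lb-linux32"}
hidden_culprit = "lb-linux32"


def simulated_move(vm: str) -> None:
    vms_on_amd_host.discard(vm)


def simulated_observation() -> bool:
    # The simulated host crashes for as long as the culprit is on it.
    return hidden_culprit not in vms_on_amd_host


# Biggest and busiest first, as we did in practice.
order = ["db-large", "web-busy", "mail", "lb-linux32"]
found = find_culprit_vm(order, simulated_move, simulated_observation)
print(found)
```

Moving a single VM per step is slow, but each quiet window then points at exactly one machine, which is what you want when the crashes are intermittent.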

That process took a few days, because after each move we had to wait and see whether the host was still crashing. To our surprise, the culprit was not a big or busy machine but a small 32-bit Linux virtual server that we were using very lightly as a load balancer for a couple of other Linux machines. Once we moved it to the Intel-based server, the crashes disappeared on the AMD host and did not occur on the Intel host either, confirming our suspicions.
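Why each step takes so long can be illustrated with a rough calculation. The sketch below assumes that crashes arrive independently at a steady average rate (a Poisson process), which is an assumption for illustration and not something we measured; 2.5 crashes per day is simply the midpoint of the "2-3 times per day" we observed. Under that model, the probability of seeing no crash during a window of t hours while the culprit is still on the host is exp(-rate * t / 24).

```python
import math


def prob_quiet_by_chance(crashes_per_day: float, window_hours: float) -> float:
    """Probability of zero crashes in the window if nothing was fixed."""
    return math.exp(-crashes_per_day * window_hours / 24.0)


def hours_needed(crashes_per_day: float, max_false_clear: float) -> float:
    """Window length so that a quiet window is a false all-clear with at
    most the given probability."""
    return -math.log(max_false_clear) * 24.0 / crashes_per_day


rate = 2.5  # assumed average crashes per day
for hours in (10, 24, 48):
    chance = prob_quiet_by_chance(rate, hours)
    print(f"{hours:>2} h quiet by pure chance: {chance:.1%}")

print(f"window for a 1% false all-clear: {hours_needed(rate, 0.01):.0f} h")
```

With these assumed numbers, a quiet 10-hour stretch would happen by chance roughly a third of the time, and it takes about 44 hours without a crash before the chance of a false all-clear drops to 1%. That matches our experience that the host could run 10 hours or more between crashes, and it is why a handful of moves added up to days.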

Unusual and somewhat… scary!

We were quite surprised by the results. While we were happy to have solved the mystery, it was weird that a single virtual machine could crash the entire host. Somewhat scary, if you will.

In our past experience with Hyper-V, we had never seen a single VM, not faulty in any way, crash the entire host because of some kind of incompatibility with AMD hardware. Moreover, this was not a big or busy machine: it was just running a 32-bit Linux guest OS under very light load.

Hyper-V has been solid and consistent for us and instrumental to our growth, yet we now know that such issues can happen and that they can be entirely related to the guest OS. We are happy with our choice, and those AMD servers are delivering amazing performance that some of our customers noticed instantly, but we need to watch for issues that we had not faced in years. Happy but vigilant, I guess!
