What could possibly go wrong when migrating a VM?

30 October 2024
  • Software Engineering

By Doug Szumski (Senior Technical Lead at StackHPC) and Ross Martyn (Cloud Engineering Manager at G-Research)

If there were an award for the most understated homepage, the distinguished French software engineer Fabrice Bellard would surely win it.

Somewhere in his list of projects, between a PC emulator in JavaScript and the renowned FFmpeg, sits the entry: “QEMU is a generic machine emulator and virtualizer”.

Despite that inauspicious description, it’s no exaggeration to say that QEMU, used in conjunction with KVM, is a staple of modern cloud computing. Together with a layer of abstraction provided by Libvirt, QEMU/KVM is at the core of the most popular hypervisor driver in OpenStack, helping to underpin millions of virtualised cores around the globe.[1]

Virtualised vs bare-metal

So why choose virtualised compute over bare metal? This may seem an odd question for a high-performance research environment, but it turns out not to be a binary choice: the degree of virtualisation can be finely tuned, all the way from a fully software-emulated machine to a high-performance behemoth using the full gamut of hardware acceleration.

Virtual machines are also popular with users. They can reduce cost through efficient utilisation of hardware. They don’t have long self-test routines, and at the click of a button you can boot any operating system you like from a high-performance network filesystem. It is little wonder that if you provide the option of virtualised compute, your hypervisors will become a hive of activity.

But what about maintenance?

It is one thing to ask a user to upgrade their fleet, but how do you refresh the operating system or firmware of the hypervisors underneath? What happens if you want to power down a rack for maintenance, move it away from a roof leak, or patch a new CVE the security team has found? Your users will demand that you don’t pull the rug out from under them.

A key part of the solution to maintaining secure cloud infrastructure is migration. Simply make an exact replica of a few billion virtual transistors on another hypervisor, copy across the contents of memory and local storage, and a gratuitous ARP (GARP) or two later the VM can be running on a host on the other side of the world.

Almost unbelievably, this largely works, and although it can be complicated by hardware acceleration, with a bit of care even things like virtual GPUs can be moved.[2]


So what can go wrong in practice?

Here we list some issues that we have fixed – alongside StackHPC – in a busy OpenStack deployment:

1. Live migration unexpectedly fails when migrating large, busy VMs

This issue was particularly tricky. Large VMs, typically with more RAM than your laptop has disk space, would abort a live migration attempt just as it was about to complete. Fortunately, the migration would fail back to the source hypervisor, leaving the user none the wiser.

When we attempted to reproduce the issue, the migration always went smoothly – until we realised that the failure only occurred when the VM was under heavy load.
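For anyone chasing a similar failure, the key ingredient turned out to be a high rate of memory churn inside the guest while the migration was in flight. The exact tooling doesn’t matter; a minimal Python sketch of the kind of load generator we have in mind (run inside the VM, with the size adjusted to taste) is:

    import time

    # Keep rewriting a large buffer so that the guest's dirty-page rate stays
    # high while the live migration is running. SIZE_GIB is illustrative;
    # pick something comparable to the VM's working set.
    SIZE_GIB = 4
    PAGE = 4096
    buf = bytearray(SIZE_GIB * 1024 ** 3)

    while True:
        # Touching one byte per page is enough to mark the whole page dirty.
        for offset in range(0, len(buf), PAGE):
            buf[offset] = (buf[offset] + 1) % 256
        time.sleep(0.1)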

Now that we could reproduce it, we tried upgrading to the latest qemu-kvm release. This didn’t help, but resulted in a slightly different failure scenario, with a mention of the TLS connection being terminated and steam emerging from our ears.

Armed with the knowledge that TLS may somehow be involved, we turned it off for testing. Suddenly the migrations were working again and aside from the lack of encryption, everything was running smoothly. But what was it about transferring a large amount of data over a TLS connection that could cause the failure?

Zooming in on the TLS connection, we were using the most recent version of the protocol, TLS 1.3, but not the most recent release of the library. We tried upgrading and that didn’t help, so we tried downgrading. That didn’t help either, so we got the big hammer out and downgraded the protocol to TLS 1.2. Migrations started succeeding and it felt like we were finally on to something.

Scouring release notes, we stumbled on Nikos Mavrogiannopoulos’ blog. As one of the developers of GnuTLS, he had written a comprehensive post on the new features arriving with TLS 1.3. The section on re-keying stood out:

“Under TLS 1.3 applications can re-key with a very simple mechanism which involves the transmission of a single message. Furthermore, GnuTLS handles re-key transparently and every GnuTLS client and server will automatically re-key after 2^24 messages are exchanged, unless the GNUTLS_NO_AUTO_REKEY flag is specified in gnutls_init(), or the cipher’s security properties requires no re-keying as in the CHACHA20-POLY1305 cipher.”

Assuming the maximum record size of 16KB, a re-key event would be expected every couple of hundred gigabytes. This tied in nicely with the fact that we only saw the migration issues when moving a comparable amount of data.
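As a back-of-the-envelope check (assuming every TLS record is the full 16 KiB):

    # 2^24 records of 16 KiB each pass before GnuTLS automatically re-keys.
    records_before_rekey = 2 ** 24
    record_size_bytes = 16 * 1024
    print(records_before_rekey * record_size_bytes / 1024 ** 3)  # 256.0 GiB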

Checking the QEMU source, we came across a bug report for a nearly identical issue, confirming our suspicions about the handling of AUTO_REKEY. [3]

Meanwhile, the pressure to move the large VMs was mounting. What could be done that didn’t involve disrupting the users? Even if we recompiled QEMU with AUTO_REKEY disabled, we couldn’t just swap it out: the qemu-kvm processes were already live and running.

The solution turned out to be simple.

A neat feature of GnuTLS is that, despite being compiled into an application as a library, it supports reading a crypto-policy from a configuration file.[4] Since this was a Kolla Ansible deployment, qemu-kvm was confined to a container, so we could simply set the system-wide crypto-policy in the container to use TLS 1.2 without affecting anything else. Even better, we could still choose a secure protocol while avoiding the REKEY events. But would the policy take effect on a running VM?
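The override itself is small. As a sketch of the kind of setting involved (the file location, and whether it is expressed via the distribution crypto-policy or GnuTLS’s own configuration file, depend on the build):

    # System-wide GnuTLS override (path and exact mechanism vary by
    # distribution and GnuTLS build). Removing TLS 1.3 from the default
    # priority string forces a TLS 1.2 session, and with it the pre-1.3
    # re-keying behaviour.
    [overrides]
    default-priority-string = NORMAL:-VERS-TLS1.3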

Another neat feature of TLS is version negotiation. During the handshake process, an agreement is made on the cryptographic algorithm and version of the protocol to use. [5] This allows older applications to communicate with newer ones, and vice versa.

Taking advantage of this, we rolled out containers using the TLS 1.2 crypto policy to some fresh hypervisors. We then live-migrated the large VMs to these hypervisors. A TLS 1.2 session was mutually agreed and as if by magic, the VMs migrated successfully.

2. Surprise changes to Libvirt XML when live migrating from CentOS 7 to CentOS 8-based hypervisors

The process of migrating a VM has no room for error. The copy operation is not just about the system memory, the state of the CPU and the storage, but also about peripherals: the USB bus, the network interface and the memory card reader that you never used.

A single bit-flip in any of these can lead to immediate doom. So when the Linux kernel increased MAX_TAP_QUEUES from 8 to 256, you can guess what happened on machines with large numbers of vCPUs:

  • The vNIC queue count went through the roof
  • Nova spat out “Internal Migration failure: qemu unexpectedly closed the monitor”
  • And the migration was toast [6][7]

The solution here was to ensure that Nova preserved the queue count when writing out the machine XML on the destination hypervisor.
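For context, the queue count lives in the interface definition of the libvirt domain XML and must be identical on both ends of the migration. A simplified, illustrative fragment:

    <interface type='bridge'>
      <model type='virtio'/>
      <!-- The queue count the VM was originally started with must be carried
           over verbatim; letting the destination recalculate it from the new
           MAX_TAP_QUEUES limit is what broke the migration. -->
      <driver name='vhost' queues='8'/>
    </interface>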

3. A two-part catastrophe with an XFS-formatted ephemeral volume

A diligent operator was moving a VM when all of a sudden it failed back to the source hypervisor. Was it reproducible? The operator gave it another go, and lo and behold it worked. At least it appeared to have worked. The VM started fine on the destination and the guest operating system was running sweetly, but what happened to the ephemeral volume? The contents had vaporised, with an empty block device in its place.

Further analysis of the logs revealed some bizarre errors about the filesystem label exceeding 12 characters. An XFS filesystem had already been created on the source host, so why the error when cloning the volume to the destination?

This one turned out to be an OpenStack Nova bug, where Nova was inexplicably elongating the label on the destination. For filesystems other than XFS, the issue was subtler: the label change would only show up when the volume was remounted, for example after a reboot. You can imagine the hair-pulling scenarios.
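XFS caps volume labels at 12 characters, which is why the lengthened label only blew up when the filesystem was recreated on the destination. A trivial guard of the sort one might wish had existed (the helper is hypothetical, not Nova code):

    # Hypothetical guard: XFS volume labels may be at most 12 characters.
    XFS_MAX_LABEL_LEN = 12

    def validate_xfs_label(label: str) -> str:
        if len(label) > XFS_MAX_LABEL_LEN:
            raise ValueError(
                f"XFS label '{label}' is {len(label)} characters long; "
                f"the maximum is {XFS_MAX_LABEL_LEN}"
            )
        return label

    validate_xfs_label("ephemeral0")  # a 10-character label passes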

4. Copying data from the source to destination hypervisor fails during pre-migration

This is OpenStack Nova bug 1939869 and is a relatively simple failure to understand.

Before Nova initiates the live migration of a VM, it first moves some local state associated with the VM to the destination hypervisor. Due to the way that the root filesystem image is layered, this may include the image used to deploy the VM (if it is no longer in Glance), and a config drive (if used).
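Concretely, this is the instance directory on the source hypervisor. For the libvirt driver with qcow2 images it looks roughly like this (simplified; paths vary with configuration):

    /var/lib/nova/instances/<instance uuid>/
        disk          # qcow2 overlay, backed by an image in _base/
        disk.config   # config drive, if one was requested
        console.log
    /var/lib/nova/instances/_base/
        <image hash>  # flattened copy of the Glance image backing 'disk'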

In this case, to isolate migration traffic, a dedicated live migration network had been defined via the Nova config option live_migration_inbound_addr.
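The option lives in the [libvirt] section of nova.conf on each hypervisor; the address below is purely illustrative:

    [libvirt]
    # Address (or hostname) on the dedicated live-migration network that
    # other hypervisors should connect to when migrating VMs to this host.
    live_migration_inbound_addr = 192.0.2.10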

However, the pre-migration code was attempting to perform the state transfer using the hostname of the source hypervisor, which resolved to a separate management network. To complicate matters, the remote copy was initiated from the destination hypervisor, and there was no way to retrieve the IP address of the source hypervisor on the dedicated migration network.

The proposed fix is to include the source hypervisor IP in the migration data object, which is then transferred by RPC to the destination hypervisor.
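Conceptually, the source hypervisor advertises its migration-network address as part of the migration data exchanged over RPC, and the destination uses that instead of resolving the source hostname. A rough sketch of the idea (the names are illustrative and do not match Nova’s actual implementation):

    from dataclasses import dataclass

    # Illustrative only: field and function names do not match Nova's code.
    @dataclass
    class LiveMigrateData:
        # Populated on the source from live_migration_inbound_addr and sent
        # to the destination over RPC with the rest of the migration data.
        src_migration_addr: str

    def remote_source(migrate_data: LiveMigrateData, path: str) -> str:
        # The destination builds its copy command from the advertised address
        # on the migration network, rather than from the source hypervisor's
        # hostname (which resolved to the management network).
        return f"{migrate_data.src_migration_addr}:{path}"

    print(remote_source(LiveMigrateData("192.0.2.10"),
                        "/var/lib/nova/instances/<instance uuid>/disk"))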

Could these issues have been avoided?

At this stage we will take a step back and reflect on the origin of these issues, and how they might have been avoided before they affected end users.

The first issue arises from the complex interaction between two independent software components: GnuTLS and QEMU. Both have highly specialised purposes; the former is a library providing secure communications and the latter is an emulator. It makes complete sense that these are distinct. The problem in this case appears to arise from a new feature, automatic re-keying, which was turned on by default. There is nothing wrong with this; quite the opposite. The motivation is to improve the baseline security level by reducing the chance of a cryptographic cipher becoming compromised during the transfer of a large amount of data.

The second issue also originates from a software update, this time to the Linux kernel. High-performance folk running chunky VMs would have rejoiced at the lifting of the eight-queue limit. If you have hundreds of cores contending to send or receive data, a 1:1 mapping of CPUs to TX/RX queues can deliver a huge performance advantage.

The issue was caused by how the uplift in the queue limit was handled. It’s fine to upgrade the virtual hardware in new VMs, or cold migrate existing workloads to the new world, but live soldering at the sub-device level is a no-no.

The third issue is an obscure bug. Not everyone uses ephemeral storage and not everyone would realise if an ephemeral filesystem label changed. The combined probability of both events is low enough that no-one appears to have noticed.

However, a new level of brokenness occurred when the file system used for ephemeral storage was changed from the default.

The moral here is not to stray from the default path unless there is good reason. This also applies at some level to the fourth issue: not everyone uses a dedicated network for live migration traffic, but there are good reasons to do so, particularly in the days when it was more difficult to configure TLS everywhere, or if you want to use QoS.

So how could we have prevented these issues affecting users?

We know that defects exist in all software, and so long as the fans are whirring, new challenges will continue to arrive.

The primary defence against new bugs landing in production is the pinning of the software stack at every level: a new release of a package or dependency cannot go live without some human intervention. Coupled with some supporting services, Kolla Ansible is at the core of this.[8] However, to remain secure and state-of-the-art, cloud services must stay current.
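In Kolla Ansible terms, much of that pinning comes down to fixing the container image tags in globals.yml. A sketch (variable names differ a little between Kolla Ansible releases, so treat this as illustrative):

    # globals.yml (illustrative)
    openstack_release: "2023.1"
    # Per-service tag overrides can hold an individual component back, or
    # roll one forward for testing, without moving the rest of the stack.
    nova_tag: "2023.1-patched"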

Requirements are continually evolving, updates are a necessity and the only way to gain trust in a new release is to thoroughly test it.

At G-Research, this has long been recognised, but the challenge is hard and continually evolving. It is important for test developers to better understand user requirements. The task of providing a secure and reliable cloud computing platform is interminable.
