Concerns with Xen PVH / HVMLite boot on Linux x86
I've been helping a bit with streamlining proper upstream support for Xen on x86 Linux. One of the items I have decided to take on is the so-called "dead code" concern, in theory present on x86 Linux Xen guests, in large part due to the radical way in which old PV Xen x86 Linux guests boot. This topic is a bit complex, so I had previously written two posts about it to help shed some light into these dark corners of the technical Linux universe that only a few really care about.
Xen has evolved over the years, but so has hardware to help with virtualization. Some say and believe KVM is a much better platform for virtualization than Xen since KVM didn't have to deal with the lack of hardware virtualization support. To a certain degree, part of this is true -- the KVM design has an upper hand in that it never had to carry the legacy complexity of running on hardware without virtualization support. If you follow the money in terms of investment, you will notice Moshe Bar, who had co-founded XenSource (later acquired by Citrix), then also co-founded Qumranet (later acquired by Red Hat), which was the main company originally behind KVM. In these regards KVM is a natural architectural evolution over Xen. Despite the technical leap forward, this is not to say that KVM is simply better, that KVM cannot possibly have dead code, or that Xen could not do better. There may be less dead code in KVM on the Linux kernel, but in analyzing how dead code comes about I've come to the realization that dead code should be a generic concern all around; the Xen design just exacerbated the concern and took the situation to a whole new level. As it turns out there is also a shit ton of dead code possible in qemu... so perhaps some is saved on KVM, but qemu still has to address this very same problem. This is also not to say that KVM does not paravirtualize. Quite the contrary, it has also had to learn from the Xen design -- so it has a paravirtualized clock and devices, but it doesn't have a paravirtualized interface for timers and interrupts; it uses an emulated APIC, and so you end up with qemu as a requirement for KVM. As hardware virtualization features evolved, Xen has obviously had to provide support for them as well. This has led to the complex paravirtualization spectrum described best in this page. The "sweet spot" for paravirtualization has then evolved over the years, and the latest proposal on the Xen front is called HVMLite.
A previous incarnation of this is called the Xen PVH design, but this old incarnation is going to be ripped out of the Linux kernel completely as it never really took off for production. HVMLite is the proper replacement, but to avoid complexities with branding the same old name, PVH, will be used. From here on I refer to PVH as the new shiny HVMLite design, not the old code present in the kernel as of the Linux v4.8 days. What interested me the most about the new PVH design was its proposed alternative boot protocol, which should hopefully address most of the concerns folks had with the old legacy PV design. Xen PVH will also not use qemu. With these two things in mind, from one perspective one could actually argue that Xen PVH guests may suffer from less possible dead code than KVM guests. The rest of this post covers some basics of this new PVH design with a focus on the boot protocol, a bit of the evolution of the Linux x86 boot protocol, and where we might be going. I am really writing this mostly for my own note taking and future reference, and only secondarily in the hope it may be useful to others.
I've been told by Xen maintainers that the PVH ABI boot protocol apparently was settled long ago... As someone new to this world, this came as a huge surprise to me given I was not aware of any Linux x86 maintainer having done a thorough evaluation of it. Most importantly, if it were an agreed upon, acceptable and reasonable protocol, this should have been reflected by the fact that those who likely had the biggest concerns over Xen's old boot protocol would have been fans of the new design. That's at least the litmus test I would have used if I had tried to handle a technical revamp. Unfortunately, as I spoke to different folks, I got the impression most x86 folks simply either had completely given up on Xen or were completely unaware of this new PVH design. The given up part here is a bit serious and worrisome. Some folks can give two shits over what goes into Xen to the extent that folks are OK with them merging anything so long as it does not interfere or regress Linux in any way shape or form. This lost cause attitude has a bit of history, and the PV design I mentioned above is to blame for some of this attitude -- the Xen PV design interfered with and regressed Linux often enough that it became a burden. The danger in taking a laissez-faire attitude with Xen in Linux is that we are then simply not doing our best, and in doing so users can suffer, and you can then only count on the Xen community to fix things. This... perhaps is the way it should be -- however it also implies we may not be learning anything from this other than a fear of such intrusive technologies in Linux. I believe there is quite a bit to learn from this experience, and there are things we can do better. This latter part is the emphasis of my post given that, as I'll explain below, I've also partly given up. There are benefits from taking a proactive approach here, and Xen is not the only one that could benefit from this.
It sounds counterintuitive, but helping Xen with a clean boot design is not just about addressing a cleaner boot protocol for Xen alone. For instance, consider the loose semantics sprinkled over the kernel for guests, which even ended up in a few device drivers -- paravirt_enabled() was one which, thanks to some recent efforts by a few, is now long gone. This sort of stupid epidemic is not Xen specific -- even KVM has had its own hacks. For instance an audio driver had an "inside_vm" hack for guests; when trying to look for an alternative I was told no possible solution existed, when in fact only 4 days later a completely sensible replacement was found. Clean, well understood semantics for guests are needed early in boot. We should not allow nasty hacks for virtualization in the kernel; understanding why these hacks creep up and finding proper solutions for them is extremely important. Helping review Xen's boot design should help us all avoid seeing cruft land in the kernel long term. It should also pave the way for supporting new radical technologies and architectures using a well streamlined boot protocol.
Let's review the new PVH boot protocol. The last patch set proposal to add PVH to Linux added yet-another-entry-point (TM) by annotating it as an ELF note; this entry was Xen PVH specific. It had some asm code, and finally, it copied boot params and then handed things off to Linux. I was a bit perplexed: I had looked so much into the flaws of the previous PV boot design that I was super paranoid any new entry was simply doomed to be a disaster, so naturally I was extremely suspicious from the very beginning, despite the amount of delta being small and it still using startup_32() and startup_64(). These have become de-facto entry points, grub2 and kexec use them, so another thing using them seems fair. However I learned both that:
- Linux Xen ARM guests use Linux' EFI entry to boot when on Xen
- Windows guests will rely on Windows' EFI entry to boot when on Xen
Naturally, my own first observation was to wonder why we can't use EFI to boot x86 Linux on Xen as well. There are a few reasons for this, but perhaps the best summary of the situation comes from Matt Fleming, the Linux kernel's EFI maintainer:
"Everyone has a different use case in mind and no one has all of them in mind"
Regular guests are known as domU guests. Guests with special privileges are known in Xen as dom0. So if you boot into Xen and then into a Linux control guest OS, that guest is dom0, and you can then spawn domU guests from dom0.
The first obvious concern over exclusively using EFI is that, contrary to Windows, Linux needs to support dom0, so then hypercalls would need to talk to EFI. Xen supports dom0 on Linux ARM guests though, but in that case, as George Dunlap clarified to me, it relies on the native ARM entry path (as used by uboot) and relies completely on device tree for hardware information. x86 Linux supports device tree, and has used it on some odd x86 hardware, however there are assumptions made about what type of hardware is present. ACPI can and should be used for ironing out discrepancies, however it remains unclear if this would suffice to support all cases required for x86 Linux guests when supporting dom0.
For domU guests an EFI emulation would need to be provided by Xen somehow. But if Windows requires EFI this should be a shared concern. Upon review with Matt: if one wanted a minimal EFI environment, one could provide only the EFI services really needed. We'd also need some way to distinguish a bare metal boot from a PVH boot when using EFI; Matt has noted that using an EFI GUID seems to be one way to pass along the required semantics. If EFI was required for domUs though, that would mean Xen unikernels (Linux or not) would need to boot EFI. To be clear, unikernels can be Linux based as well: they consist of very slim kernels with a small ramdisk and have a single process running as init. George notes that in these cases even an extra megabyte of guest RAM and an extra second of boot time is a significant cost to incur on guests. He further notes that using OVMF (which would provide EFI) is an excellent solution for domUs when you boot a full Linux distribution, but that it would impose a significant cost on using Linux in unikernel-style VMs. This seems like a fair concern, however it's not a reason why Linux should not be able to use EFI. In fact, supporting booting Linux x86 with EFI using OVMF seems like a design goal for Xen; after all, that would also allow Xen to boot Windows guests without qemu to emulate devices, since OVMF will be able to access the PV devices until the PV drivers come around for Windows. Another concern over requiring EFI is that other open operating systems may not support EFI entry points (do NetBSD and FreeBSD not support EFI boot?). The biggest concerns then are the implications of using EFI for dom0, requiring it for small unikernel guests (Linux or not), and the lack of other guest OS support for EFI.
With regards to using EFI to boot Xen PVH -- the devil is in the details. Even if we go the EFI route there's a slight discrepancy between how Xen boots Linux and how Linux's first 5 pre-decompression x86 entry points work -- in particular Linux's EFI entry supports and requires decompression to be done as part of the kernel boot code. On the other hand the Xen hypervisor runs domU Linux guests just like any other regular userspace application: paging is enabled. Linux decompression runs in 32-bit mode with paging disabled, and the code relies on this. The hypervisor does not do the decompression for the domU guest; the toolstack does this, so in this regard the toolstack must support each decompression algorithm used by each supported guest. Also, some VT-x hardware can't run the real-mode code which makes up the 16-bit boot stub. The exception to this is when Xen boots dom0 Linux; in that case, as Andrew Cooper explains, "the hypervisor contains just enough domain builder code in .init to construct dom0, but this code is discarded before dom0 starts to execute". If one were to resolve the EFI boot issue for Linux, it would not only be useful for PVH: old HVM guests could use it as well, the only difference being that HVM guests would use qemu for legacy devices.
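To make the toolstack's burden a bit more concrete, here is a minimal sketch in C of how a toolstack-like loader might guess which decompressor a kernel payload needs by looking at its leading bytes. The names sniff_compression() and known_formats are hypothetical and not taken from Xen or Linux; the two-byte signatures are the commonly documented ones for each format, and a real loader would check more than two bytes before trusting the result.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: leading two bytes of common compressed payloads.
 * This is an illustration, not code from any real toolstack. */
struct magic_entry {
    unsigned char magic[2];
    const char *name;
};

static const struct magic_entry known_formats[] = {
    { { 0x1f, 0x8b }, "gzip"  },
    { { 0x42, 0x5a }, "bzip2" },   /* ASCII "BZ" */
    { { 0x5d, 0x00 }, "lzma"  },
    { { 0xfd, 0x37 }, "xz"    },
    { { 0x89, 0x4c }, "lzo"   },
    { { 0x02, 0x21 }, "lz4"   },   /* legacy lz4 frame */
};

/* Return the name of the format whose signature matches the start of buf,
 * or NULL when the payload is too short, unknown, or not compressed. */
const char *sniff_compression(const unsigned char *buf, size_t len)
{
    size_t i;

    if (buf == NULL || len < 2)
        return NULL;

    for (i = 0; i < sizeof(known_formats) / sizeof(known_formats[0]); i++) {
        if (memcmp(buf, known_formats[i].magic, 2) == 0)
            return known_formats[i].name;
    }
    return NULL;
}
```

Every row in that table is one more decompressor somebody outside the guest kernel has to ship and maintain, which is the cost being described here.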
Can these issues be resolved though? For instance, can we add a decompression algorithm type that simply skips the decompression? Additionally -- even if these are the reasons to have this new boot method used by Xen for the new PVH -- has this really been fully vetted by everyone? Are there really no issues with it? One concern expressed by Alexander Graf recently was that without a boot loader (grub2) you lose the ability to boot from an older btrfs snapshot. Directly booting in this light is a bad idea.
It turns out though that if you want to boot Xen you rely on the Multiboot protocol, originally put out by the FSF long ago. The last proposed new PVH boot patches had borrowed ideas from Multiboot to add an entry to Linux, only it was Xen'ified. What would be Multiboot 2 seemed flexible enough to allow all sorts of custom semantics and information stacked into a boot image. The last thought I had over this topic (before giving up) was -- if we're going to add yet-another-entry (TM) why not extend Multiboot 2 with the semantics we need to boot any virtual environment and then add a Multiboot 2 entry to Linux? In fact, could such work help unify boot entries across architectures long term in Linux? Is a single unified Linux entry possible?
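To give a feel for why Multiboot 2 looks flexible enough for this, here is a rough sketch in C of checking a Multiboot 2 header and walking its tags. The header layout (magic 0xE85250D6, four 32-bit fields summing to zero, tags padded to 8 bytes, an end tag of type 0) follows my reading of the Multiboot 2 specification. The function names are mine, the code assumes a little-endian host, and the idea of a dedicated "virtual environment" tag is purely hypothetical; no such tag exists in the specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MB2_HEADER_MAGIC 0xE85250D6u
#define MB2_TAG_END      0

struct mb2_header {
    uint32_t magic;
    uint32_t architecture;
    uint32_t header_length;   /* header plus all tags, in bytes */
    uint32_t checksum;
};

struct mb2_tag {
    uint16_t type;
    uint16_t flags;
    uint32_t size;            /* includes this 8-byte tag header */
};

/* The magic must match and the four fields must sum to 0 modulo 2^32. */
int mb2_header_valid(const struct mb2_header *h)
{
    uint32_t sum = h->magic + h->architecture + h->header_length + h->checksum;
    return h->magic == MB2_HEADER_MAGIC && sum == 0;
}

/* Return the byte offset of the first tag of type `wanted`, or -1. */
long mb2_find_tag(const unsigned char *buf, size_t len, uint16_t wanted)
{
    struct mb2_header h;
    size_t off = sizeof(h);

    if (len < sizeof(h))
        return -1;
    memcpy(&h, buf, sizeof(h));
    if (!mb2_header_valid(&h) || h.header_length > len)
        return -1;

    while (off + sizeof(struct mb2_tag) <= h.header_length) {
        struct mb2_tag t;

        memcpy(&t, buf + off, sizeof(t));
        if (t.size < sizeof(t) || t.size > h.header_length - off)
            break;                                  /* malformed tag */
        if (t.type == wanted)
            return (long)off;
        if (t.type == MB2_TAG_END)
            break;
        off += ((size_t)t.size + 7u) & ~(size_t)7u; /* tags are 8-byte aligned */
    }
    return -1;
}
```

A loader that does not recognize a tag type can step over it using the size field, which is what would let extra semantics for virtual environments be stacked into the same image without breaking existing boot loaders.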
Using EFI seems to require work and a proof of concept; is there an alternative? For instance -- Alexander Graf wonders why the 32-bit entry points can't be used directly. We would need a PV IO description table -- could merging what we need into ACPI tables suffice to address concerns? Again, this gets into semantics, as we'd still need to find out whether whoever entered the entry point is a Xen PVH guest or not so we can set up the boot parameters accordingly. One option, for instance, is to use CPUID, however the CPUID instruction only arrived with the Pentium (and some late i486 models), so this would fail on older i486 CPUs. Jürgen has noted, however, that we could probably just detect CPUID support first, and thus avoid the invalid opcode.
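As a small illustration of the CPUID option, here is a sketch in C of the signature check only. Hypervisors conventionally report a 12-byte vendor string in EBX, ECX and EDX at CPUID leaf 0x40000000, and Xen's string is "XenVMMXenVMM". Actually executing CPUID, and first checking that it exists (classically by testing whether the ID bit, bit 21, of EFLAGS can be toggled), is left out so the example stays portable; the function names are my own.

```c
#include <stdint.h>
#include <string.h>

/* Rebuild the 12-character hypervisor signature from the register values
 * returned by the hypervisor CPUID leaf. Each register carries four
 * characters, lowest byte first. `out` must hold at least 13 bytes. */
void hv_signature(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
    uint32_t regs[3];
    int i, j;

    regs[0] = ebx;
    regs[1] = ecx;
    regs[2] = edx;

    for (i = 0; i < 3; i++)
        for (j = 0; j < 4; j++)
            out[i * 4 + j] = (char)((regs[i] >> (8 * j)) & 0xff);
    out[12] = '\0';
}

/* Return 1 when the registers spell out Xen's signature, 0 otherwise. */
int is_xen_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[13];

    hv_signature(ebx, ecx, edx, sig);
    return strcmp(sig, "XenVMMXenVMM") == 0;
}
```

An early entry path would still need its own equivalent of this in asm or very early C, and a signature match alone only says "running on Xen"; telling a PVH guest apart from an HVM guest would need further information.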
In the end talk is cheap. So we need to see code. But hopefully this summarizes enough to understand the issues on both sides. Good luck!