Tuesday, December 22, 2015

Linux asynchronous probe - let's try this again


Updated on 2016-01-19 with a description of how systemd limits the number of devices on a Linux system, and with references to asynchronous work on memory initialization.

Hipster and trendy init systems want to boot really fast. As of v4.2 the Linux kernel sports asynchronous probe support (this fix posted December 19, 2015 is needed to use the generic async_probe module parameter). This isn't the first time this type of work has been attempted on Linux, though: this lwn article claims that a long time ago some folks tried to enable asynchronous probe and that it was ultimately reverted due to a large number of issues. Among other things, one major difference with the new solution is that it's opt-in: userspace or drivers must specifically request that it be used on a driver. We also support blacklisting asynchronous behavior by annotating that a driver requires synchronous probe. All this enables new shiny hipster userspace while remaining compatible with old userspace and its expectations. At the 2015 Kernel Summit it became apparent a few folks still had questions over this, so I decided to write this post to recap why the work was done, its caveats, its relationship to returning -EPROBE_DEFER from your probe routine to use the kernel's deferred probe mechanism, to help with testing and productizing with asynchronous probe, and to explain a bit of the short term and long term road-map. This post also collects a bit of the history that gave rise to Linux asynchronous probe, which I think we can use as a small educational experience on how we can better evolve systemd in the community.


First, to be clear -- asynchronous probe isn't supposed to magically make your kernel boot faster; it should however help if you happen to have a driver which for whatever reason does a lot of work in its probe routine. Even when that's not the case, using asynchronous probe can at times shave down kernel boot time, even if minimally. Other times it may have no impact at all, or you may even see a small increase for any number of reasons. A clear but not obvious gain is an increase in the number of devices a device driver can support; this is explained below. Since this is a new feature we simply don't have enough metrics and test coverage yet to determine how widely helpful it can be, or what issues could creep up, but it was clear some folks wanted and needed it. More importantly, using it can also get driver developers and subsystem maintainers thinking about different asynchronous behavior considerations in the kernel, which should help us in the community long term. One example: although asynchronous probe should help with long probes, we recently determined that you should by no means consider it a solution if your driver needs to load firmware at probe time and you have experienced races between that and the filesystem being mounted -- that problem needs to be resolved separately (see this firmware_class feature enhancement wiki and this common kernel file loader wiki page for more details and ideas). In lieu of concrete bulletproof solutions for that problem you might be tempted to think asynchronous probe could help, and it might, but it is not a rock solid solution to such problems; it'd be a hack, and this is why it's incorrect to use asynchronous probe to fix that problem.
Another example is how this begs the question of where else we should be using asynchronous mechanisms, and how we resolve any possible run time dependency issues.

Asynchronous probe support was added for a few reasons, the last three listed here being the major driving factors for getting this developed and merged upstream.
  • Over time there's been a general interest in reducing the kernel's boot time
  • A long time ago in a galaxy far far away... systemd made a really well-intentioned but ultimately incorrect assumption that device driver initialization should take less than 30 seconds; to be more specific, that a driver's init routine should not take more than 30 seconds. Even as issues started to creep up, quite a few systemd and kernel developers vocalized strong support for this being a reasonable timeout value. Some users were really upset over it though -- driver loading was being killed after 30 seconds, preventing some drivers from loading completely, and in the worst cases, if the driver at fault was a storage driver, you would not even be able to boot Linux. Because of the strong agreement in both camps there were no exceptions to this rule, and the consensus seemed to be that a lot of drivers should simply be fixed. One puzzle was that issues over drivers being killed due to the timeout were only reported circa 2014, even though the timeout had been in place in systemd for a long time. The reason is that commit 786235eeba0 by Tetsuo Handa ("kthread: make kthread_create() killable") enabled kthread_create() to be killed; this was done in particular to enable the out of memory killer to kill these types of threads (refer to this lwn article for more details). Prior to this kernel change, the 30 second timeout was never an issue for systemd users given that the SIGKILL signal was never actually respected for these types of threads. Even though the Linux kernel now has asynchronous probe support, the original systemd 30 second timeout caused enough headaches for users that on July 29, 2014 Hannes Reinecke ended up merging a way for Linux distributions to override the timeout through the command line; refer to systemd commit 9719859c07aa13 ("udevd: add --event-timeout commandline option").
That didn't seem to be enough to help users, so on August 30, 2014 Kay Sievers bumped the timeout to 60 seconds via systemd commit 2e92633dbae ("udev: bump event timeout to 60 seconds"). In the end though, on September 10, 2014 Tom Gundersen changed the default timeout to 180 seconds via systemd commit b5338a19864a ("udev: timeout - increase timeout"); the purpose of the timeout, as per the commit log message, is now "to make sure that nothing stays around forever". To help capture possible faulty drivers (or any dispatched jobs) in logs, Tom Gundersen also made systemd spit out a warning after 1/3 of the timeout value before killing via systemd commit 671174136525ddf2 ("udev: timeout - warn after a third of the timeout before killing").
  • It turns out though that... Linux batches calling a driver's init routine and, immediately after that, its probe routine, synchronously, so naturally any delays in probe contribute to delays as well. The systemd timeout is therefore in effect for the combined run time of both init and probe of a device driver. If we provide a way for userspace to ask the driver core to detach these and call probe asynchronously, we'd be giving systemd what it thought, and what a few kernel developers thought, was actually in place.
  • A delay in your probe means delaying the user experience at boot time. If you know off hand that your driver might take a while to load, preemptively annotating this on your driver can mean giving users a better experience. Dmitry Torokhov ran into this issue while productizing a solution for a popular company where fast boot and a good user experience were critical.
  • It turns out that... a systemd timeout on the kmod loader (loading modules) affects not only the combination of init + probe of a device driver; since the kernel serially probes all devices in the same code path, the amount of time taken to load your driver will be init time + (number of devices * probe time for each device). What this means is that the systemd timeout also places an upper bound on the number of devices you can use on a system, bound by the driver's init and probe time, which can be computed as follows:

    number_devices = systemd_timeout / (max known probe time for driver)

Drivers can be built into the kernel or built as modules, which you can load after the kernel boots as independent and self contained objects. It turns out that in practice striving towards having all modules probed asynchronously tends to work pretty well, whereas probing all built-in drivers asynchronously will likely crash your kernel with a high degree of certainty. This latter issue has to do with the fact that as the kernel boots, certain assumptions may be made which are not satisfied early on, and there's currently no easy way to order this well. It's similar to why the deferred probe mechanism was added to the kernel -- sometimes the kernel doesn't have dependency information well sorted out. But fret not, future work should help with this, and such work should help curtail uses of deferred probing and enable broader asynchronous probe use.

Blood Wolves: Engineer

If you are in control of both hardware and software, that is, you have engineers you can pay to productize a solution, you could likely engineer a solution to vet and ensure boot happens properly and in order for both built-in drivers and modules on your kernel. There is no easy way to do this, and it is difficult to estimate the amount of work required for a given device, but if you want to try it -- you can use this out of tree debug-async patch and then use the kernel parameters documented there, summarized below. Note that using either of these will taint your kernel.

  • __DEBUG__kernel_force_builtin_async_probe - async probe all built-in drivers
  • __DEBUG__kernel_force_modules_async_probe - async probe all modules
If you don't have the luxury of dedicated hardware and software engineers, you could at the very least enable all modules to probe asynchronously, hope for the best, and report any issues found. It's after all what systemd, and a lot of developers (many kernel developers included), originally thought was happening, so naturally bug reports are welcomed by the driver maintainer if any issues occur. Soon you may see Linux distributions enabling asynchronous probe by default for all modules. The way I'd implement this in systemd is to let a Linux distribution opt in to enable async_probe for specific kernels; given that a fix is needed to use the generic async_probe module parameter, one should only enable it if that fix has been merged. This makes it tricky to detect whether the module parameter is properly supported: enabling it and booting an older kernel might obviously cause a crash.

Getting drivers to load correctly is just one step; remember that prior to asynchronous probe some userspace expected device functionality to be available immediately after loading a driver. With asynchronous probe that is no longer the case: userspace must be vetted and tested to ensure it does not rely on synchronous loading of drivers.


If you're a driver developer and know that your driver takes a while to load, be aware that it can delay boot and the user experience, so you should consider annotating in the driver's source that it prefers asynchronous probe. You can do so as follows:

static struct pci_driver foo_pci_driver = {
      ...
      .driver.probe_type = PROBE_PREFER_ASYNCHRONOUS,
      ...
};

An alternative (provided you have this fix merged) is to pass the generic "async_probe" module parameter to the module you want to load, for instance:

modprobe cxgb4 async_probe

Broken Family

Sadly a few drivers cannot work with asynchronous probe at all today, so if after testing it poops out you should annotate this sort of hard incompatibility. You can do so as follows:

static struct pci_driver foo_pci_driver = {
       ...
       .driver.probe_type = PROBE_FORCE_SYNCHRONOUS,
       ...
};

It should be made clear that this sort of incompatibility should likely be seen as an issue -- if your driver fails when using asynchronous probe, chances are there is some subtle architectural design flaw in the driver or its dependencies. Fixing it may not be easy, and it's precisely for this reason that we have such a flag to force synchronous probe. Our hope though is that with time we can phase these issues out.


Even if we manage to get all drivers working with asynchronous probe we cannot remove synchronous probe, as old userspace exists which relies on such behavior; removing synchronous probe support would break old userspace. What we can strive for long term though is to enable new userspace as best as possible and deal with asynchronous issues as they come up, slowly -- this will take time and serious effort. Over time you should be seeing more work in this area across subsystems, internals, and perhaps even architecture work. To give you a taste of this type of work, you can review the recent asynchronous work on memory initialization by Mel Gorman through commits 1e8ce83cd17fd0f549a7ad145ddd2bfcdd7dfe37..0e1cc95b4cc7293bb7b39175035e7f7e45c90977; please note these also have a few follow-on fixes. Lastly, obviously some systemd design decisions should be taken with a grain of salt, but they seem to be very well-intentioned; we could use a bit more open and objective communication and design review between kernel developers and systemd developers. The smoother this gets, the smoother the experience we provide to users should be.

Monday, December 14, 2015

Xen and the x86 Linux zero page

This is part II, for part I - refer to "Avoiding dead code: pv_ops is not the silver bullet".

On x86 Linux the boot sequence is rather complicated, so much so that it has its own dedicated boot protocol, documented upstream in Documentation/x86/boot.txt. The protocol tends to evolve as the x86 architecture evolves, in order to compensate for new features or extensions which we need to learn about at boot time. Of interest to this post is the "zero page". The first step when loading a Linux kernel is to load the "zero page"; this consists of the structure struct boot_params, defined in arch/x86/include/uapi/asm/bootparam.h. It's called the zero page because, unless you're relocating data around, it is the first physical page of the operating system. The x86 boot protocol originally only had to support a 16-bit boot protocol; to do this it first required loading the real-mode code (boot sector and setup code). For modern bootloaders what needs to be loaded is a bit larger, but new bootloaders must still load the same original real-mode code. The struct boot_params accounts for this evolution in requirements; the real-mode section is what is defined in struct setup_header. The zero page is not only something which we must load, it's also part of the actual bzImage we build on x86. One can therefore read a kernel file's struct boot_params to extract some details about the kernel. To try this you can play around with parse-bzimage, part of the table-init tree on github. All this sort of stuff is what bootloaders end up working with. Since hypervisors can also boot Linux they must somehow do the same. This post is about Xen's zero-page setup design; we'll contrast it with lguest's zero-page setup. lguest is a demo 32-bit hypervisor on Linux.

If a hypervisor boots Linux it must also set up the zero page. We'll dissect Xen's setup of the zero page backwards, from what we see on Linux down to Xen's setup. Xen's entry into x86 Linux for PV guest types (PV, PVH) is set up and annotated on the ELF binary as an ELF note, in particular XEN_ELFNOTE_ENTRY. On Linux this is visible in arch/x86/xen/xen-head.S as follows:

ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          _ASM_PTR startup_xen)

startup_xen is the first entry point of code used on Linux by Xen PV guest types; it's defined earlier in the asm code in the same file. Its implementation is rather simple, enough so that we can include it here:

#ifdef CONFIG_X86_32
   mov %esi,xen_start_info
   mov $init_thread_union+THREAD_SIZE,%esp
#else
   mov %rsi,xen_start_info
   mov $init_thread_union+THREAD_SIZE,%rsp
#endif
   jmp xen_start_kernel

On x86-64 this stores what the hypervisor left in rsi into xen_start_info, then sets up the stack pointer before jumping to the first C Linux entry point, xen_start_kernel. The Xen hypervisor must have set up rsi beforehand. This is a bit different from what we expected...

Let's backtrack and show what a sane Linux kernel developer perhaps expected you to set up; to do this let's look at how lguest loads Linux. lguest's launcher is implemented in tools/lguest/lguest.c. Of interest to us, it parses the file we pass it as a Linux kernel binary and tries to launch it via load_kernel(). Read load_bzimage(): it reads the kernel passed in, checks that the magic string is present, and loads the zero page from the file onto its own memory's zero page, finally returning boot.hdr.code32_start. This latter part is used to kick off control into the kernel as its starting entry point. But of importance to us as well is that the zero page was read from the file, and used as a base to set up the "zero page". The lguest zero page is further customized after load_kernel(); let's see a few entries below.

int main(int argc, char *argv[])
{
	...
	/* Boot information is stashed at physical address 0 */
	boot = from_guest_phys(0);
	...
	/* Map the initrd image if requested (at top of physical memory) */
	if (initrd_name) {
		initrd_size = load_initrd(initrd_name, mem);

		/* start and size of the initrd are expected to be found */
		boot->hdr.ramdisk_image = mem - initrd_size;
		boot->hdr.ramdisk_size = initrd_size;

		/* The bootloader type 0xFF means "unknown"; that's OK. */
		boot->hdr.type_of_loader = 0xFF;
	}

	/*
	 * The Linux boot header contains an "E820" memory
	 * map: ours is a simple, single region.
	 */
	boot->e820_entries = 1;
	boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM });

	/*
	 * The boot header contains a command line pointer:
	 * we put the command line after the boot header.
	 */
	boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1);

	/*
	 * We use a simple helper to copy the arguments
	 * separated by spaces.
	 */
	concat((char *)(boot + 1), argv+optind+2);

	/* Set kernel alignment to 16M (CONFIG_PHYSICAL_ALIGN) */
	boot->hdr.kernel_alignment = 0x1000000;

	/*
	 * Boot protocol version: 2.07 supports the
	 * fields for lguest.
	 */
	boot->hdr.version = 0x207;

	/*
	 * The hardware_subarch value of "1" tells the
	 * Guest it's an lguest.
	 */
	boot->hdr.hardware_subarch = 1;
	...
And that's how sane Linux kernel developers expected you to do Linux kernel loading. Why does Xen's setup look so odd? What's this xen_start_info crap? Let's brace ourselves and dare to have a look at the Xen hypervisor setup code.

Xen defines what it ends up putting into xen_start_info through a data structure it calls struct start_info, defined in xen/include/public/xen.h; it refers to this as the "Start-of-day memory layout". Of interest to us is who sets this up; for x86-64 this is done via vcpu_x86_64(), the relevant parts of which are listed below.

memset(ctxt, 0, sizeof(*ctxt));
ctxt->user_regs.rip = dom->parms.virt_entry;
ctxt->user_regs.rsp = dom->parms.virt_base +
  (dom->bootstack_pfn + 1) * PAGE_SIZE_X86;  
ctxt->user_regs.rsi = dom->parms.virt_base +
  (dom->start_info_pfn) * PAGE_SIZE_X86;

The dom's params are set up via xc_dom_parse_bin_kernel(); as with lguest there is a file parser which is used to set up, and extend, some information, but it never really sets up the zero page. Instead it sets up its own set of data structures representing the struct start_info. It turns out the setting up of the zero page for PV guests is done once Linux is running, inside Linux kernel code, at the first Xen C entry point for Linux: xen_start_kernel() in arch/x86/xen/enlighten.c!

/* First C function to be called on Xen boot */
asmlinkage __visible void __init xen_start_kernel(void)
{
	if (!xen_start_info)
		return;
	...
	/* Poke various useful things into boot_params */
	boot_params.hdr.type_of_loader = (9 << 4) | 0;
	boot_params.hdr.ramdisk_image = initrd_start;
	boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
	boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);
	...

It's not documented, so I can only infer that the architectural reason for this was to account for the different operating systems Xen has to support: it's perhaps easier to work with a generic data structure, populate that, and then have each kernel-specific solution parse it out. While this might have been an original design consideration, it has also resulted in a diverging entry point for Linux, which, as I highlighted recently in my last post on dead code and pv_ops, isn't ideal for Linux. The challenge for any alternative is to not be disruptive, to remain compatible, to not extend pv_ops, and to provide a generic solution which might be useful elsewhere.

Thursday, December 10, 2015

Avoiding dead code: pv_ops is not the silver bullet

This is part I - for part II - see "Xen and the Linux x86 zero page"

"Code that should not run should never run"

The fact that code that should not run should never run seems stupid and obvious, but it turns out it's actually easier said than done on very large software projects, particularly the Linux kernel. One term for this is "dead code". The amount of dead code on Linux has increased over the years due to the desire by Linux distributions to have a single Linux kernel binary work in different run time environments. The size and complexity of certain features increases the difficulty of proving that dead code never runs. Using a single kernel binary is desirable given that the alternative is different Linux kernel binary packages for each major custom run time environment, which among other things means testing and validating multiple kernels. A really complex modern example, which this post will focus on, is the dead code that is possible as a consequence of how we handle support for different hypervisors in the Linux kernel. The purpose of this post is to create awareness about the problem; clean resolutions to these problems have already been integrated upstream for a few features, and you should be seeing a few more soon.

Back in the day you needed a custom kernel binary if you wanted a kernel with specific hypervisor support. To solve this, the Linux kernel paravirtualization operations, aka paravirt_ops, or even shorter just pv_ops, was chosen as the mechanism to enable different hypervisor solutions to co-exist with a single kernel binary. Although pv_ops was welcomed with open arms back in the day as a reasonable compromise, these days just the mention of "pv_ops" to any kernel developer will cause a cringe. There are a few reasons to hate pv_ops these days, and given the praise it got back in the day it's perhaps confusing why people hate it so much now; this deserves some attention. Below are a few key reasons why developers hate pv_ops today.

  • pv_ops was designed at a time when hardware assisted virtualization solutions were relatively new, and it remained unclear how fully paravirtualized solutions would compare. KVM is a hypervisor solution that requires hardware assisted virtualization. These days even originally fully paravirtualized hypervisor solutions such as the Xen hypervisor have integrated support for the hardware virtualization extensions put out by several hardware vendors. This makes it difficult to label hypervisors that are no longer "fully paravirtualized"; the different possibilities of what can be paravirtualized versus dealt with by hardware have given rise to a slew of different types of paravirtualized guests. For instance, Xen now has PV, HVM, PVH; check out the virtualization spectrum page for a clarification of how each of these vary. What remains clear though is that hardware assisted virtualization features have been welcomed, and in the future you should count on all new systems running virtualization to take advantage of them. In the end Xen PVH should provide that sweet spot for the best mixture of "paravirtualization" and hardware virtualization. Architectures whose hypervisor support was developed after hardware assisted virtualization solutions were in place can support different hypervisors without pv_ops; such is the case for ARM, which supports both KVM and Xen on ARM. In this light, pv_ops is in a way a thing of the past. If Xen slowly deprecates and finally removes fully paravirtualized PV support from the Linux kernel, Konrad has noted that at the very least we could deprecate the pv_ops MMU components.
  • Although pv_ops was conceived as an architecture agnostic solution for supporting different hypervisors, hardware assisted virtualization solutions are now common and evidence shows you can support different hypervisors cleanly without pv_ops. Xen support on ia64 was deprecated and removed, and with it pv_ops was removed from ia64; x86 is now the only remaining architecture using pv_ops.
  • Collateral: changes to pv_ops can cause regressions and can impact code for all x86-64 kernel solutions, so kernel developers are extremely cautious about making additions and extensions, and even about adding new users. To what extent do we not want extensions to pv_ops? Well, Rusty Russell wrote the lguest hypervisor and launcher code, not only to demo pv_ops but also to set sanity on how folks should write hypervisors for Linux using pv_ops. Rusty wrote this with only 32-bit support though. Although there has been interest in developing 64-bit support for lguest, it's simply not welcomed, for at least one of the reasons stated above -- as per hpa: "extending pv_ops is a permanent tax on future development". With the other reasons listed above, this is even more so. If you want to write a demo hypervisor with 64-bit support on x86, the approach to take is to write it with all the fancy new hardware virtualization support, avoiding pv_ops as much as is humanly possible.

So pv_ops was originally the solution put in place to help support different hypervisors on Linux through an architecture agnostic solution. These days, provided we can phase out full Xen PV support, we should strive to keep only what we need to provide support for Xen PVH and the other hardware assisted hypervisors.

The only paravirtualized hypervisors supported upstream in the Linux kernel are Xen for PV guest types (PV, PVH) and the demo lguest hypervisor. lguest is just demo code though; I'd hope no one is using it in production... I'd be curious to hear... Assuming no one sane is using lguest as a production hypervisor and we can phase it out, that leaves Xen PV solutions as the remaining thing to study to see how we can simplify pv_ops. A highly distinguishing factor of Xen PV guest types (Xen PV, Xen PVH) is that they have a unique, separate entry point into Linux when Linux boots on x86; Xen PV and Xen PVH guest types share this same entry path. That is, even if we wanted to remove as much as possible from pv_ops, we'd still currently have to take into account that Xen's modern ideal solution, with a mixture of "paravirtualization" and hardware virtualization, uses this separate entry path. Without going into much detail, the different entry points and how x86-64 init works can be summarized as follows.

Bare metal, KVM, Xen HVM                      Xen PV / dom0
    startup_64()                             startup_xen()
           \                                     /
   x86_64_start_kernel()                 xen_start_kernel()
                        \               /
                           [   ...        ]
                           [ setup_arch() ]
                           [   ...        ]

Although this is a small difference, it actually can have a huge impact on possible "dead code". You see, prior to pv_ops different binaries were compiled, and features and solutions which you knew you would not need could simply be negated via Kconfig. These negations were not done upstream -- they were only implemented and integrated in SUSE kernels, as SUSE was perhaps the only enterprise Linux distribution fully supporting Xen. Doing these negations ensured that code we determined should never run never got compiled in. Although this Kconfig solution was never embraced upstream, that doesn't mean the issue didn't exist upstream -- quite the contrary, it obviously did; there was just no clean proposed solution to the problem, and frankly no one cared much about resolving it properly. However, an implicit consequence of embracing pv_ops and supporting different hypervisors with one binary is that we're now forced to always have large chunks of code enabled in the Linux kernel, some of which we know should not run once we know which path we're taking in the init tree above. Code cannot be compiled out, as the differences are now handled at run time. Prior to pv_ops the Kconfig solution negated features that should not run on Xen, so issues would come up at compile time and could be resolved that way. This Kconfig solution was in no way a proactive solution, but it's how Xen support in SUSE kernels was managed. Using pv_ops means we need this resolved through alternative, upstream-friendly means.

Next are a just a few examples of dead code concerns I have looked into but please note that there are more, I also explain a few of these. Towards the end I explain what I'm working on to do about some of these dead code concerns. Since I hope to have convinced you that people hate pv_ops, the challenge here is to come up with a really clean generic solution that 1) does not extend pv_ops, and 2) could also likely be repurposed for other areas of the kernel.
  • MTRR
  • IOMMU - initialization (resolved cleanly), IOMMU API calls, IOMMU multifunction device conflict, exposed IOMMU ACPI tables (Intel VT-d)
  • Microcode updates - both early init and changes at run time
As I've studied some of the dead code concerns for some of the above features I've also identified an issue when the main x86 entry path is modified for x86-64 but the Xen's init path is forgotten. When this happens in the worst case you end up crashing Xen. I list two of these cases, one of which is still an issue for Xen. I call these init mismatch issues.
  • cr4 shadow
  • KASan
So both dead code concerns, and init mismatch issues can break things, sometimes really really badly. Some of the solutions in place today and some that will be developed are what I like to refer to as paravirtualization yielding solutions. When reviewing some of these issues below, keep in mind that this is essentially what we're doing, it should help you understand why we're doing what we're doing, or why we need some more work in certain areas of the kernel.

Death to MTRR:

MTRR is an example type of code that we know should not run on when we boot Linux for Xen dom0 or as a guest given that on Linux upstream we never implemented a solution to deal with MTRR with the hypervisor. MTRR calls however are a case that in most cases are not fatal if they fail, typically if MTRR calls fail you'd suffer performance. Since MTRR is really old, we had the option to either add MTRR Linux hypervisor call support for Xen, or work on an alternative that avoided MTRR somehow amicably. Fortunately a long time ago Andy Lutomirski figured we could replace direct MTRR calls with a no-op when on PAT capable systems, provided you also used a PAT friendly respective ioremap call. So he added arch_phys_wc_add() to be used in combination with ioremap_wc(). This solved it for write-combining MTRR calls. He did a bit of the driver conversions needed for this work, it however was never fully completed. If you're following my development upstream you may have noticed that among other things for MTRR I completed where Andy left off, replacing all direct users of write combining MTRR calls upstream on Linux with an architecture agnostic write-combining call, arch_phys_wc_add(), in combination with ioremap_wc(). Instead of adding Linux MTRR hypervisor calls we now have a wrapper which will call MTRR only when we know that is needed, and instead PAT interfaces are used when available. Addressing write-combining MTRR is just one small example though of what we needed to address, there are other types of MTRRs you could use, and in the worst cases they were being used in incredibly hackish, but functional ways. For instance in one case one driver was using two overlapping MTRRs, in the worst case PCI Bar was of 16  MiB but the MMIO region for the device was in the last 4 KiB of the same PCI BAR. 
You want to avoid write-combining on MMIO regions, but MTRR ranges must be power-of-two sized and aligned, so a single write-combining MTRR that avoided the MMIO region could cover at most 8 MiB, losing out on the rest of graphics memory, while a full 16 MiB write-combining MTRR would write-combine the MMIO region. The implemented hacky MTRR solution was to issue a 16 MiB write-combining MTRR followed by a 4 KiB UC MTRR on top of it. There were also two overlapping ioremap() calls for this driver. The PAT-friendly resolution included adding ioremap_uc() upstream, which sets PCD=1, PWT=1 on non-PAT systems and uses the PAT value of UC on PAT systems. We used this for the MMIO region; doing so ensures that if you then issue an MTRR over this region, the MMIO region remains unaffected. The framebuffer was also carved out cleanly, with ioremap_wc() used on it. For details refer to:

x86/mm, asm-generic: Add IOMMU ioremap_uc() variant default
drivers/video/fbdev/atyfb: Carve out framebuffer length fudging into a helper
drivers/video/fbdev/atyfb: Clarify ioremap() base and length used
drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC
drivers/video/fbdev/atyfb: Use arch_phys_wc_add() and ioremap_wc()

But that's not all... even if all drivers have been converted over to never issue MTRR calls directly, the BIOS might still issue MTRRs on bootup, and the kernel has to know about that to avoid conflicts with PAT. More work on this front is therefore needed, but at least the crusade to remove direct access to MTRR was completed on Linux as of v4.3.


A really clean solution to dead code, although dead code wasn't the only reason this went upstream, came from how IOMMU initialization code was handled with the IOMMU_INIT macros and struct iommu_table_entry. The solution in place had to account for different dependencies between IOMMU code; this dependency map is best explained by a diagram.

         +----[swiotlb *]--+
        /         |         \
       /          |          \
    [GART]     [Calgary]  [Intel VT-d]

Dependencies are annotated, detection routines are made available, and a sort routine makes everything execute in the right order. The full dependency map is handled at run time; to review some of the implementation check out git log -p 0444ad93e..ee1f28, and just read the code. When this code was proposed hpa actually suggested that this sort of problem is common enough that perhaps a generic solution could be implemented on Linux, and that the solution developed by the gPXE folks might be a good one to look at. As neat as this is, it still doesn't address all concerns. Expect to see some possible suggested updates in this area.

Microcode updates:

A CPU often needs software updates; these are known as CPU microcode updates. If you are using a hypervisor, though, your hypervisor should take care of these updates for you, as a guest should not have to fix real hardware. Additionally, if you do enable a guest to do updates on behalf of a full system you may want to be selective about which guests are allowed to do this. Then there are the run-time update considerations. Some CPU microcode updates might disable some CPU ops; if you do this on a hypervisor with code already running, some code might break as it assumes those CPU ops are still valid. This could cause some unexpected situations for guests. Doing run-time CPU microcode updates after a system has booted should therefore be avoided, and only done if you are 100% certain you can do it and you have full hardware and software vendor support for it. The CPU microcode update must be designed for a run-time update. As far as Linux is concerned, we avoid enabling CPU microcode updates by bailing out of the CPU microcode init code if pv_enabled() returns true. This works, but it turns out it is not an ideal solution: pv_enabled() really should probably be renamed to something such as pv_legacy(), as it only returns true if you have a legacy PV solution. Expect some updates on this upstream soon. If folks desire run-time CPU microcode updates on Xen, work is required on the Xen side to copy the buffer to Xen, scan the buffer for the correct patch, and finally rendezvous all online CPUs in an IPI to apply the patch, keeping the processors in until all have completed it. I hacked up a version for the hypervisor which just does quiescing by pausing domains; that obviously needs more work, and someone interested should pick up on it. Refer to Xen microcode updates for Xen-specific documentation or to read the latest notes on developing this for the Xen hypervisor. At this time, it's not clear where KVM keeps this documentation.

Init mismatch issues:

We have dual entry points on x86, and we have to live with that for now, but at times this is overlooked and it can happen to the best of us. For instance, when Andy Lutomirski added support to shadow CR4 per CPU on the x86-64 init path he forgot to add a respective call for Xen. This caused a crash on all Xen PV guests and dom0. Boris Ostrovsky fixed this for 64-bit PV(H) guests. I'm told code review is supposed to catch these issues, but I'm not satisfied; the fix here was purely reactive. We could and should do better. A perfect example of further complications is when Linux got KASan support, the kernel address sanitizer. Enabling KASan on x86 will crash Xen today, and this issue is not yet fixed. We need a proactive solution. If we could unify init paths, would that help? Would that be welcomed? How could that be possible?

What to do

The purpose of this post is to create awareness of what dead code is, to make you believe it's real and important, and to argue that if we could come up with a clean solution we could probably re-use it for other purposes -- it should be welcomed. I'm putting a lot of emphasis on dead code and init mismatch issues since without this post I probably would not be able to talk to anyone about them and expect them to understand what I'm talking about, let alone have them understand the importance of the issue. The virtualization world is likely not the only place that could use a solution to some of these dead code concerns. I'll soon be posting RFCs for a possible mechanism to help with this; if you want a taste of what this might look like, you can take a peek at the userspace table-init mockup solution that I've implemented. In short, it's a merge of what the gPXE folks implemented with what Konrad worked on for IOMMU initialization, giving us the best of both worlds.