Tuesday, December 22, 2015

Linux asynchronous probe - let's try this again


Updated on 2016-01-19 with a description of how systemd limits the number of devices on a Linux system, and with references to asynchronous work on memory. Edits reflected in this color.

Hipster and trendy init systems want to boot really fast. As of v4.2 the Linux kernel sports asynchronous probe support (this fix posted December 19, 2015 is needed for use of the generic async_probe module parameter). This isn't the first time this type of work has been attempted on Linux though: this lwn article claims that a long time ago some folks tried to enable asynchronous probe and that ultimately it was reverted due to a large number of issues. One major difference with the new solution is that it's opt-in: userspace or drivers must specifically request it for a driver. We also support blacklisting of asynchronous behavior by annotating that a driver requires synchronous probe. All this enables new shiny hipster userspace, while remaining compatible with old userspace and its expectations. At the 2015 Kernel Summit it became apparent a few folks still had questions over this, so I decided to write this post to recap why the work was done, its caveats, and its relationship with returning -EPROBE_DEFER from your probe routine to make use of the kernel's deferred probe mechanism, to help with testing and productizing asynchronous probe, and to explain a bit of the short term and long term road-map. This post also collects a bit of the history of what gave rise to Linux asynchronous probe, which I think we can use as a small educational experience on how we can better evolve systemd in the community.


First to be clear -- asynchronous probe isn't supposed to magically make your kernel boot faster. It should however help if you happen to have any driver which for whatever reason does a lot of work in its probe routine. Even if that's not the case, at times using asynchronous probe can shave down kernel boot time, even if minimally. Other times it may have no impact at all, or you may even see a small increase for any number of reasons. A clear but not obvious gain is the increase in the number of devices a device driver can support; this is explained below. Since this is a new feature we simply don't have enough metrics and test coverage yet to determine how widely helpful it can be, or what issues could creep up, however it was clear some folks wanted and needed it. More importantly, using it can also get driver developers and subsystem maintainers thinking about different asynchronous behavior considerations in the kernel, which long term should help us in the community. An example: although asynchronous probe should help with long probes, we recently determined that you should by no means consider it a solution if your driver needs to load firmware on probe and you have experienced race issues between this and the filesystem being mounted -- that problem needs to be resolved separately (see this firmware_class feature enhancement wiki and this common kernel file loader wiki page for more details and ideas). In lieu of concrete bullet proof solutions for that problem you might be tempted to think asynchronous probe could help, and you'd be correct, but you should be aware that this is not a rock solid solution to such problems. It'd be a hack, and this is why it's incorrect to use asynchronous probe to try to fix that problem.
Another example is how this raises the question of where else we should be using asynchronous mechanisms, and how we resolve any possible run time dependency issues.

Asynchronous probe support was added for a few reasons, the last three listed here being the major driving factors for getting this developed and merged upstream.
  • Over time there's been a general interest in reducing the kernel's boot time
  • A long time ago in a galaxy far far away... systemd made a really well-intentioned but ultimately incorrect assumption that device driver initialization should take less than 30 seconds; to be more specific, the driver's init routine should not take more than 30 seconds. Even as issues started to creep up, quite a few systemd and kernel developers voiced strong support for it being a reasonable timeout value. Some users were really upset over this though -- driver loading was being killed after 30 seconds, preventing some drivers from loading completely, and in the worst cases, if the driver at fault was a storage driver, you would not even be able to boot Linux. Because of the strong agreement in both camps there were no exceptions to this rule, and the consensus seemed to be that a lot of drivers should simply be fixed. One puzzle was that issues over drivers being killed due to the timeout were only reported circa 2014, but the timeout had been in place in systemd for a long time. The reason for this was that commit 786235eeba0 by Tetsuo Handa ("kthread: make kthread_create() killable") enabled kthread_create() to be killed; this was done in particular to enable out of memory killers to kill these types of threads (refer to this lwn article for more details). Prior to this kernel change, the 30 second timeout was never an issue for systemd users given that the SIGKILL signal was never actually respected for these types of threads. Even though the Linux kernel now has asynchronous probe support, the original systemd 30 second timeout caused enough headaches for users that on July 29, 2014 Hannes Reinecke ended up merging a way to enable Linux distributions to override the timeout through the command line, refer to systemd commit 9719859c07aa13 ("udevd: add --event-timeout commandline option").
That didn't seem to be enough to help users so on August 30, 2014 Kay Sievers bumped the timeout to 60 seconds via systemd commit 2e92633dbae ("udev: bump event timeout to 60 seconds"). In the end though, on September 10, 2014 Tom Gundersen modified the default timeout to 180 seconds via systemd commit b5338a19864a ("udev: timeout - increase timeout"), the purpose of the timeout, as per the commit log message, now is "to make sure that nothing stays around forever". To help capture in logs possible faulty drivers (or any jobs dispatched) Tom Gundersen also made systemd spit out a warning after 1/3 of the timeout value before killing it via systemd commit 671174136525ddf2 ("udev: timeout - warn after a third of the timeout before killing").
  • It turns out though that... Linux batches calling a driver's init routine and immediately after that its probe routine, synchronously, so naturally any delays on probe contribute to the overall delay as well. So the systemd timeout is in effect for the run time combination of both init and probe of a device driver. If we provide a way for userspace to ask the driver core to detach these and call probe asynchronously we'd be giving systemd what it thought, and a few kernel developers thought, was actually in place.
  • A delay on your probe means delaying user experience at boot time. If you know off hand your driver might take a while to load, preemptively annotating this on your driver can mean giving users a better user experience. Dmitry Torokhov ran into this issue while working on productizing a solution for a popular company where fast boot and a good user experience were critical.
  • It turns out that... a systemd timeout on the kmod loader (loading modules) has an effect not only on the combination of init + probe of device drivers. Since the kernel serially probes all devices in the same code path, if you probe 2 or more devices the amount of time taken to load your driver will be init time + (number of devices * probe time for each device). What this means is that the systemd timeout also places an upper bound on the number of devices you can use on a system. This is bound by the driver's init and probe time, and can be computed as follows:

      number_devices = systemd_timeout / (max known probe time for driver)
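
To make the bound concrete, here is a minimal C sketch of the arithmetic above. The function name max_devices_under_timeout() and the example timings are hypothetical; this is not a systemd or kernel interface. It simply solves init time + (number of devices * probe time) <= timeout for the number of devices, which reduces to the formula above when init time is negligible.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a systemd or kernel API): the largest number of
 * devices whose serialized probes still fit within the udev event timeout.
 * Solves: init_secs + n * probe_secs <= timeout_secs, for n.
 */
static unsigned int max_devices_under_timeout(unsigned int timeout_secs,
                                              unsigned int init_secs,
                                              unsigned int probe_secs)
{
	assert(probe_secs > 0); /* a zero-length probe gives no bound */

	if (init_secs >= timeout_secs)
		return 0; /* init alone already exhausts the timeout */

	return (timeout_secs - init_secs) / probe_secs;
}
```

Assuming a driver whose probe takes 4 seconds per device and whose init time is negligible, the original 30 second timeout would allow 7 devices, while the later 180 second default would allow 45.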

Drivers can be built-in to the kernel or built as modules so you can load them after the kernel boots as independent and self contained objects. It turns out that in practice striving towards having all modules be probed asynchronously tends to work pretty well, whereas having all built-in drivers probe asynchronously will crash your kernel with a high degree of certainty. This latter issue has to do with the fact that as the kernel boots certain assumptions may be made which are not satisfied early on, and there's currently no easy way to order this well. It's similar to why the deferred probe mechanism was added to the kernel -- the kernel doesn't always have dependency information well sorted out. But fret not, future work should help with this, and such work should help curtail uses of deferred probing and enable more broad asynchronous probe use.


If you are in control of both hardware and software, that is you have engineers you can pay to productize a solution, you could likely engineer a solution to vet and ensure boot will happen properly and in order for both built-in drivers and modules on your kernel. There is no easy way to do this, and it is difficult to estimate the amount of work required for a device, but if you want to try it -- you can use this out of tree debug-async patch and then use the kernel parameters documented there. I summarize them here. Note that using either of these will taint your kernel.

  • __DEBUG__kernel_force_builtin_async_probe - async probe all built-in drivers
  • __DEBUG__kernel_force_modules_async_probe - async probe all modules

If you don't have the luxury of having dedicated hardware and software engineers you could at the very least enable all modules to probe asynchronously, hope for the best, and report any issues if found. It's after all what systemd, and what a lot of developers (many kernel developers included), originally thought was happening, so naturally bug reports to the driver maintainer are welcome if any issues occur. Soon you may see Linux distributions enabling asynchronous probe by default for all modules. The way I'd implement this on systemd is to enable a Linux distribution to opt-in to async_probe for specific kernels. Given that a fix is needed for using the generic async_probe module parameter though, one should only use it if this fix has been merged. This makes it tricky to detect if the module parameter is properly supported or not; enabling it and booting on an older kernel might cause a crash.

Getting drivers to load correctly is just one step. Remember that prior to asynchronous probe some userspace expected some device functionality to be available immediately after loading a driver. With asynchronous probe that is no longer the case, so userspace must be vetted and tested to ensure it does not rely on synchronous loading of the drivers.


If you're a driver developer and know that your driver takes a while to boot, you should be aware that it can delay boot / user experience, so you likely should consider annotating on your driver that it prefers asynchronous probe in the driver's source. You can do so as follows:

static struct pci_driver foo_pci_driver = {
      /* ... the driver's other fields go here ... */
      .driver.probe_type = PROBE_PREFER_ASYNCHRONOUS,
};

An alternative (provided you have this fix merged) is to pass the generic "async_probe" module parameter to the module you want to load, for instance:

modprobe cxgb4 async_probe


Sadly a few drivers cannot work with asynchronous probe at all today, so after testing, if it poops out, you should annotate this sort of hard incompatibility. You can do so as follows:

static struct pci_driver foo_pci_driver = {
       /* ... the driver's other fields go here ... */
       .driver.probe_type = PROBE_FORCE_SYNCHRONOUS,
};

It should be made clear that this sort of incompatibility should be seen more as an issue to fix -- if your driver fails at using asynchronous probe, chances are that the cause is some subtle architectural design flaw in the driver or its dependencies. Fixing it may not necessarily be easy, and it's precisely for this reason that we have such a flag to force synchronous probe. Our hope though is that with time we can phase these issues out.


Even if we manage to get all drivers working with asynchronous probe we cannot remove synchronous probe, as old userspace exists which relies on such behavior; removing synchronous probe support would break old userspace. What we can strive for long term though is to enable new userspace as best as possible and deal with all asynchronous issues as they come up. This will take time and serious effort. Over time you should be seeing more work in this area across subsystems, internals, and perhaps even architecture work. Just to give you a taste and an example of this type of work, you can review the recent asynchronous work by Mel Gorman on memory init through commits 1e8ce83cd17fd0f549a7ad145ddd2bfcdd7dfe37..0e1cc95b4cc7293bb7b39175035e7f7e45c90977, please note these also have a few follow-on fixes. Lastly, obviously some systemd design decisions should be taken with a grain of salt, but they seem to be very well-intentioned. We could use a bit more open and objective communication and design review between kernel developers and systemd developers. The smoother this gets, the smoother the experience we provide to users should be.

Monday, December 14, 2015

Xen and the x86 Linux zero page

This is part II, for part I - refer to "Avoiding dead code: pv_ops is not the silver bullet".

On x86 Linux the boot sequence is rather complicated, so much so that it has its own dedicated boot protocol. This is documented upstream on Documentation/x86/boot.txt. The protocol tends to evolve as the x86 architecture evolves, in order to account for new features or extensions which we may need to learn about at boot time. Of interest to this post is the "zero page". The first step when loading a Linux kernel is to load the "zero page", which consists of the structure struct boot_params, defined in arch/x86/include/uapi/asm/bootparam.h. It's called the zero page since, unless you're relocating data around, the zero page is the first physical page of the operating system. The x86 boot protocol originally only had to support a 16-bit boot protocol, and to do this it first required loading the real-mode code (boot sector and setup code). For modern bootloaders what needs to be loaded is a bit larger, but new bootloaders must still load the same original real-mode code. The struct boot_params accounts for this evolution in requirements; the real-mode section is what is defined in the struct setup_header. The zero page is not only something which we must load, it's also part of the actual bzImage we build on x86. One can therefore read a kernel file's struct boot_params as well to extract some details of the kernel. To try this you can play around with parse-bzimage, part of the table-init tree on github. All this sort of stuff is what bootloaders end up working with. Since hypervisors can also boot Linux they must also somehow do the same. This post is about Xen's zero-page setup design, which we'll contrast with lguest's zero page setup. lguest is a demo 32-bit hypervisor on Linux.
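
As a rough illustration of reading a kernel file's boot header, here is a minimal C sketch that checks whether a buffer looks like a bzImage and extracts its boot protocol version. The offsets used (boot_flag at 0x1FE, the "HdrS" magic at 0x202, and the version at 0x206) follow the setup header layout described in Documentation/x86/boot.txt. The function names are hypothetical, and this is not how parse-bzimage is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets into the image, per the x86 boot protocol's setup header. */
#define BOOT_FLAG_OFFSET   0x1FE /* 2 bytes, must read 0xAA55 */
#define HDR_MAGIC_OFFSET   0x202 /* 4 bytes, must read "HdrS" */
#define HDR_VERSION_OFFSET 0x206 /* 2 bytes, e.g. 0x0207 for 2.07 */

/* Read a little-endian 16-bit value regardless of host endianness. */
static uint16_t read_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Hypothetical helper: returns the boot protocol version found in the
 * image (0x0207 means 2.07), or 0 if this does not look like a bzImage.
 */
static uint16_t bzimage_boot_protocol_version(const uint8_t *image, size_t len)
{
	assert(image != NULL);

	if (len < HDR_VERSION_OFFSET + 2)
		return 0; /* too short to contain the header fields */
	if (read_le16(image + BOOT_FLAG_OFFSET) != 0xAA55)
		return 0; /* missing boot sector signature */
	if (memcmp(image + HDR_MAGIC_OFFSET, "HdrS", 4) != 0)
		return 0; /* missing setup header magic */

	return read_le16(image + HDR_VERSION_OFFSET);
}
```

A loader would typically read the first few sectors of the kernel file into a buffer and run a check like this before trusting any other header field.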

If a hypervisor boots Linux it must also set up the zero page. We'll dissect Xen's setup of the zero page backwards, tracing from what we see on Linux down to Xen's setup of the zero page. Xen's entry into Linux x86 for PV type guests (PV, PVH) is set up and annotated on the ELF binary as an ELF note, in particular the XEN_ELFNOTE_ENTRY. On Linux this is visible on arch/x86/xen/xen-head.S as follows:

ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          _ASM_PTR startup_xen)

startup_xen is the first entry point of code used on Linux by Xen PV guest types; it's defined earlier in the asm code in the same file. Its implementation is rather simple, enough so that we can include it here:

#ifdef CONFIG_X86_32
   mov %esi,xen_start_info
   mov $init_thread_union+THREAD_SIZE,%esp
#else
   mov %rsi,xen_start_info
   mov $init_thread_union+THREAD_SIZE,%rsp
#endif
   jmp xen_start_kernel

On x86-64 this stores what was in rsi into xen_start_info, then points rsp at the top of init_thread_union to set up the stack before jumping to the first C Linux entry point, xen_start_kernel. The Xen hypervisor must have set up rsi before entering here. This is a bit different than what we expected...

Let's backtrack and show you what perhaps a sane Linux kernel developer expected you to set up. To do this let's look at how lguest loads Linux. lguest's launcher is implemented on tools/lguest/lguest.c. Of interest to us, it parses the file we pass it as a Linux kernel binary and tries to launch it via load_kernel(). Read load_bzimage(): it reads the kernel passed, checks that the magic string is present and loads the zero page from the file onto its own memory's zero page, finally returning boot.hdr.code32_start. This latter part is used to kick off control into the kernel as its starting entry point. But of importance to us as well is that the zero page was read from the file, and used as a base to set up the "zero page". The lguest zero page is further customized after load_kernel(), let's see a few entries below.

int main(int argc, char *argv[])
{
  /* ... */

  /* Boot information is stashed at physical address 0 */
  boot = from_guest_phys(0);

  /*
   * Map the initrd image if requested
   * (at top of physical memory)
   */
  if (initrd_name) {
    initrd_size = load_initrd(initrd_name, mem);

    /* start and size of the initrd are expected to be found */
    boot->hdr.ramdisk_image = mem - initrd_size;
    boot->hdr.ramdisk_size = initrd_size;

    /* The bootloader type 0xFF means "unknown"; that's OK. */
    boot->hdr.type_of_loader = 0xFF;
  }

  /*
   * The Linux boot header contains an "E820" memory
   * map: ours is a simple, single region.
   */
  boot->e820_entries = 1;
  boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM });

  /*
   * The boot header contains a command line pointer:
   * we put the command line after the boot header.
   */
  boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1);

  /*
   * We use a simple helper to copy the arguments
   * separated by spaces.
   */
  concat((char *)(boot + 1), argv+optind+2);

  /* Set kernel alignment to 16M (CONFIG_PHYSICAL_ALIGN) */
  boot->hdr.kernel_alignment = 0x1000000;

  /*
   * Boot protocol version: 2.07 supports the
   * fields for lguest.
   */
  boot->hdr.version = 0x207;

  /*
   * The hardware_subarch value of "1" tells the
   * Guest it's an lguest.
   */
  boot->hdr.hardware_subarch = 1;

  /* ... */
}

And that's how sane Linux kernel developers expected you to do Linux kernel loading. Why does Xen's setup look so odd? What's this xen_start_info crap? Let's brace ourselves and dare to have a look at the Xen hypervisor setup code.

Xen defines what it ends up putting into "xen_start_info" through a data structure it calls struct start_info, defined in xen/include/public/xen.h. It refers to this as the "Start-of-day memory layout". Of interest to us is who sets this up: for x86-64 this is done via vcpu_x86_64(), and the relevant parts for us are listed below.

memset(ctxt, 0, sizeof(*ctxt));
ctxt->user_regs.rip = dom->parms.virt_entry;
ctxt->user_regs.rsp = dom->parms.virt_base +
  (dom->bootstack_pfn + 1) * PAGE_SIZE_X86;  
ctxt->user_regs.rsi = dom->parms.virt_base +
  (dom->start_info_pfn) * PAGE_SIZE_X86;

The dom's params are set up via xc_dom_parse_bin_kernel(). As with lguest it has a file parser and uses this to set up some information, and it also extends some information, but it never really sets up the zero page. Instead it sets up its own set of data structures representing the struct start_info. It turns out the setup of the zero page for PV guests is done once we're already running Linux, inside Linux kernel code, on the first Xen C entry point for Linux: xen_start_kernel() on arch/x86/xen/enlighten.c!

/* First C function to be called on Xen boot */
asmlinkage __visible void __init xen_start_kernel(void)
{
  /* ... */
  if (!xen_start_info)
    return;
  /* ... */

  /* Poke various useful things into boot_params */
  boot_params.hdr.type_of_loader = (9 << 4) | 0;
  boot_params.hdr.ramdisk_image = initrd_start;
  boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
  boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);
  /* ... */
}
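
The (9 << 4) | 0 value above follows the type_of_loader encoding from Documentation/x86/boot.txt: the high nibble is the bootloader ID (9 is the ID assigned to Xen) and the low nibble is the bootloader version. The small C sketch below only illustrates that packing; the helper names are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a loader ID (high nibble) and loader version (low nibble). */
static uint8_t make_type_of_loader(uint8_t id, uint8_t version)
{
	assert(id <= 0xF && version <= 0xF); /* each field is one nibble */
	return (uint8_t)((id << 4) | version);
}

/* Extract the loader ID from a type_of_loader byte. */
static uint8_t loader_id(uint8_t type_of_loader)
{
	return type_of_loader >> 4;
}

/* Extract the loader version from a type_of_loader byte. */
static uint8_t loader_version(uint8_t type_of_loader)
{
	return type_of_loader & 0xF;
}
```

With this encoding Xen's value comes out as 0x90, while lguest's 0xFF is the "unknown" loader seen in the lguest excerpt earlier.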

It's not documented, so I can only infer that the architectural reason for this was to account for the different operating systems that Xen has to support. It's perhaps easier to work with a generic data structure, populate that, and then have the kernel specific solution parse it out. While this might have been an original design consideration, it has also implied a diverging entry point solution for Linux, which as I've highlighted recently in my last post on dead code on pv_ops, isn't ideal for Linux. The challenge for any alternative is to not be disruptive, to remain compatible, to not extend pv_ops, and to provide a generic solution which might be useful elsewhere.

Thursday, December 10, 2015

Avoiding dead code: pv_ops is not the silver bullet

This is part I - for part II - see "Xen and the x86 Linux zero page"

"Code that should not run should never run"

The fact that code that should not run should never run seems like something stupid and obvious, but it turns out that it's actually easier said than done on very large software projects, particularly on the Linux kernel. One term for code that should never run is "dead code". The amount of dead code on Linux has increased over the years due to the desire by Linux distributions to have a single Linux kernel binary work on different run time environments. The size and complexity of certain features increases the difficulty of proving that dead code never runs. Using a single kernel binary is desirable given that the alternative is different Linux kernel binary packages for each major custom run time environment we wish to use, and among other things this means testing and validating multiple kernels. A really complex modern example, which this post will focus on, is the dead code that is possible as a consequence of how we handle support for different hypervisors on the Linux kernel. The purpose of this post is to create awareness about the problem. Clean resolutions to these problems have already been integrated upstream for a few features, and you should be seeing a few more soon.

Back in the day you needed a custom kernel binary if you wanted to use the kernel with specific hypervisor support. To solve this the Linux kernel paravirtualization operations, aka paravirt_ops, or even shorter just pv_ops, was chosen as the mechanism to enable different hypervisor solutions to co-exist with a single kernel binary. Although pv_ops was welcomed with open arms back in the day as a reasonable compromise, these days just the mention of "pv_ops" to any kernel developer will cause a cringe. Given the praise for it back in the day it's perhaps confusing why people hate it so much now, so this deserves some attention. Below are a few key reasons why developers hate pv_ops today.

  • pv_ops was designed at a time when hardware assisted virtualization solutions were relatively new, and it remained unclear how fully paravirtualized solutions would compare. KVM is a hypervisor solution that requires hardware assisted virtualization. These days, even originally fully paravirtualized hypervisor solutions such as the Xen hypervisor have integrated support for the hardware virtualization extensions put out by several hardware vendors. This makes it difficult to find a term for hypervisors that are no longer "fully paravirtualized"; the different possibilities of what could be paravirtualized and what could be dealt with by hardware have given rise to a slew of different types of paravirtualized guests. For instance, Xen now has PV, HVM, PVH, check out the virtualization spectrum page for a clarification of how each of these vary. What remains clear though is that hardware assisted virtualization features have been welcomed, and in the future you should count on all new systems running virtualization to take advantage of them. In the end Xen PVH will provide that sweet spot for the best mixture of "paravirtualization" and hardware virtualization. Architectures which needed hypervisor virtualization support developed after hardware assisted virtualization solutions were in place can support different hypervisors without pv_ops. Such is the case for ARM, which supports both KVM and Xen on ARM. In this light, in a way pv_ops is a thing of the past. If Xen slowly deprecates and finally removes fully paravirtualized PV support from the Linux kernel, Konrad has noted that at the very least we could deprecate pv_ops MMU components.
  • Although pv_ops was conceived as an architecture agnostic solution in order to support different hypervisors, hardware assisted virtualization solutions are now common and evidence shows you can support different hypervisors cleanly without pv_ops. Xen support on ia64 was deprecated and removed, and so pv_ops was also removed from ia64. x86 is now the only remaining architecture using pv_ops.
  • Collateral: changes to pv_ops can cause regressions and can impact code for all x86-64 kernel solutions, as such kernel developers are extremely cautious about making additions, extensions, and even adding new users. To what extent do we not want extensions to pv_ops? Well, Rusty Russell wrote the lguest hypervisor and launcher code, and he did this not only to demo pv_ops but also to set a sane example of how folks should write hypervisors for Linux using pv_ops. Rusty wrote this with only 32-bit support though. Although there has been interest in developing 64-bit support on lguest, it's simply not welcomed, for at least one of the reasons stated above -- as per hpa: "extending pv_ops is a permanent tax on future development". With the other reasons listed above, this is even more so. If you want to write a demo hypervisor with 64-bit support on x86 the approach you could take is to write it with all the fancy new hardware virtualization support, and you should try avoiding pv_ops as much as is humanly possible.

So pv_ops was originally the solution put in place to help support different hypervisors on Linux through an architecture agnostic solution. These days, provided we can phase out full Xen PV support, we should strive to only keep what we need to provide support for Xen PVH and the other hardware assisted hypervisors.

The only paravirtualized hypervisors supported upstream on the Linux kernel are Xen for PV guest types (PV, PVH) and the demo lguest hypervisor. lguest is just demo code though, and I'd hope no one is using it in production... I'd be curious to hear... Assuming no one sane is using lguest as a production hypervisor and we could phase it out, that leaves us with Xen PV solutions as the remaining solution to study to see how we can simplify pv_ops. A highly distinguishing factor of Xen PV guest types (Xen PV, Xen PVH) is that they have a unique separate entry point into Linux when Linux on x86 boots. Xen PV and Xen PVH guest types share this same entry path. That is, even if we wanted to try to remove as much as possible from pv_ops, we'd still currently have to take into account that Xen's modern ideal solution with a mixture of "paravirtualization" and hardware virtualization uses this separate entry path. Without going into much detail, the different entry points and how x86-64 init works can be summarized as follows.

Bare metal, KVM, Xen HVM                      Xen PV / dom0
    startup_64()                             startup_xen()
           \                                     /
   x86_64_start_kernel()                 xen_start_kernel()
                        \               /
                           [   ...        ]
                           [ setup_arch() ]
                           [   ...        ]

Although this is a small difference, it actually can have a huge impact on possible "dead code". You see, prior to pv_ops different binaries were compiled, and features and solutions which you knew you would not need could simply be negated via Kconfig. These negations were not done upstream -- they were only implemented and integrated on SUSE kernels, as SUSE was perhaps the only enterprise Linux distribution fully supporting Xen. Doing these negations ensured that code we determined should never run never got compiled in. Although this Kconfig solution was never embraced upstream, it doesn't mean the issue didn't exist upstream. Quite the contrary, it obviously did; there was just no clean proposed solution to the problem and frankly no one cared too much about resolving it properly. However an implicit consequence of embracing pv_ops and supporting different hypervisors with one binary is that we're now forced to have large chunks of code always enabled in the Linux kernel, some of which we know should not run once we know which path we're taking on the init tree above. Code cannot be compiled out, as our differences are now handled at run time. Prior to pv_ops the Kconfig solution was used to negate features that should not run when on Xen, so issues would come up at compile time and could be resolved this way. This Kconfig solution was in no way a proactive solution, but it's how Xen support on SUSE kernels was managed. Using pv_ops means we need this resolved through alternative upstream friendly means.

Next are just a few examples of dead code concerns I have looked into, but please note that there are more. I also explain a few of these. Towards the end I explain what I'm working on to address some of these dead code concerns. Since I hope to have convinced you that people hate pv_ops, the challenge here is to come up with a really clean generic solution that 1) does not extend pv_ops, and 2) could also likely be repurposed for other areas of the kernel.
  • MTRR
  • IOMMU - initialization (resolved cleanly), IOMMU API calls, IOMMU multifunction device conflict, exposed IOMMU ACPI tables (Intel VT-d)
  • Microcode updates - both early init and changes at run time
As I've studied some of the dead code concerns for some of the above features I've also identified an issue that arises when the main x86 entry path is modified for x86-64 but Xen's init path is forgotten. When this happens, in the worst case you end up crashing Xen. I list two of these cases, one of which is still an issue for Xen. I call these init mismatch issues.
  • cr4 shadow
  • KASan
So both dead code concerns, and init mismatch issues can break things, sometimes really really badly. Some of the solutions in place today and some that will be developed are what I like to refer to as paravirtualization yielding solutions. When reviewing some of these issues below, keep in mind that this is essentially what we're doing, it should help you understand why we're doing what we're doing, or why we need some more work in certain areas of the kernel.

Death to MTRR:

MTRR is an example of code that we know should not run when we boot Linux as Xen dom0 or as a guest, given that on Linux upstream we never implemented a solution to deal with MTRR with the hypervisor. MTRR calls however are in most cases not fatal if they fail; typically if MTRR calls fail you'd suffer in performance. Since MTRR is really old, we had the option to either add MTRR Linux hypervisor call support for Xen, or work on an alternative that avoided MTRR somehow amicably. Fortunately a long time ago Andy Lutomirski figured we could replace direct MTRR calls with a no-op when on PAT capable systems, provided you also used a respective PAT friendly ioremap call. So he added arch_phys_wc_add() to be used in combination with ioremap_wc(). This solved it for write-combining MTRR calls. He did a bit of the driver conversions needed for this work, it however was never fully completed. If you're following my development upstream you may have noticed that among other things for MTRR I completed where Andy left off, replacing all direct users of write-combining MTRR calls upstream on Linux with an architecture agnostic write-combining call, arch_phys_wc_add(), in combination with ioremap_wc(). Instead of adding Linux MTRR hypervisor calls we now have a wrapper which will call MTRR only when we know that is needed, and PAT interfaces are used instead when available. Addressing write-combining MTRR is just one small example though of what we needed to address. There are other types of MTRRs you could use, and in the worst cases they were being used in incredibly hackish, but functional ways. For instance one driver was using two overlapping MTRRs: in the worst case the PCI BAR was 16 MiB but the MMIO region for the device was in the last 4 KiB of the same PCI BAR. You want to avoid write-combining on MMIO regions, but if we use one MTRR for write-combining without affecting the MMIO region we'd end up with 8 MiB of write-combining and lose out on the rest of graphics memory. Using a 16 MiB write-combining MTRR meant we'd write-combine the MMIO region. The implemented hacky MTRR solution was to issue a 16 MiB write-combining MTRR followed by a 4 KiB UC MTRR. There were also two overlapping ioremap calls for this driver. The resolution, in a PAT friendly way, included adding ioremap_uc() upstream, which would set PCD=1, PWT=1 on non-PAT systems and use a PAT value of UC for PAT systems. We used this for the MMIO region; doing this ensures that if you then issue an MTRR on this region the MMIO region would remain unaffected. The framebuffer was also carved out cleanly, and ioremap_wc() used on it. For details refer to:

x86/mm, asm-generic: Add IOMMU ioremap_uc() variant default
drivers/video/fbdev/atyfb: Carve out framebuffer length fudging into a helper
drivers/video/fbdev/atyfb: Clarify ioremap() base and length used
drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC
drivers/video/fbdev/atyfb: Use arch_phys_wc_add() and ioremap_wc()
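
The 8 MiB figure above is just power-of-two arithmetic, and a small userspace sketch makes it easy to check. This is not kernel code: the helper names (largest_pow2_le(), max_wc_without_mmio()) are made up for illustration, and it assumes the simplified rule that one MTRR covers a power-of-two sized range starting at a BAR base aligned to the BAR size.

```c
#include <assert.h>
#include <stdint.h>

#define KiB 1024ULL
#define MiB (1024ULL * KiB)

/* Largest power of two that is <= n. n must be non-zero. */
static uint64_t largest_pow2_le(uint64_t n)
{
	uint64_t p = 1;

	assert(n != 0);
	while (p <= n / 2)
		p *= 2;
	return p;
}

/*
 * Hypothetical helper: size of the largest single write-combining range
 * that starts at the BAR base and stays clear of an MMIO window placed
 * at the very end of the BAR. Assumes one range must be a power of two
 * in size and that the BAR base is aligned to the BAR size.
 */
static uint64_t max_wc_without_mmio(uint64_t bar_size, uint64_t mmio_size)
{
	assert(mmio_size < bar_size);
	return largest_pow2_le(bar_size - mmio_size);
}
```

Calling max_wc_without_mmio(16 * MiB, 4 * KiB) gives 8 MiB, which is why the driver either lost half of its framebuffer to a non-write-combined mapping or had to punch a UC hole into a 16 MiB write-combining MTRR.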

But that's not all... even if all drivers have been converted to never issue MTRR calls directly, the BIOS might still set up MTRRs on bootup, and the kernel has to know about that to avoid conflicts with PAT. More work on this front is therefore needed, but at least the crusade to remove direct access to MTRR was completed on Linux as of v4.3.


A really clean solution to dead code, although that wasn't the only reason this went upstream, came from how IOMMU initialization code is handled with the IOMMU_INIT macros and struct iommu_table_entry. The solution in place had to account for different dependencies between IOMMU code; this dependency map is best explained by a diagram.

         +----[swiotlb *]--+
        /         |         \
       /          |          \
    [GART]     [Calgary]  [Intel VT-d]

Dependencies are annotated, detection routines are made available, and there's a sort routine which makes all this execute in the right order. The full dependency map is handled at run time. To review some of the implementation check out git log -p 0444ad93e..ee1f28, or just check out the code. When this code was proposed, hpa actually suggested that this sort of problem is common enough that perhaps a generic solution could be implemented on Linux, and that the solution developed by the gPXE folks might be a good one to look at. As neat as this is, it still doesn't address all concerns. Expect to see some suggested updates in this area.
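
To make the idea concrete, here is a minimal userspace sketch of a dependency-annotated init table with a run time sort. It is not the kernel's implementation: struct init_entry, the stub detect routines, demo_table, sort_init_table() and run_detection() are all hypothetical names, and the dependency is expressed, as a simplifying assumption, by pointing at the detect routine of the entry that has to come first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry; only loosely inspired by the kernel's idea. */
struct init_entry {
	const char *name;
	int (*detect)(void);	/* non-zero if the feature is present */
	int (*depend)(void);	/* detect routine that must run first, or NULL */
};

/* Stub detection routines; real ones would probe hardware. */
static int detect_swiotlb(void) { return 1; }
static int detect_gart(void)    { return 0; }
static int detect_calgary(void) { return 0; }
static int detect_vtd(void)     { return 1; }

/* Deliberately out of order: every IOMMU entry depends on swiotlb. */
static struct init_entry demo_table[] = {
	{ "GART",       detect_gart,    detect_swiotlb },
	{ "Calgary",    detect_calgary, detect_swiotlb },
	{ "Intel VT-d", detect_vtd,     detect_swiotlb },
	{ "swiotlb",    detect_swiotlb, NULL },
};

/* Is the dependency of e already among the first 'placed' entries? */
static int dep_placed(const struct init_entry *tbl, size_t placed,
		      const struct init_entry *e)
{
	if (!e->depend)
		return 1;
	for (size_t j = 0; j < placed; j++)
		if (tbl[j].detect == e->depend)
			return 1;
	return 0;
}

/*
 * Reorder the table in place so each entry comes after the entry it
 * depends on. Returns 0 on success, -1 on a cycle or a dependency that
 * is not in the table.
 */
static int sort_init_table(struct init_entry *tbl, size_t n)
{
	assert(tbl != NULL);
	for (size_t placed = 0; placed < n; placed++) {
		size_t i = placed;

		while (i < n && !dep_placed(tbl, placed, &tbl[i]))
			i++;
		if (i == n)
			return -1;

		struct init_entry tmp = tbl[placed];
		tbl[placed] = tbl[i];
		tbl[i] = tmp;
	}
	return 0;
}

/* Walk the sorted table in order and count the entries that are detected. */
static size_t run_detection(const struct init_entry *tbl, size_t n)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++)
		if (tbl[i].detect())
			found++;
	return found;
}
```

After sort_init_table(demo_table, 4) the swiotlb entry sits at the front of the table and the three entries that depend on it follow. The point of the pattern is that code for hardware which is never detected simply never gets called, and nobody has to hand-maintain the call order.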

Microcode updates:

A CPU often needs software updates; these are known as CPU microcode updates. If you are using a hypervisor, though, the hypervisor should take care of these updates for you, as a guest should not have to fix real hardware. Additionally, if you do enable a guest to do updates on behalf of a full system you may want to be selective about which guests are allowed to do this. Then there are the run time update considerations. Some CPU microcode updates might disable some CPU ops; if you do this on a hypervisor with code already running, some of that code might break as it assumes those CPU ops are still valid. This could cause unexpected situations for guests. Run time CPU microcode updates after a system has booted should therefore be avoided, and only done if you are 100% certain you can do it and you have full hardware and software vendor support for it. The CPU microcode update must be designed for a run time update. As far as Linux is concerned, we avoid enabling CPU microcode updates by bailing out of the CPU microcode init code if pv_enabled() returns true. This works, but it turns out not to be an ideal solution: pv_enabled() should probably be renamed to something such as pv_legacy(), as it only returns true if you have a legacy PV solution. Expect some updates on this upstream soon. If folks desire run time CPU microcode updates on Xen, work is required on the Xen side to copy the buffer to Xen, scan the buffer for the correct patch, and finally rendezvous all online CPUs in an IPI to apply the patch, holding the processors there until all have completed the patch. I hacked up a version for the hypervisor which just does quiescing by pausing domains; that obviously needs more work, and someone interested should pick it up. Refer to Xen microcode updates for Xen specific documentation or to read the latest notes on developing this for the Xen hypervisor. At this time, it's not clear where KVM keeps this documentation.

Init mismatch issues:

We have a dual entry with x86 and we have to live with that for now, but at times this is overlooked, and it can happen to the best of us. For instance, when Andy Lutomirski added support to shadow CR4 per CPU on the x86-64 init path he forgot to add a respective call for Xen. This caused a crash on all Xen PV guests and dom0. Boris Ostrovsky fixed this for 64-bit PV(H) guests. I'm told code review is supposed to catch these issues, but I'm not satisfied: the fix here was purely reactive. We could and should do better. A perfect example of further complications is when Linux got KASan support, the kernel address sanitizer. Enabling KASan on x86 will crash Xen today, and this issue is not yet fixed. We need a proactive solution. If we could unify init paths, would that help? Would that be welcomed? How could that be possible?

What to do

The purpose of this post is to create awareness of what dead code is, to make you believe it's real and it's important, and to convince you that if we could come up with a clean solution that we could probably re-use for other purposes, it should be welcomed. I'm putting a lot of emphasis on dead code and init mismatch issues because without this post I probably would not be able to talk to anyone about them and expect them to understand what I'm talking about, let alone understand the importance of the issue. The virtualization world is likely not the only place that could use a solution to some of the dead code concerns. I'll soon be posting RFCs for a possible mechanism to help with this. If you want a taste of what this might look like, you can take a peek at the userspace table-init mockup solution that I've implemented. In short, it's a merge of what the gPXE folks implemented with what Konrad worked on for IOMMU initialization, giving us the best of both worlds.

Wednesday, April 01, 2015

God complex - why open models will win

Engineering and science can never be about religion; they are both about trial and error, empirical evidence supporting trials, precision, and formulating the math behind all this. It's really easy to forget this though, especially if you've hired really good engineers / scientists. With good engineers / scientists you might cut corners, or simply expect and assume that you'll always have the best answers possible on board. A good thesis can only be good if it really covers all known ground and provides an in-depth analysis that likely was never considered before. See my article and review of the Big Bang theory for what I mean by the high bar I expect of good science. Because of all this, and with the rapid pace of change in science and technology and in the flow of knowledge and information, I suspect there should be a limit at which closed development models can outpace open development models. Although I have no evidence for this, I believe the reasoning should be relatively trivial to follow. Folks who disagree might find it harder to prove the counter, which leaves me content without having to provide a full proof. I have found that this particular issue in Engineering / Science has been best described by Tim Harford in a TED Talk titled "God Complex", and I highly encourage anyone who might have hesitation about the above "open models outpacing closed models" premise to go watch it. I'll use this premise in this post to argue, as just one example, that open hardware development should outpace closed hardware development models -- just as open software development models very likely already outpace closed proprietary software development models (we can't prove this as we don't have math on private development models). I'll go into the details of my conjecture next and provide a brief guideline for folks who want to test this conjecture on open hardware development.

Engineering is not supposed to be easy, it's fucking hard, and if you have it any other way you're fooling yourself that what you are doing is Engineering. Kernel development is not supposed to be easy either, and considering that on Linux we're engaging openly with the whole world on the largest collaborative development project on the planet, it's no surprise that engineering on Linux has a steeper curve than the average software engineering project. Even though we've prided ourselves on the informality of much of our engineering practice, over time our growing pains have taught us a few principles and best practices that help us both scale and engineer collaboratively more effectively. A few easy to follow examples of this are:

  • The practice of using Subsystem Maintainers, where our software is broken down into components and folks are then put in charge of the upkeep of each component. Linus just pulls the strings of all the maintainers together during the merge window.
  • The development of the Developer Certificate of Origin (DCO), whereby after some legal considerations we realized it's best to throw in some Signed-off-by / provenance guarantees on software, in such a way that it would allow us to keep up our pace of development.
  • A Code of Conflict to enable us to deal with unfortunate extreme mishaps arising from the outright difficult nature of engaging with grumpy overloaded maintainers and the community in the open peer review process.
Many software projects have learned from Linux. The Subsystem Maintainers model is prevalent, although likely not invented on Linux. As I've described in a previous post, the DCO is also already heavily embraced by other projects, and other projects are now encouraged to use it thanks to our effort to separate it from Linux. Many projects have Code of Conflict agreements; that is not unique to Linux. There's one aspect of the Code of Conflict that is only implicit but that I'd now like to make explicit and use as a primary premise for this post. Here is the language I'd like to highlight:
Your code and ideas behind it will be carefully reviewed, often resulting in critique and criticism.  The review will almost always require improvements to the code before it can be included in the kernel.  Know that this happens because everyone involved wants to see the best possible solution for the overall success of Linux.
I'm going to summarize this as: Engineering is hard as fuck, expect people to call you out on your shit. Deal with it, but if you feel we're unreasonable you can tap out. But most importantly: expect your first iteration of an idea to likely not be correct and to require improvements. Even the most seasoned developers should expect this.

Before working for a purely software company I used to work at a hardware company, Atheros, and the role I had was unique, given that Atheros was providing full ASIC silicon designs for 802.11 technologies without requiring any CPU on the devices themselves. This meant that, contrary to most 802.11 devices in the industry, we worked without any firmware: all operations of the device were completely transparent to the device driver. Since I worked on an open device driver, that meant all 802.11 hardware operations were completely open and transparent to the community, whereas device drivers that relied on firmware had hardware operations performed behind the scenes, offloaded to the device's own CPU / proprietary firmware. Before I joined Atheros I used to believe that Atheros had the best 802.11 hardware in the industry. After I joined Atheros, and particularly as peers got hired by other 802.11 silicon companies and we collaborated, I became convinced that it was not just Atheros' unique hardware that made it stand out.

The quality of support for Atheros' 802.11 devices can also be attributed to:

  1. The full ASIC design (not requiring firmware), where hardware issues were punted out to the device driver, which made the devices operate much better than others
  2. A strong community commitment / know-how and engagement
One thing which I'd like to highlight from the above graph is that at times the community was contributing more to the ath9k device driver than Atheros (later known as QCA) was. Both of the above are instrumental for a healthy openly developed device driver, but I cannot stress enough how critical not requiring firmware was to its success. I told folks repeatedly that we should not feel embarrassed about having hardware bugs. We should accept this as part of the nature of hardware design and silicon development. It's the rate at which you can fix these, even if through software workarounds, which will ultimately create the best experience for users. If you have firmware, the pipeline for fixes requires engaging with a team of engineers inside a company, and fixing issues there typically takes a significant amount of time. Without firmware even the community was able to participate in creating fixes for extremely complex issues, and this is extremely important for complex technologies such as 802.11. As we combine more RF technologies and things get more complex we will have no other option but to work and engage with the community; thinking anything contrary to this will make you fumble and fall into the "God complex" trap.

At Atheros, during the good ol' days, we were able to act on the belief that we'd gain more successful contributions and a healthier development model by opening up firmware on other devices where firmware was actually needed. We first tested this with carl9170 and later with ath9k_htc, both of which did require firmware but whose firmware we managed to open source. I believed our efforts to be pivotal, and an engaged open enthusiast reader might wish to gather metrics on carl9170 and ath9k_htc to help evaluate the impact of openness on software quality.

At the last Linux wireless summit that I actively participated in, before joining SUSE, it was made clear that all manufacturers were moving away from full ASIC designs for 802.11 and that all silicon companies were going to be using proprietary firmware. There are a lot of reasons for this, and some of it has to do with the combination of different RF technologies (not just 802.11), but nevertheless the saddest part of all this to me was that the good lessons learned from the success of fully open drivers and open firmware models were not being seriously considered for future 802.11 device drivers and architectures. Part of this is that the above arguments for "goodness" have no direct hard science associated with them; it's why I ended up working towards a hard science for ethical attributes.

Lacking hard science as proof for "goodness" might seem like a bad thing, but it's also a great opportunity. New startups and folks designing new hardware who already "get it", and who have no hard requirements tying them down to legacy archaic business requirements, have a fully open arena for exploration; this is the best situation to be in. Venture capitalism should easily be able to prove my conjecture with a few simple test cases, at least within the realm of open hardware designs. Since existing silicon companies (not startups) might face the dangers of free software, they should consider taking their hoards of unused / closeted / legacy designs and testing new innovative approaches with the community. And then there are the folks who have been perfecting collaborative development models: companies / organizations which have already been perfecting open collaborative development models have much to bring to the table for new startups / business models which perhaps have never explored such things. There's room for a lot of experimentation and trial and error. I'm happy for my conjecture to be disproved given that all this is not about religion, but rather the best fucking engineering possible. I remain optimistic though.

Thursday, March 05, 2015

VMware lawsuit and an Apology to the BSD camp

I started hacking on Linux without any consideration for software licensing. I did it out of pure joy, getting a kick out of seeing hardware work which didn't before and collaborating with an amazing set of folks. Through my years of working on Linux, though, I've somehow stumbled onto the front lines of licensing debacles due to reverse engineering, copyright infringement claims, and later patent considerations. The only way I can explain why I kept working on things despite these debacles is that perhaps most people give up and I guess I just don't. It's been years now since I started working in the community, and in fact for a while I even went on a hippy 'FreeBSD / Linux lets work together kumbaya!' with real technical solutions in place (part I, part II). In this post I'd like to provide some background and explain why I now fully support the GPL on Linux, why I believe it's critical to enforce the GPL on Linux, and why I've given up on working on permissively licensed drivers on Linux. I also write this to explain in detail why I fully support Christoph Hellwig's lawsuit against VMware filed today.

I've gone into detail before about how I first got involved with hacking on Linux just to get my damn wifi to work, later jumped onto the MadWifi project, and so began the 'ath5k wars'. Later, as we put out the ath9k device driver, I also engaged with Adrian Chadd from the FreeBSD camp quite a bit; eventually we ended up becoming coworkers and did our best at ending proprietary drivers for good by working together somehow. To preface this, I had called out to my Linux peers that we should consider simply localizing the GPL and look to work and engage collaboratively with the BSD camp. To this day I stand behind the technical ideas we put out together to share drivers between BSD and Linux -- in the end, however, the pitfalls were what really set this effort back. I'll summarize them as follows:

  • Software teams at companies who do care about proprietary and permissively licensed solutions tend to be super sloppy and in no way motivated to do much work
  • Compared to the number of Linux developers, the BSD camp stood no chance of keeping up with what we were doing or putting out
  • Given the above issues, the folks who really stand to gain from BSD and Linux folks working together on device drivers are simply the proprietary vendors selling proprietary solutions
  • Patents are a wild card, and it's best we have them on our side
  • Proprietary vendors with patent interests will play their cards carefully and you are at their mercy

I've written a trilogy laying out my reasoning on the real dangers of Free Software (GPL, Copyleft), the patents problem, and the evolution of copyleft and business models. With the above problems and the points I made in the trilogy in mind -- here is my apology to the BSD camp: as much as I'd like to help my BSD counterparts, I now consider permissive licenses, especially ones that do not consider patents, brutally archaic, and I do not see a way forward with them. While there might be some ambiguity with GPLv2 and patents, we are at least upholding some more modern collaborative development best practices which should help uplift our community. While evolving copyleft has not been easy (see my notes on why the GPLv3 really failed with kernel folks), we still have the chance to help evolve copyleft in the right way -- openly and with the community. We should be allying ourselves in the community with those companies who are actively engaged in evolving copyleft and the commons for the betterment of the community (hey, SUSE's hiring); when and if companies decide to cut corners -- simply quit and seek to ensure that they meet their fate in a court of law some day.

It seems VMware has done nothing in any way like the work I did to help with the use of permissively licensed drivers on Linux, which would likely be the minimum expected for some coexistence, even with proprietary platforms, without raising any eyebrows. Trust me, it was not easy work, and just above I've declared that I've given up on it and consider it pointless. Despite the best efforts by Conservancy to ask nicely to address the problem, VMware has decided to opt out and play their cards. It seems VMware is trying to cut corners and reap benefits from our ecosystem on Linux in broad daylight. That's a bloody shame. Best of luck to Christoph; I fully support him in this lawsuit against VMware. If you feel the same, I would like to encourage you to donate to Conservancy to support the VMware lawsuit. If you are a Linux kernel developer and share these sentiments, consider joining the loose knit set of kernel developers under Conservancy wishing to seek GPL compliance on Linux; you can email compliance@sfconservancy.org for further information.