Linux asynchronous probe - let's try this again

December 22, 2015

Updated on 2016-01-19 with a description of how the systemd timeout limits the number of devices on a Linux system, and with references to asynchronous work on memory initialization.

Hipster and trendy init systems want to boot really fast. As of v4.2 the Linux kernel sports asynchronous probe support (this fix posted December 19, 2015 is needed for use of the generic async_probe module parameter). This isn't the first time this type of work has been attempted on Linux though: this lwn article claims that a long time ago some folks tried to enable asynchronous probe and that ultimately it was reverted due to a large number of issues. Among a few things, one major difference with the new solution is that it's opt-in: userspace or drivers must specifically request that it be used on a driver. We also support blacklisting of asynchronous behavior by annotating that a driver requires synchronous probe. All this enables new shiny hipster userspace, while remaining compatible with old userspace and its expectations. At the 2015 Kernel Summit it became apparent a few folks still had questions over this, so I decided to write this to help recap why the work was done, its caveats, and its relationship with returning -EPROBE_DEFER from your probe routine to make use of the kernel's deferred probe mechanism, to help with testing and productizing with asynchronous probe, and to explain a bit of the short term and long term road-map. This post also collects a bit of the history of what gave rise to Linux asynchronous probe, which I think we can use as a small educational experience in learning how we can better evolve systemd in the community.


First to be clear -- asynchronous probe isn't supposed to magically make your kernel boot faster. It should however help if you happen to have any driver which for whatever reason does a lot of work in its probe routine. Even if that's not the case, at times using asynchronous probe can shave down kernel boot time, even if minimally. Other times it may have no impact at all, or perhaps you may see a small increase for any number of reasons. A clear but not obvious gain is the increase in the number of devices a device driver can support; this is explained below. Since this is a new feature we simply don't have enough metrics and enough test coverage yet to determine how widely helpful it can be, or what issues could creep up; however it was clear some folks wanted and needed it. More importantly, using it can also get driver developers and subsystem maintainers thinking about different asynchronous behavior considerations in the kernel, which long term should help us in the community. An example: although asynchronous probe should help with long probes, we recently determined that you should by no means consider it a solution for your probe routine if your driver needs to load firmware on probe and you have experienced race issues between this and the filesystem being mounted -- that problem needs to be resolved separately (see this firmware_class feature enhancement wiki and this common kernel file loader wiki page for more details and ideas). In lieu of concrete bullet proof solutions for that problem you might be tempted to think asynchronous probe could help, and you'd be correct, but you should be aware that this is not a rock solid solution to such problems. It'd be a hack, and this is why it's incorrect to use asynchronous probe if you're trying to use it to fix that problem.
Another example is how this raises the question of where else we should be using asynchronous mechanisms, and how we resolve any possible run time dependency issues.

Asynchronous probe support was added for a few reasons, the last three listed here being the major driving factors for getting this developed and merged upstream.

Over time there's been a general interest in reducing the kernel's boot time
A long time ago in a galaxy far far away... systemd made a really well-intentioned but ultimately incorrect assumption that device driver initialization should take less than 30 seconds; to be more specific, that the driver's init routine should not take more than 30 seconds. Even as issues started to creep up, quite a few systemd and kernel developers vocalized strong support for it being a reasonable timeout value. Some users were really upset over this though -- driver loading was being killed after 30 seconds, preventing some drivers from loading completely, and in the worst cases, if the driver at fault was a storage driver, you would not even be able to boot Linux. Because of the strong agreement in both camps there were no exceptions to this rule, and the consensus seemed to be that a lot of drivers should simply be fixed. One puzzle was that issues over drivers being killed due to the timeout were only reported circa 2014, but the timeout had been in place in systemd for a long time. The reason for this was that commit 786235eeba0 by Tetsuo Handa ("kthread: make kthread_create() killable") enabled kthread_create() to be killed; this was done in particular to enable the out of memory killer to kill these types of threads (refer to this lwn article for more details). Prior to this kernel change, the 30 second timeout was never an issue for systemd users given that the SIGKILL signal was never actually respected for these types of threads. Even though the Linux kernel now has asynchronous probe support, the original systemd 30 second timeout caused enough headaches for users that on July 29, 2014 Hannes Reinecke ended up merging a way to enable Linux distributions to override the timeout through the command line, refer to systemd commit 9719859c07aa13 ("udevd: add --event-timeout commandline option").
That didn't seem to be enough to help users, so on August 30, 2014 Kay Sievers bumped the timeout to 60 seconds via systemd commit 2e92633dbae ("udev: bump event timeout to 60 seconds"). In the end though, on September 10, 2014 Tom Gundersen modified the default timeout to 180 seconds via systemd commit b5338a19864a ("udev: timeout - increase timeout"). The purpose of the timeout, as per the commit log message, now is "to make sure that nothing stays around forever". To help capture in logs possible faulty drivers (or any jobs dispatched), Tom Gundersen also made systemd spit out a warning after 1/3 of the timeout value, before killing the job, via systemd commit 671174136525ddf2 ("udev: timeout - warn after a third of the timeout before killing").
It turns out though that... Linux batches calling a driver's init routine and immediately after that its probe routine, synchronously, so naturally any delays on probe add to the total load time as well. So the systemd timeout is in effect for the run time combination of both init and probe of a device driver. If we provide a way for userspace to ask the driver core to detach these and call probe asynchronously we'd be giving systemd what it thought, and a few kernel developers thought, was actually in place.
A delay on your probe means delaying user experience at boot time. If you know off hand that your driver might take a while to load, preemptively annotating this on your driver can mean giving users a better user experience. Dmitry Torokhov ran into this issue while working on productizing a solution for a popular company where fast boot and a good user experience were critical.
It turns out that... the systemd timeout on the kmod loader (loading modules) affects not only the combination of init + probe of a device driver. Since the kernel serially probes all devices in the same code path, if your driver probes more than one device the amount of time taken to load it will be init time + (number of devices * probe time for each device). What this means is that the systemd timeout also places an upper bound on the number of devices you can use on a system. This is bound by the driver's init and probe time and, ignoring init time, can be computed as follows:

number_devices =          systemd_timeout
                  -------------------------------------
                      max known probe time for driver
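
The arithmetic above can be sketched in a few lines of C. This is only an illustration, not kernel or systemd code: the helper names serial_load_time() and max_devices() are made up for this post, all values are in seconds, and unlike the simplified formula the helper also subtracts the init time (pass 0 for it to get the formula as written).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time a synchronous module load takes when one driver serially probes
 * ndevs devices: init time + (number of devices * probe time per device).
 */
static uint64_t serial_load_time(uint64_t init_s, uint64_t probe_s,
				 uint64_t ndevs)
{
	return init_s + ndevs * probe_s;
}

/*
 * Largest number of devices whose serial load still fits within the
 * timeout. With init_s == 0 this is timeout / max known probe time.
 */
static uint64_t max_devices(uint64_t timeout_s, uint64_t init_s,
			    uint64_t probe_s)
{
	assert(probe_s > 0); /* a zero probe time would mean no bound at all */

	if (timeout_s < init_s)
		return 0; /* init alone already exceeds the timeout */

	return (timeout_s - init_s) / probe_s;
}
```

For example, with the 180 second default and a hypothetical driver that needs 30 seconds to probe each device, max_devices(180, 0, 30) gives 6 devices before the load risks being killed.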

Drivers can be built-in to the kernel or built as modules so you can load them after the kernel boots as independent and self contained objects. It turns out that in practice striving towards having all modules be probed asynchronously tends to work pretty well, whereas having all built-in drivers probe asynchronously will likely crash your kernel with a high degree of certainty. This latter issue has to do with the fact that as the kernel boots certain assumptions may be made which are not satisfied early on, and there's currently no easy way to order this well. It's similar to why the deferred probe mechanism in the kernel was added -- the kernel doesn't always have dependency information well sorted out. But fret not, future work should help with this, and such work should help curtail uses of deferred probing and enable more broad asynchronous probe use.

If you are in control of both hardware and software, that is you have engineers you can pay to productize a solution, you could likely engineer a solution to vet and ensure boot will happen properly and in order for both built-in drivers and modules on your kernel. There is no easy way to do this, and it is difficult to estimate the amount of work this requires for a device, but if you want to try it you can use this out of tree debug-async patch and then use the kernel parameters documented there, which I summarize here. Note that using either of these will taint your kernel.

__DEBUG__kernel_force_builtin_async_probe - async probe all built-in drivers
__DEBUG__kernel_force_modules_async_probe - async probe all modules

If you don't have the luxury of having dedicated hardware and software engineers you could at the very least enable all modules to probe asynchronously, hope for the best, and report any issues found. It's after all what systemd, and a lot of developers (many kernel developers included), originally thought was happening, so naturally bug reports to the driver maintainer are welcome if any issues occur. Soon you may see Linux distributions enabling asynchronous probe by default for all modules. The way I'd implement this on systemd is to let a Linux distribution opt in to async_probe for specific kernels. Given that a fix is needed for using the generic async_probe module parameter, though, one should only use it on kernels where this fix has been merged. This makes it tricky to detect whether the module parameter is properly supported or not: enabling it and booting on an older kernel might cause a crash.

Getting drivers to load correctly is just one step. Remember that prior to asynchronous probe some userspace expected some device functionality to be available immediately after loading a driver. With asynchronous probe that is no longer the case: userspace must be vetted and tested to ensure it does not rely on synchronous loading of the drivers.
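
To make that concrete, here is a crude userspace sketch of not assuming a device is there the moment the module load returns. It is an assumption-laden illustration: wait_for_path() is a made-up helper, the 10 ms polling step is arbitrary, and polling the filesystem is only a stand-in for what well-behaved userspace would more likely do, which is react to udev events.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Poll until `path` exists or roughly `timeout_ms` milliseconds have
 * passed. Returns 0 if the path showed up, -1 on timeout.
 */
static int wait_for_path(const char *path, int timeout_ms)
{
	const int step_ms = 10; /* arbitrary polling interval */
	struct stat st;
	int waited_ms = 0;

	assert(path != NULL);

	for (;;) {
		if (stat(path, &st) == 0)
			return 0;
		if (waited_ms >= timeout_ms)
			return -1;
		/* poll() with no descriptors is used purely as a ms sleep */
		poll(NULL, 0, step_ms);
		waited_ms += step_ms;
	}
}
```

A caller would load the module, then call something like wait_for_path() on whatever device node it needs (the node name depends entirely on your driver) with a timeout it can live with, and handle the -1 case instead of crashing.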

If you're a driver developer and know that your driver takes a while to probe, you should be aware that it can delay boot / user experience, so you likely should consider annotating in your driver's source that it prefers asynchronous probe. You can do so as follows:

static struct pci_driver foo_pci_driver = {
      ...
      .driver.probe_type = PROBE_PREFER_ASYNCHRONOUS,
};

An alternative (provided you have this fix merged) is to pass the generic "async_probe" module parameter to the module you want to load, for instance:

modprobe cxgb4 async_probe

Sadly a few drivers cannot work with asynchronous probe at all today, so if after testing your driver poops out you should annotate this sort of hard incompatibility. You can do so as follows:

static struct pci_driver foo_pci_driver = {
       ...
       .driver.probe_type = PROBE_FORCE_SYNCHRONOUS,
};
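
To summarize how the two annotations and the async_probe module parameter interact, here is a small userspace model of the decision as I understand it. It is a sketch, not the driver core's source: the enum values mirror the kernel's probe_type names, but struct model_driver and model_allows_async_probing() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins mirroring the kernel's probe_type values. */
enum probe_type {
	PROBE_DEFAULT_STRATEGY,
	PROBE_PREFER_ASYNCHRONOUS,
	PROBE_FORCE_SYNCHRONOUS,
};

/* Hypothetical, trimmed-down driver description for this model only. */
struct model_driver {
	enum probe_type probe_type;
	bool async_probe_param; /* "async_probe" was passed at module load */
};

/* Returns true if this driver's probe would be run asynchronously. */
static bool model_allows_async_probing(const struct model_driver *drv)
{
	assert(drv != NULL);

	switch (drv->probe_type) {
	case PROBE_PREFER_ASYNCHRONOUS:
		return true;  /* the driver opted in itself */
	case PROBE_FORCE_SYNCHRONOUS:
		return false; /* blacklisted, the module parameter is ignored */
	default:
		/* no annotation: userspace decides via async_probe */
		return drv->async_probe_param;
	}
}
```

The point of the model is the precedence: a driver annotation always wins, and the module parameter only matters for drivers that express no preference.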

It should be made clear that this sort of incompatibility should likely be seen more as an issue -- if your driver fails at using asynchronous probe, chances are that the cause is some subtle architectural design flaw in the driver or its dependencies. Fixing it may not necessarily be easy, and it's precisely for this reason that we have such a flag to force synchronous probe. Our hope though is that with time we can phase these issues out.

Even if we manage to get all drivers working with asynchronous probe we cannot remove synchronous probe, as old userspace exists which relies on such behavior; removing synchronous probe support would break old userspace. What we can strive for long term though is to enable new userspace as best as possible and deal with all asynchronous issues as they come up, slowly. This will take time and serious effort. Over time you should be seeing more work in this area across subsystems, internals, and perhaps even architecture work. Just to give you a taste and an example of this type of work, you can review the recent asynchronous work by Mel Gorman on memory initialization through commits 1e8ce83cd17fd0f549a7ad145ddd2bfcdd7dfe37..0e1cc95b4cc7293bb7b39175035e7f7e45c90977; please note these also have a few follow-on fixes. Lastly, obviously some systemd design decisions should be taken with a grain of salt, but they seem to be very well-intentioned. We could use a bit more open and objective communication and design review between more kernel developers and systemd developers. The smoother this gets, the smoother the experience we provide to users should be.