This is part I; for part II, see "Xen and the Linux x86 zero page".
"Code that should not run should never run"
That code which should not run should never run sounds stupid and obvious, but it turns out that it's easier said than done on very large software projects, particularly the Linux kernel. One term for such code is "dead code". The amount of dead code in Linux has increased over the years because Linux distributions want a single Linux kernel binary to work in different run time environments. The size and complexity of certain features increases the difficulty of proving that dead code never runs. Using a single kernel binary is desirable because the alternative is a different Linux kernel binary package for each major custom run time environment we wish to support, which among other things means testing and validating multiple kernels. A really complex modern example, which this post focuses on, is the dead code made possible by how we handle support for different hypervisors in the Linux kernel. The purpose of this post is to create awareness about the problem. Clean resolutions have already been integrated upstream for a few features, and you should be seeing a few more soon.
Back in the day you needed a custom kernel binary if you wanted to use the kernel with specific hypervisor support. To solve this, the Linux kernel paravirtualization operations, aka paravirt_ops, or even shorter just pv_ops, were chosen as the mechanism to let different hypervisor solutions co-exist in a single kernel binary. Although pv_ops was welcomed with open arms back then as a reasonable compromise, these days just the mention of "pv_ops" will make any kernel developer cringe. Given the praise for pv_ops back in the day it's perhaps confusing why people hate it so much now, so this deserves some attention. Below are a few key reasons why developers hate pv_ops today.
- pv_ops was designed at a time when hardware assisted virtualization solutions were relatively new, and it remained unclear how fully paravirtualized solutions would compare. KVM is a hypervisor solution that requires hardware assisted virtualization. These days, even originally fully paravirtualized hypervisor solutions such as the Xen hypervisor have integrated support for the hardware virtualization extensions put out by several hardware vendors. This makes it difficult to name hypervisors that are no longer "fully paravirtualized"; the different possibilities of what could be paravirtualized and what could be dealt with by hardware have given rise to a slew of different types of paravirtualized guests. For instance, Xen now has PV, HVM and PVH; check out the virtualization spectrum page for a clarification of how each of these varies. What remains clear though is that hardware assisted virtualization features have been welcomed, and in the future you should count on all new systems running virtualization to take advantage of them. In the end Xen PVH will provide that sweet spot for the best mixture of "paravirtualization" and hardware virtualization. Architectures which needed hypervisor virtualization support developed after hardware assisted virtualization solutions were in place can support different hypervisors without pv_ops. Such is the case for ARM, which supports both KVM and Xen on ARM. In this light, pv_ops is in a way a thing of the past. If Xen slowly deprecates and finally removes fully paravirtualized PV support from the Linux kernel, Konrad has noted that at the very least we could deprecate the pv_ops MMU components.
- Although pv_ops was conceived as an architecture agnostic solution to support different hypervisors, x86 is now the only remaining architecture using it. Hardware assisted virtualization solutions are now common, evidence shows you can support different hypervisors cleanly without pv_ops, and once Xen support on ia64 was deprecated and removed, pv_ops was removed from ia64 as well.
- Collateral: changes to pv_ops can cause regressions and can impact code for all x86-64 kernel solutions, so kernel developers are extremely cautious about making additions or extensions, and even about adding new users. To what extent do we not want extensions to pv_ops? Well, Rusty Russell wrote the lguest hypervisor and launcher code, not only to demo pv_ops but also to set a sane example of how folks should write hypervisors for Linux using pv_ops. Rusty wrote this with only 32-bit support though. Although there has been interest in developing 64-bit support for lguest, it's simply not welcomed, for at least one of the reasons stated above; as per hpa: "extending pv_ops is a permanent tax on future development". With the other reasons listed above, this is even more so. If you want to write a demo hypervisor with 64-bit support on x86, the approach you could take is to write it with all the fancy new hardware virtualization support and avoid pv_ops as much as is humanly possible.
So pv_ops was originally the solution put in place to help support different hypervisors on Linux through an architecture agnostic solution. These days, provided we can phase out full Xen PV support, we should strive to keep only what we need to provide support for Xen PVH and the other hardware assisted hypervisors.
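To make the mechanism a bit more concrete, here is a minimal userspace sketch of the general idea behind a pv_ops-style table: callers go through a table of function pointers that starts out with native defaults and is overridden once at boot by a hypervisor-specific entry path. This is not the kernel's code. Every name below (`demo_cpu_ops`, `demo_boot_paravirt()`, the `hyper_*` helpers and the counters) is hypothetical, and the sketch leaves out everything else the real thing does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a pv_ops-style table. */
struct demo_cpu_ops {
	const char *name;
	unsigned long (*read_cr4)(void);
	void (*write_cr4)(unsigned long val);
};

unsigned long native_cr4;	/* pretend hardware register */
unsigned long hyper_cr4;	/* pretend hypervisor-managed copy */
unsigned int hypercalls;	/* how often the "hypervisor" was asked */

static unsigned long native_read_cr4(void) { return native_cr4; }
static void native_write_cr4(unsigned long val) { native_cr4 = val; }

static unsigned long hyper_read_cr4(void) { hypercalls++; return hyper_cr4; }
static void hyper_write_cr4(unsigned long val) { hypercalls++; hyper_cr4 = val; }

/* The single binary always starts out with the native implementation. */
struct demo_cpu_ops demo_ops = { "native", native_read_cr4, native_write_cr4 };

/* A paravirtualized entry path would override the table once, early on. */
void demo_boot_paravirt(void)
{
	demo_ops.name = "paravirt";
	demo_ops.read_cr4 = hyper_read_cr4;
	demo_ops.write_cr4 = hyper_write_cr4;
}

/* Callers never care which environment they run on. */
unsigned long demo_set_cr4_bits(unsigned long bits)
{
	unsigned long val;

	assert(demo_ops.read_cr4 && demo_ops.write_cr4);
	val = demo_ops.read_cr4() | bits;
	demo_ops.write_cr4(val);
	return val;
}
```

Both implementations stay in the binary no matter which one the table points to. That is convenient for distributions, and it is also why every extension to such a table is paid for by everyone.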
Bare metal, KVM, Xen HVM                      Xen PV / dom0
      startup_64()                             startup_xen()
             \                                     /
     x86_64_start_kernel()                 xen_start_kernel()
                          \               /
                     x86_64_start_reservations()
                                   |
                              start_kernel()
                              [   ...        ]
                              [ setup_arch() ]
                              [   ...        ]
                                  init
Although this is a small difference, it can actually have a huge impact on possible "dead code". You see, prior to pv_ops different binaries were compiled, and features and solutions which you knew you would not need could simply be negated via Kconfig. These negations were not done upstream; they were only implemented and integrated on SUSE kernels, as SUSE was perhaps the only enterprise Linux distribution fully supporting Xen. Doing these negations ensured that code we determined should never run never got compiled in. Although this Kconfig solution was never embraced upstream, that doesn't mean the issue didn't exist upstream. Quite the contrary, it obviously did; there was just no clean proposed solution to the problem, and frankly no one cared too much about resolving it properly. However, an implicit consequence of embracing pv_ops and supporting different hypervisors with one binary is that we're now forced to have large chunks of code always enabled in the Linux kernel, some of which we know should not run once we know which path we're taking on the init tree above. Code cannot be compiled out, as our differences are now handled at run time. Prior to pv_ops the Kconfig solution was used to negate features that should not run on Xen, so issues would come up at compile time and could be resolved there. This Kconfig solution was in no way a proactive solution, but it's how Xen support on SUSE kernels was managed. Using pv_ops means we need this resolved through alternative, upstream friendly means.
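The difference between the two approaches can be sketched in a few lines of userspace C. This is only an illustration under stated assumptions: `DEMO_CONFIG_FEATURE` stands in for a Kconfig symbol, and `demo_env`, `demo_feature_init()` and the two init functions are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical build-time switch standing in for a Kconfig symbol. A kernel
 * built for one specific environment could set this to 0 and the guarded
 * call below would not even be compiled in.
 */
#ifndef DEMO_CONFIG_FEATURE
#define DEMO_CONFIG_FEATURE 0
#endif

enum demo_env { DEMO_ENV_BARE_METAL, DEMO_ENV_PV_GUEST };

int demo_feature_runs;	/* counts how often the feature code executed */

static void demo_feature_init(void)
{
	demo_feature_runs++;
}

/* Compile-time negation: nothing to prove, the call is simply not there. */
void demo_init_compile_time(void)
{
#if DEMO_CONFIG_FEATURE
	demo_feature_init();
#endif
}

/*
 * Single binary: the feature is always built in, so each feature needs a
 * hand-written run time guard. Forgetting one means dead code runs.
 */
bool demo_init_run_time(enum demo_env env)
{
	assert(env == DEMO_ENV_BARE_METAL || env == DEMO_ENV_PV_GUEST);
	if (env == DEMO_ENV_PV_GUEST)
		return false;
	demo_feature_init();
	return true;
}
```

With the build-time switch a mistake shows up when compiling. With the run time guard a mistake only shows up when the wrong code actually executes in the wrong environment.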
Next are just a few examples of dead code concerns I have looked into, but please note that there are more. I also explain a few of these. Towards the end I explain what I'm working on to address some of these dead code concerns. Since I hope to have convinced you that people hate pv_ops, the challenge here is to come up with a really clean generic solution that 1) does not extend pv_ops, and 2) could also likely be repurposed for other areas of the kernel.
- IOMMU - initialization (resolved cleanly), IOMMU API calls, IOMMU multifunction device conflict, exposed IOMMU ACPI tables (Intel VT-d)
- Microcode updates - both early init and changes at run time
- cr4 shadow
So both dead code concerns and init mismatch issues can break things, sometimes really, really badly. Some of the solutions in place today and some that will be developed are what I like to refer to as paravirtualization yielding solutions. When reviewing some of these issues below, keep in mind that this is essentially what we're doing. It should help you understand why we're doing what we're doing, or why we need some more work in certain areas of the kernel.
Death to MTRR:
x86/mm, asm-generic: Add ioremap_uc() variant default
drivers/video/fbdev/atyfb: Carve out framebuffer length fudging into a helper
drivers/video/fbdev/atyfb: Clarify ioremap() base and length used
drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC
drivers/video/fbdev/atyfb: Use arch_phys_wc_add() and ioremap_wc()
But that's not all... even if all drivers have been converted over to never issue MTRR calls directly, the BIOS might still set up MTRRs on bootup, and the kernel has to know about that to avoid conflicts with PAT. More work on this front is therefore needed, but at least the crusade to remove direct access to MTRR was completed on Linux as of v4.3.
            [xen-swiotlb]
                 |
         +----[swiotlb *]--+
        /         |         \
       /          |          \
    [GART]     [Calgary]  [Intel VT-d]
     /
    /
 [AMD-Vi]
Dependencies are annotated, detection routines are made available, and there's a sort routine which makes these execute in the right order. The full dependency map is handled at run time. To review some of the implementation check out git log -p 0444ad93e..ee1f28, and just check out the code. When this code was proposed, hpa had actually suggested that this sort of problem is common enough that perhaps a generic solution could be implemented on Linux, and that the solution developed by the gPXE folks might be a good one to look at. As neat as this was, it still doesn't address all concerns though. Expect to see some suggested updates in this area.
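As a rough illustration of the idea (annotated dependencies, detection routines and a sort that fixes the execution order at run time), here is a small userspace sketch. It is not the kernel's implementation. The structure layout, the `demo_*` names, the stub detection routines and the simple selection-style ordering are all assumptions made for this example; only the dependency relationships are taken from the diagram above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*demo_detect_fn)(void);

/* Hypothetical table entry: a detection routine plus the one it depends on. */
struct demo_iommu_entry {
	const char *name;
	demo_detect_fn detect;
	demo_detect_fn depend;	/* must have run before detect, or NULL */
};

#define DEMO_MAX_LOG 16
const char *demo_log[DEMO_MAX_LOG];	/* order in which detection ran */
size_t demo_log_len;

static int demo_note(const char *name)
{
	assert(demo_log_len < DEMO_MAX_LOG);
	demo_log[demo_log_len++] = name;
	return 0;
}

/* Stub detection routines; real ones would probe hardware or firmware. */
static int detect_xen_swiotlb(void) { return demo_note("xen-swiotlb"); }
static int detect_swiotlb(void) { return demo_note("swiotlb"); }
static int detect_gart(void) { return demo_note("GART"); }
static int detect_calgary(void) { return demo_note("Calgary"); }
static int detect_vtd(void) { return demo_note("Intel VT-d"); }
static int detect_amd_vi(void) { return demo_note("AMD-Vi"); }

/* Deliberately listed out of order; the sort has to fix this up. */
struct demo_iommu_entry demo_table[] = {
	{ "AMD-Vi", detect_amd_vi, detect_gart },
	{ "Intel VT-d", detect_vtd, detect_swiotlb },
	{ "GART", detect_gart, detect_swiotlb },
	{ "swiotlb", detect_swiotlb, detect_xen_swiotlb },
	{ "Calgary", detect_calgary, detect_swiotlb },
	{ "xen-swiotlb", detect_xen_swiotlb, NULL },
};

/* Is the entry providing fn already among the first 'placed' entries? */
static int demo_placed(const struct demo_iommu_entry *tbl, size_t placed,
		       demo_detect_fn fn)
{
	for (size_t i = 0; i < placed; i++)
		if (tbl[i].detect == fn)
			return 1;
	return 0;
}

/* Order entries so each one comes after its dependency; -1 if impossible. */
int demo_sort_table(struct demo_iommu_entry *tbl, size_t n)
{
	for (size_t pos = 0; pos < n; pos++) {
		size_t k;

		for (k = pos; k < n; k++)
			if (!tbl[k].depend || demo_placed(tbl, pos, tbl[k].depend))
				break;
		if (k == n)
			return -1;	/* cycle or missing dependency */

		struct demo_iommu_entry tmp = tbl[pos];
		tbl[pos] = tbl[k];
		tbl[k] = tmp;
	}
	return 0;
}

/* Sort the table, then run every detection routine in the sorted order. */
int demo_boot(void)
{
	size_t n = sizeof(demo_table) / sizeof(demo_table[0]);

	if (demo_sort_table(demo_table, n) != 0)
		return -1;
	for (size_t i = 0; i < n; i++)
		demo_table[i].detect();
	return 0;
}

/* Position of a name in the run log, or -1 if it never ran. */
int demo_log_index(const char *name)
{
	for (size_t i = 0; i < demo_log_len; i++)
		if (strcmp(demo_log[i], name) == 0)
			return (int)i;
	return -1;
}
```

Calling `demo_boot()` once runs the stubs so that xen-swiotlb comes before swiotlb, swiotlb before GART, Calgary and Intel VT-d, and GART before AMD-Vi, regardless of how the table was written down.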
Init mismatch issues:
We have a dual entry on x86 and we have to live with that now, but at times this is overlooked, and it can happen to the best of us. For instance, when Andy Lutomirski added support to shadow CR4 per CPU on the x86-64 init path, he forgot to add a respective call for Xen. This caused a crash on all Xen PV guests and dom0. Boris Ostrovsky fixed this for 64-bit PV(H) guests. I'm told code review is supposed to catch these issues, but I'm not satisfied: the fix here was purely reactive. We could and should do better. A perfect example of further complications is when Linux got KASan support, the kernel address sanitizer. Enabling KASan on x86 will crash Xen today, and this issue is not yet fixed. We need a proactive solution. If we could unify init paths, would that help? Would that be welcomed? How could that be possible?
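The class of bug is easy to reproduce in miniature. The userspace sketch below models two entry paths that both end up in shared code which trusts a cached "shadow" of a register. It is a toy model under loud assumptions: all `demo_*` names and values are hypothetical, the real failure was a crash rather than lost register bits, and the shared helper at the end is just one possible direction, not a description of any fix that was merged.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_CR4_BOOT 0x20UL	/* pretend value the CPU booted with */
#define DEMO_CR4_NEW  0x04UL	/* pretend feature bit enabled during boot */

unsigned long demo_hw_cr4;	/* pretend hardware register */
unsigned long demo_cr4_shadow;	/* cached copy the shared code relies on */
bool demo_shadow_valid;

void demo_reset(void)
{
	demo_hw_cr4 = DEMO_CR4_BOOT;
	demo_cr4_shadow = 0;
	demo_shadow_valid = false;
}

static void demo_cr4_init_shadow(void)
{
	demo_cr4_shadow = demo_hw_cr4;
	demo_shadow_valid = true;
}

/* Shared code reached from every entry path; it trusts the shadow. */
static unsigned long demo_start_kernel(void)
{
	demo_cr4_shadow |= DEMO_CR4_NEW;
	demo_hw_cr4 = demo_cr4_shadow;
	return demo_hw_cr4;
}

/* The path the new feature was developed and tested on. */
unsigned long demo_entry_native(void)
{
	demo_cr4_init_shadow();
	assert(demo_shadow_valid);
	return demo_start_kernel();
}

/* The second path: nobody remembered to add the matching call here. */
unsigned long demo_entry_pv_buggy(void)
{
	return demo_start_kernel();
}

/* One possible direction: setup every path needs lives in one helper. */
static void demo_common_early_init(void)
{
	demo_cr4_init_shadow();
}

unsigned long demo_entry_pv_fixed(void)
{
	demo_common_early_init();
	return demo_start_kernel();
}
```

After `demo_reset()`, the native and fixed paths both end up with the boot bits preserved, while the buggy path silently drops them. Nothing in the build warns about the missing call, which is why catching this by review alone is fragile.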
What to do
The purpose of this post is to create awareness of what dead code is, and to make you believe it's real, it's important, and that if we could come up with a clean solution that we could probably re-use for other purposes, it should be welcomed. I'm putting a lot of emphasis on dead code and init mismatch issues because without this post I probably would not be able to talk to anyone about them and expect them to understand what I'm talking about, let alone understand the importance of the issue. The virtualization world is likely not the only place that could use a solution to some of these dead code concerns. I'll soon be posting RFCs for a possible mechanism to help with this. If you want a taste of what this might look like, you can take a peek at the userspace table-init mockup solution that I've implemented. In short, it's a merge of what the gPXE folks implemented with what Konrad worked on for IOMMU initialization, giving us the best of both worlds.