Thursday, December 10, 2015

Avoiding dead code: pv_ops is not the silver bullet


This is part I - for part II - see "Xen and the Linux x86 zero page"

"Code that should not run should never run"



The fact that code that should not run should never run seems stupid and obvious, but it turns out it's actually easier said than done on very large software projects, particularly the Linux kernel. One term for this is "dead code". The amount of dead code in Linux has increased over the years due to the desire by Linux distributions to have a single kernel binary work across different run-time environments. The size and complexity of certain features increases the difficulty of proving that dead code never runs. Using a single kernel binary is desirable given that the alternative is different Linux kernel binary packages for each major custom run-time environment, which among other things means testing and validating multiple kernels. A really complex modern example, which this post will focus on, is the dead code that is possible as a consequence of how we handle support for different hypervisors in the Linux kernel. The purpose of this post is to create awareness of the problem; clean resolutions to some of these problems have already been integrated upstream for a few features, and you should be seeing a few more soon.


Back in the day you needed a custom kernel binary if you wanted a kernel with support for a specific hypervisor. To solve this, the Linux kernel paravirtualization operations, aka paravirt_ops, or even shorter just pv_ops, were chosen as the mechanism to enable different hypervisor solutions to co-exist with a single kernel binary. Although pv_ops was welcomed with open arms back in the day as a reasonable compromise, these days just the mention of "pv_ops" to any kernel developer will cause a cringe. Given the praise for pv_ops back then, it is perhaps confusing why people hate it so much now, so this deserves some attention. Below are a few key reasons why developers hate pv_ops today.

  • pv_ops was designed at a time when hardware-assisted virtualization solutions were relatively new, and it remained unclear how fully paravirtualized solutions would compare. KVM is a hypervisor solution that requires hardware-assisted virtualization. These days, even originally fully paravirtualized hypervisors such as the Xen hypervisor have integrated support for the hardware virtualization extensions put out by several hardware vendors. This makes it difficult to label hypervisors that are no longer "fully paravirtualized"; the different possibilities of what can be paravirtualized versus handled by hardware have given rise to a slew of different types of paravirtualized guests. For instance, Xen now has PV, HVM, and PVH; check out the virtualization spectrum page for a clarification of how each of these vary. What remains clear is that hardware-assisted virtualization features have been welcomed, and in the future you should count on all new systems running virtualization to take advantage of them. In the end Xen PVH will provide that sweet spot for the best mixture of "paravirtualization" and hardware virtualization. Architectures whose hypervisor support was developed after hardware-assisted virtualization solutions were in place can support different hypervisors without pv_ops. Such is the case for ARM, which supports both KVM and Xen on ARM. In this light, pv_ops is in a way a thing of the past. If Xen slowly deprecates and finally removes fully paravirtualized PV support from the Linux kernel, Konrad has noted that at the very least we could deprecate the pv_ops MMU components.
  • Although pv_ops was conceived as an architecture-agnostic solution for supporting different hypervisors, hardware-assisted virtualization is now common and evidence shows you can support different hypervisors cleanly without pv_ops. Xen support on ia64 was deprecated and removed, and with it pv_ops on ia64, so x86 is now the only remaining architecture using pv_ops.
  • Collateral: changes to pv_ops can cause regressions and can impact code for all x86-64 kernel solutions; as such, kernel developers are extremely cautious about making additions and extensions, or even adding new users. To what extent do we not want extensions to pv_ops? Well, Rusty Russell wrote the lguest hypervisor and launcher code, not only to demo pv_ops but also to set some sanity on how folks should write hypervisors for Linux using pv_ops. Rusty wrote this with only 32-bit support though. Although there has been interest in developing 64-bit support for lguest, it's simply not welcomed, for at least one of the reasons stated above -- as per hpa: "extending pv_ops is a permanent tax on future development". With the other reasons listed above, this is even more so. If you want to write a demo hypervisor with 64-bit support on x86, the approach to take is to write it with all the fancy new hardware virtualization support and avoid pv_ops as much as is humanly possible.

So pv_ops was originally the solution put in place to help support different hypervisors on Linux through an architecture-agnostic mechanism. These days, provided we can phase out full Xen PV support, we should strive to keep only what we need to provide support for Xen PVH and the other hardware-assisted hypervisors.
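To make the indirection concrete, here is a minimal userspace sketch of the pv_ops idea: one binary carries a table of function pointers, filled with native defaults, that the hypervisor-specific entry path overrides at early boot. The struct and function names here are illustrative stand-ins, not the kernel's actual pv_ops layout.

```c
/* Sketch of pv_ops-style indirection: a table of function pointers,
 * defaulting to native operations, patched by the hypervisor-specific
 * entry path before generic code runs. All names are hypothetical. */
struct pv_cpu_ops_sketch {
	unsigned long (*read_cr2)(void);
};

static unsigned long native_read_cr2(void) { return 0x1000UL; }
static unsigned long xen_read_cr2(void)    { return 0x2000UL; }

/* One binary: defaults to the native implementation. */
static struct pv_cpu_ops_sketch pv_ops_sketch = {
	.read_cr2 = native_read_cr2,
};

/* A Xen-style entry path would override the table at early boot. */
static void xen_patch_ops_sketch(void)
{
	pv_ops_sketch.read_cr2 = xen_read_cr2;
}
```

The price of this flexibility is that every implementation, native and paravirtualized, stays compiled into the one binary.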


The only paravirtualized hypervisors supported upstream in the Linux kernel are Xen for PV guest types (PV, PVH) and the demo lguest hypervisor. lguest is just demo code though; I'd hope no one is using it in production... I'd be curious to hear. Assuming no one sane is using lguest as a production hypervisor and we can phase it out, that leaves Xen PV solutions as the remaining case to study to see how we can simplify pv_ops. A highly distinguishing factor of the Xen PV guest types (Xen PV, Xen PVH) is that they have a unique, separate entry point into Linux when Linux boots on x86; Xen PV and Xen PVH guest types share this same entry path. That is, even if we wanted to remove as much as possible from pv_ops, we'd still currently have to take into account that Xen's modern ideal solution, with a mixture of "paravirtualization" and hardware virtualization, uses this separate entry path. Trying to summarize this without going into much detail, the different entry points and how x86-64 init works can be summarized as follows.

Bare metal, KVM, Xen HVM                      Xen PV / dom0
    startup_64()                             startup_xen()
           \                                     /
   x86_64_start_kernel()                 xen_start_kernel()
                        \               /
                   x86_64_start_reservations()
                                |
                           start_kernel()
                           [   ...        ]
                           [ setup_arch() ]
                           [   ...        ]
                               init

Although this is a small difference, it actually can have a huge impact on possible "dead code". You see, prior to pv_ops different binaries were compiled, and features and solutions which you knew you would not need could simply be negated via Kconfig. These negations were not done upstream -- they were only implemented and integrated in SUSE kernels, as SUSE was perhaps the only enterprise Linux distribution fully supporting Xen. Doing these negations ensured that code we determined should never run never got compiled in. Although this Kconfig solution was never embraced upstream, it doesn't mean the issue didn't exist upstream; quite the contrary, it obviously did, there was just no clean proposed solution to the problem and frankly no one cared too much about resolving it properly. However, an implicit consequence of embracing pv_ops and supporting different hypervisors with one binary is that we're now forced to have large chunks of code always enabled in the Linux kernel, some of which we know should not run once we know which path we're taking in the init tree above. Code cannot be compiled out, as our differences are now handled at run time. Prior to pv_ops the Kconfig solution was used to negate features that should not run on Xen, so issues would come up at compile time and could be resolved that way. This Kconfig solution was in no way a proactive solution, but it's how Xen support in SUSE kernels was managed. Using pv_ops means we need this resolved through alternative, upstream-friendly means.
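The trade-off can be sketched in a few lines. Under the old per-target builds, a feature like this would simply not be compiled for a Xen kernel; with one pv_ops binary it is always compiled in and must bail out at run time, which is exactly where dead code comes from. All names below are mock stand-ins, not kernel symbols.

```c
/* Run-time negation sketch: the feature is always compiled in, and a
 * flag set by the hypervisor-specific entry path decides whether it
 * actually runs. On Xen this code is "dead" but still present. */
static int running_on_xen;          /* set by the Xen entry path */
static int native_feature_setup_done;

static int feature_init_mock(void)
{
	if (running_on_xen)
		return 0;           /* compiled in, but must never run here */
	native_feature_setup_done = 1;
	return 1;                   /* native setup performed */
}
```

With per-target Kconfig builds the whole function would have been absent from the Xen binary, so a missing guard like this was a compile-time problem rather than a run-time crash.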


Next are just a few examples of dead code concerns I have looked into, but please note that there are more; I also explain a few of these. Towards the end I explain what I'm working on to do about some of these dead code concerns. Since I hope to have convinced you that people hate pv_ops, the challenge here is to come up with a really clean generic solution that 1) does not extend pv_ops, and 2) could also likely be repurposed for other areas of the kernel.
  • MTRR
  • IOMMU - initialization (resolved cleanly), IOMMU API calls, IOMMU multifunction device conflicts, exposed IOMMU ACPI tables (Intel VT-d)
  • Microcode updates - both early init and changes at run time
As I've studied some of the dead code concerns for the above features I've also identified an issue where the main x86-64 entry path is modified but the Xen init path is forgotten. When this happens, in the worst case you end up crashing Xen. I list two of these cases, one of which is still an issue for Xen. I call these init mismatch issues.
  • cr4 shadow
  • KASan
So both dead code concerns and init mismatch issues can break things, sometimes really badly. Some of the solutions in place today, and some that will be developed, are what I like to refer to as paravirtualization yielding solutions. When reviewing some of these issues below, keep in mind that this is essentially what we're doing; it should help you understand why we're doing what we're doing, or why we need more work in certain areas of the kernel.


Death to MTRR:


MTRR is an example of code that we know should not run when we boot Linux as Xen dom0 or as a guest, given that upstream Linux never implemented a solution to deal with MTRR through the hypervisor. MTRR calls, however, are in most cases not fatal if they fail; typically if MTRR calls fail you'd suffer a performance hit. Since MTRR is really old, we had the option to either add Linux hypervisor call support for MTRR on Xen, or work on an alternative that avoided MTRR amicably. Fortunately, a long time ago Andy Lutomirski figured we could replace direct MTRR calls with a no-op on PAT-capable systems, provided you also used the respective PAT-friendly ioremap call. So he added arch_phys_wc_add() to be used in combination with ioremap_wc(). This solved it for write-combining MTRR calls. He did a bit of the driver conversions needed for this work; it however was never fully completed. If you're following my development upstream you may have noticed that, among other things, for MTRR I completed where Andy left off, replacing all direct users of write-combining MTRR calls upstream in Linux with an architecture-agnostic write-combining call, arch_phys_wc_add(), in combination with ioremap_wc(). Instead of adding Linux MTRR hypervisor calls we now have a wrapper which calls MTRR only when we know it is needed; instead, PAT interfaces are used when available. Addressing write-combining MTRRs is just one small example of what we needed to address; there are other types of MTRRs you could use, and in the worst cases they were being used in incredibly hackish, but functional, ways. For instance, one driver was using two overlapping MTRRs: the PCI BAR was 16 MiB but the MMIO region for the device was in the last 4 KiB of the same PCI BAR.
You want to avoid write-combining on MMIO regions, but if we used one MTRR for write-combining without affecting the MMIO region we'd end up with only 8 MiB of write-combining and lose out on the rest of graphics memory, while using a 16 MiB write-combining MTRR meant we'd write-combine the MMIO region. The implemented hacky MTRR solution was to issue a 16 MiB write-combining MTRR followed by a 4 KiB UC MTRR. There were also two overlapping ioremap calls for this driver. The resolution, in a PAT-friendly way, included adding ioremap_uc() upstream, which sets PCD=1, PWT=1 on non-PAT systems and uses a PAT value of UC on PAT systems. We used this for the MMIO region; doing this ensures that if you then issue an MTRR on this region the MMIO region remains unaffected. The framebuffer was also carved out cleanly, and ioremap_wc() used on it. For details refer to:

x86/mm, asm-generic: Add IOMMU ioremap_uc() variant default
drivers/video/fbdev/atyfb: Carve out framebuffer length fudging into a helper
drivers/video/fbdev/atyfb: Clarify ioremap() base and length used
drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC
drivers/video/fbdev/atyfb: Use arch_phys_wc_add() and ioremap_wc()

But that's not all... even if all drivers have been converted over to never issue MTRR calls directly, the BIOS might still set up MTRRs on bootup, and the kernel still has to know about that to avoid conflicts with PAT. More work on this front is therefore needed, but at least the crusade to remove direct access to MTRR was completed in Linux as of v4.3.
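The driver conversion pattern above can be sketched with userspace mocks. The point is the shape of arch_phys_wc_add(): on PAT systems it does nothing and the write-combining semantics come from ioremap_wc() alone, so the driver no longer needs to know about MTRRs. These mocks only mirror that shape; the real implementations live in arch/x86 and are not these functions.

```c
/* Mocked sketch of the arch_phys_wc_add() shape: no-op on PAT systems,
 * fall back to a write-combining MTRR only on legacy non-PAT hardware.
 * All names and return values are illustrative stand-ins. */
static int pat_enabled_mock = 1;   /* pretend the CPU and kernel have PAT */
static int wc_mtrrs_in_use;

static int mtrr_add_wc_mock(unsigned long base, unsigned long size)
{
	(void)base; (void)size;
	wc_mtrrs_in_use++;
	return wc_mtrrs_in_use;    /* pretend MTRR handle */
}

static int arch_phys_wc_add_mock(unsigned long base, unsigned long size)
{
	if (pat_enabled_mock)
		return 0;          /* nothing to do: ioremap_wc() maps WC via PAT */
	return mtrr_add_wc_mock(base, size);
}
```

A converted driver simply calls the wrapper unconditionally after its ioremap_wc(), and the MTRR-or-PAT decision is no longer its problem.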

IOMMU:


A really clean solution to dead code, although it wasn't the only reason this went upstream, came from how IOMMU initialization code is handled with the IOMMU_INIT macros and struct iommu_table_entry. The solution in place had to account for different dependencies between IOMMU code; this dependency map is best explained by a diagram.

            [xen-swiotlb]
                 |
         +----[swiotlb *]--+
        /         |         \
       /          |          \
    [GART]     [Calgary]  [Intel VT-d]
     /
    /
 [AMD-Vi]

Dependencies are annotated, detection routines are made available, and there's a sort routine which makes these execute in the right order. The full dependency map is handled at run time; to review some of the implementation check out git log -p 0444ad93e..ee1f28, and just check out the code. When this code was proposed, hpa had actually suggested that this sort of problem is common enough that perhaps a generic solution could be implemented in Linux, and that the solution developed by the gPXE folks might be a good one to look at. As neat as this is, it still doesn't address all concerns though. Expect to see some possible suggested updates in this area.
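The table idea can be mocked in userspace along these lines: entries declare which detect routine they depend on, and a simple sort pass reorders the table so dependencies are probed first. This mirrors the spirit of the IOMMU_INIT entries and their sort routine, not the actual kernel structures.

```c
#include <stddef.h>
#include <string.h>

/* Userspace sketch of a dependency-annotated init table. A detect
 * routine identifies each entry; depends_on points at the detect
 * routine that must run first. Hypothetical types and names. */
typedef int (*detect_fn)(void);

struct table_entry_sketch {
	const char *name;
	detect_fn detect;
	detect_fn depends_on;      /* NULL if this entry has no dependency */
};

static int detect_xen_swiotlb(void) { return 0; }
static int detect_swiotlb(void)     { return 1; }
static int detect_gart(void)        { return 1; }

/* Deliberately listed out of dependency order. */
static struct table_entry_sketch table_sketch[] = {
	{ "GART",        detect_gart,        detect_swiotlb },
	{ "xen-swiotlb", detect_xen_swiotlb, NULL },
	{ "swiotlb",     detect_swiotlb,     detect_xen_swiotlb },
};

/* Naive repeated-pass sort: whenever an entry's dependency appears
 * later in the table, swap them; n passes settle the order. */
static void sort_table_sketch(struct table_entry_sketch *t, size_t n)
{
	for (size_t pass = 0; pass < n; pass++) {
		for (size_t i = 0; i + 1 < n; i++) {
			for (size_t j = i + 1; j < n; j++) {
				if (t[i].depends_on == t[j].detect) {
					struct table_entry_sketch tmp = t[i];
					t[i] = t[j];
					t[j] = tmp;
					break;
				}
			}
		}
	}
}
```

After sorting, the table reads xen-swiotlb, swiotlb, GART, matching the diagram above: each entry's dependency is detected before the entry itself.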

Microcode updates:


A CPU often needs software updates; these are known as CPU microcode updates. If using a hypervisor, though, your hypervisor should take care of these updates for you, as a guest should not have to fix real hardware. Additionally, if you do enable a guest to do updates on behalf of a full system, you may want to be selective about which guests are allowed to do this. Then there are the run-time update considerations. Some CPU microcode updates might disable some CPU ops; if you do this on a hypervisor with code already running, some of that code might break as it assumes those CPU ops are still valid. This could cause some unexpected situations for guests. Doing run-time CPU microcode updates after a system has booted should therefore be avoided and only done if you are 100% certain you can do it, and you have full hardware and software vendor support for it. The CPU microcode update must be designed for a run-time update. As far as Linux is concerned, we avoid enabling CPU microcode updates by bailing out of the CPU microcode init code if pv_enabled() returns true. This works, but it turns out it is not an ideal solution; the reason is that pv_enabled() really should probably be renamed to something such as pv_legacy(), as it only returns true if you have a legacy PV solution. Expect some updates on this upstream soon. If folks desire run-time CPU microcode updates on Xen, work is required on the Xen side to copy the buffer to Xen, scan the buffer for the correct patch, and finally rendezvous all online CPUs in an IPI to apply the patch, keeping the processors in until all have completed it. I hacked up a version for the hypervisor which just does quiescing by pausing domains; that obviously needs more work, and someone interested should pick up on it. Refer to Xen microcode updates for Xen-specific documentation or to read the latest notes on developing this for the Xen hypervisor. At this time, it's not clear where KVM keeps this documentation.
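The bail-out pattern described above boils down to a guard at the top of the init routine. Here is a mocked sketch of that shape; the flag and return values are stand-ins, with -22 playing the role of -EINVAL.

```c
/* Sketch of the microcode loader guard: when running as a
 * paravirtualized guest, microcode is the hypervisor's job, so the
 * loader refuses to initialize. Mock names, not kernel symbols. */
static int pv_enabled_mock;        /* stand-in for the kernel's PV check */

static int microcode_init_mock(void)
{
	if (pv_enabled_mock)
		return -22;        /* -EINVAL: not our job under a PV hypervisor */
	return 0;                  /* proceed with native microcode loading */
}
```

The weakness the post points out is in the condition itself: a check that only fires for legacy PV guests leaves other virtualized configurations to reach code that should equally never run.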

Init mismatch issues:


We have a dual entry path on x86; we have to live with that now, but at times this is overlooked, and it can happen to the best of us. For instance, when Andy Lutomirski added support to shadow CR4 per CPU on the x86-64 init path, he forgot to add a respective call for Xen. This caused a crash on all Xen PV guests and dom0. Boris Ostrovsky fixed this for 64-bit PV(H) guests. I'm told code review is supposed to catch these issues, but I'm not satisfied; the fix here was purely reactive. We could and should do better. A perfect example of further complications is when Linux got KASan support, the kernel address sanitizer. Enabling KASan on x86 will crash Xen today, and this issue is not yet fixed. We need a proactive solution. If we could unify init paths, would that help? Would that be welcomed? How could that be possible?
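The bug class is easy to see in miniature: a new early-init step has to be wired into both entry paths, and nothing forces the second path to get it. A mocked sketch, with hypothetical names:

```c
/* Init mismatch sketch: the native entry path gains a new early-init
 * step, and the separate Xen entry path only works if someone
 * remembers to mirror it. All names are illustrative stand-ins. */
static int cr4_shadow_initialized;

static void cr4_init_shadow_mock(void)
{
	cr4_shadow_initialized = 1;
}

static void x86_64_start_kernel_mock(void)
{
	cr4_init_shadow_mock();    /* the native path got this from day one */
}

static void xen_start_kernel_mock(int fixed)
{
	if (fixed)
		cr4_init_shadow_mock();  /* the reactive fix mirrors the call */
	/* without it, later code relying on the shadow misbehaves */
}
```

Nothing in the structure ties the two paths together, which is why such breakage is only caught when a Xen guest actually boots and crashes.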


What to do


The purpose of this post is to create awareness of what dead code is, to make you believe it's real and important, and to argue that if we could come up with a clean solution we could probably re-use it for other purposes -- it should be welcomed. I'm putting a lot of emphasis on dead code and init mismatch issues because without this post I probably would not be able to talk to anyone about them and expect them to understand what I'm talking about, let alone have them understand the importance of the issue. The virtualization world is likely not the only place that could use a solution to some of these dead code concerns. I'll soon be posting RFCs for a possible mechanism to help with this; if you want a taste of what this might look like, you can take a peek at the userspace table-init mockup solution that I've implemented. In short, it's a merge of what the gPXE folks implemented with what Konrad worked on for IOMMU initialization, giving us the best of both worlds.