Monday, December 14, 2015

Xen and the x86 Linux zero page

This is part II, for part I - refer to "Avoiding dead code: pv_ops is not the silver bullet".

On x86 Linux the boot sequence is rather complicated, so much so that it has its own dedicated boot protocol. This is documented upstream on Documentation/x86/boot.txt. The protocol tends to evolve as the x86 architecture evolves, in order to compensate for new features or extensions which could we need to learn about at boot time. Of interest to this post is the "zero page". The first step when loading a Linux kernel is to load the "zero page", this consists of a the structure struct boot_params, defined in arch/x86/include/uapi/asm/bootparam.h. Its called zero page as unless you're relocating data around, the the zero page is the first physical page of the operating system. The x86 boot protocol originally only had to support 16-bit boot protocol, to do this it required first to load the real-mode code (boot sector and setup code). For modern bootloaders what needs to be loaded is a bit larger, but new bootloaders must still load the same original real-mode code. The struct boot_params accounts for this evolution in requirements, the real-mode section is what is defined in the struct setup_header. The zero page is not only something which we must load, its also part of the actual bzImage we build on x86. One can therefore read a kernel file's struct boot_params as well to extract some details of the kernel. To try this you can play around with parse-bzimage, part of the table-init tree on github. All this sort of stuff is what bootloaders end up working with. Since hypervisors can also boot Linux they must also somehow do the same. This post is about how Xen's zero-page setup design, we'll contrast it to lguest's zero page setup. lguest is a demo 32-bit hypervisor on Linux.

If a hypervisor boots Linux it must also set up the zero page. We'll disect Xen's set up of the zero page backwards, from tracing what we see on Linux down to Xen's setup of the zero page. Xen's entry into Linux x86 for PV type guests (PV, PVH) is set up and annotated on the ELF binary as an ELF note, in particular the XEN_ELFNOTE_ENTRY. On Linux this is visible on arch/x86/xen/xen-head.S as follows:

ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          _ASM_PTR startup_xen)

startup_xen is the respective first entry point of code used on Linux by Xen PV guest types, its defined earlier above in the asm code on the same file. Its implementation is rather simple, enough so we can include it here:

#ifdef CONFIG_X86_32
   mov %esi,xen_start_info
   mov $init_thread_union+THREAD_SIZE,%esp
   mov %rsi,xen_start_info
   mov $init_thread_union+THREAD_SIZE,%rsp
   jmp xen_start_kernel

On x86-64 this sets up what was in rsi to xen_start_info, it then uses what was on rsp to set up the stack before jumping to the first C Linux entry point, xen_start_kernel. The Xen hypervisor must have set up rsi and rsp. This is a bit different than what we expected...

Let's backtrack and show you what perhaps a sane Linux kernel developer expected you to set up, to do this let's look at how lguest loads Linux. lguest's launcher is implemented on tools/lguest/lguest.c. Of interest to us it parses the file we pass it as a Linux kernel binary and tries to launch it via load_kernel(). Read load_bzimage(), it reads the kernel passed, checks the magic string is present and loads the zero page from the file onto its own memory's zero page, finally returning boot.hdr.code32_start. This later part is used to kick off control into the kernel as its starting entry point. But of importance as well to us is that the zero page was read from the file, and used as a base to set up the "zero page". The lguest zero-page is further customized after load_kernel(), lets see a few entries below.

int main(int argc, char *argv[])
  /* Boot information is stashed at physical address 0 */
  boot = from_guest_phys(0);

   * Map the initrd image if requested
   * (at top of physical memory)
  if (initrd_name) { 
    initrd_size = load_initrd(initrd_name, mem);

    /* start and size of the initrd are expected to be found */
    boot->hdr.ramdisk_image = mem - initrd_size;
    boot->hdr.ramdisk_size = initrd_size;

    /* The bootloader type 0xFF means "unknown"; that's OK. */
    boot->hdr.type_of_loader = 0xFF;
   * The Linux boot header contains an "E820" memory
   * map: ours is a simple, single region.
  boot->e820_entries = 1;
  boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM });

   * The boot header contains a command line pointer:
   * we put the command line after the boot header.
  boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1);

   * We use a simple helper to copy the arguments
   * separated by spaces.
  concat((char *)(boot + 1), argv+optind+2);

  /* Set kernel alignment to 16M (CONFIG_PHYSICAL_ALIGN) */
  boot->hdr.kernel_alignment = 0x1000000;

   * Boot protocol version: 2.07 supports the
   * fields for lguest.
  boot->hdr.version = 0x207;

   * The hardware_subarch value of "1" tells the
   * Guest it's an lguest.
  boot->hdr.hardware_subarch = 1;
And that's how sane Linux kernel developers expected you to do Linux kernel loading. Why does Xen's setup look so odd? What's this xen_start_info crap? Let's brace ourselves and dare to have a look at the Xen hypervisor setup code.

Xen defines what it ends up putting into the "xen_start_info" through a data structure it calls struct start_info, defined in xen/include/public/xen.h, it refers to this as the "Start-of-day memory layout". Of interest to us is who sets this up, for x86-64 this is done via vcpu_x86_64(), the relevant parts for is are listed below.

memset(ctxt, 0, sizeof(*ctxt));
ctxt-> = dom->parms.virt_entry;
ctxt->user_regs.rsp = dom->parms.virt_base +
  (dom->bootstack_pfn + 1) * PAGE_SIZE_X86;  
ctxt->user_regs.rsi = dom->parms.virt_base +
  (dom->start_info_pfn) * PAGE_SIZE_X86;

The dom's params are set up via xc_dom_parse_bin_kernel(), as with lguest it has a file parser and uses this to set up some information, and it also extends some information, but it never really sets up the zero-page. Instead it actually sets up its own set of data structures representing the struct start_info. It turns out the setting of the zero-page for PV guests is done once running Linux inside Linux kernel code on the first Xen C entry point for Linux, on xen_start_kernel() on arch/x86/xen/enlighten.c !

/* First C function to be called on Xen boot */
asmlinkage __visible void __init xen_start_kernel(void)
  if (!xen_start_info)
  /* Poke various useful things into boot_params */
  boot_params.hdr.type_of_loader = (9 << 4) | 0;
  boot_params.hdr.ramdisk_image = initrd_start;
  boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
  boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);

Its not documented so I can only infer that the architectural reason for this was to account for the different operating systems that Xen has to support, its perhaps easier to work with a generic data structure, populate that, and then have the kernel specific solution parse it out. While this might have been an original design consideration, it also has implicated a diverging entry point solution for Linux, which as I've highlighted recently in my last post on dead code on pv_ops, isn't ideal for Linux. The challenge to any alternative is to not be disruptive and remain compatible, not extend pv_ops, and providing a generic solution which might be useful elsewhere.
Post a Comment