Monday, November 17, 2014

Automating backport kernel integration support

I cringe when I see a task done manually that could be automated, but complex tasks are not trivially automatable -- to even fathom the possibilities on complex tasks you sometimes have to divide the work into subtasks and then see whether a series of them can be automated, and which ones cannot. I've had a hunch about the prospects of fully automating Linux kernel backporting for a while now, and over the years a set of advances and practices on the backports project has increased my confidence in those prospects; one of them was the paper Increasing Automation in the Backporting of Linux Drivers Using Coccinelle SmPL. If a long paper is too much to digest, check out the Automatically backporting the Linux kernel video from my presentation at the 2014 SUSE Labs Conference (and if you want to learn about Coccinelle SmPL check out Julia Lawall's Coccinelle tutorial from the same conference), or my previous blog post about it. Towards the end of my presentation I hint at further prospects in automation, with the possibility of the shared backports layer programming itself to target collateral evolutions, but I'll now review one feature some folks have pestered me about for a while: direct kernel integration with backports, which I recently completed during the 2014 SUSE Hackweek.

We now have the framework to optimize backporting collateral evolutions with the use of patches, Coccinelle SmPL grammar patches, and a shared layer. The development flow we follow helps us track linux-next daily, and this reduces the amount of work needed when we're close to a release made by Torvalds or Greg KH. Although we make both daily linux-next based releases and stable releases, what we provide is a tarball, and users and system integrators have had no way of making what we provide non-modular. This is a problem for some ecosystems such as Android and ChromeOS which do not like to ship modules. You can technically take such releases, modify them somehow, and integrate them so that these drivers can be built in, and although I know some folks have used this strategy before (ChromeOS was one; OpenWrt has been doing this for years), it's not easy to keep up to date, and when a new release is made you have to redo all the work. As of backports-20141114 we now have backports kernel integration support merged. What this means is that folks who need to stick to an older kernel as a base can use the backports project to integrate drivers from future kernels into their kernel, with full kconfig support. You get what you'd expect: a new entry under 'make menuconfig' which lets you enter a submenu where you can enable, either as modules or built-in, device drivers / subsystems from future kernels to replace your older kernel's drivers / subsystems. The work to integrate a backports release is therefore now automated.

As you'd expect, device drivers from future kernels can only be selected if the respective older driver is disabled. You can opt to compile backported drivers as modules or built-in. The ability to compile device drivers as built-in also opens the possibility of backporting features and components from the kernel which we were previously not able to backport. Integration support enables a one-shot full integration from a future release into an older release; the way to upgrade then simply requires rebasing your kernel as you bump your base kernel and doing another kernel integration when needed. If you are not rebasing your kernel and only want to upgrade to a newer set of backported drivers, you can just drop the old backports/ directory and attempt a new integration with the newer release. This means you should clearly document non-upstream cherry picks on top of a backport integration, cherry pick them out, and later merge them back in. This purposely favours an upstream development work flow: if your cherry picks are en route upstream, then when you bump to a new backport you will likely drop most of the cherry picks you carry; in fact if you have policies in place to ensure they are upstream by a future release integration you'd always be striving towards 0 delta, and of course, 0 delta would imply fully automated backport work. I hope this alone might encourage some folks to reconsider their own development work flows a bit, in particular those with over 6 million lines of code of delta, and umm, with it taking them over 6 months to complete a rebase ;) ... On a modern laptop the integration takes about 1-2 minutes to complete. More details are available on the backports wiki section on backports kernel integration support. If you have any questions poke on IRC at #kernel-backports on freenode or join the backports mailing list.
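To give a rough idea of the result: after an integration the backported options live under their own Kconfig namespace alongside the originals. Hypothetically, your .config could end up looking like the fragment below -- note that the BACKPORT_ symbol prefix and the driver name here are my assumptions for illustration, not taken from the backports documentation:

```
# the older in-tree driver must be disabled...
# CONFIG_IWLWIFI is not set
# ...before the backported replacement can be enabled built-in
CONFIG_BACKPORT_IWLWIFI=y
```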

Monday, August 25, 2014

Hacking on systemd with OpenSUSE

I recently had no other option but to hack on systemd :*( and found there wasn't any documentation on how to do this on OpenSUSE. Replacing your /sbin/init isn't as simple as it used to be back in the day. Eventually I figured things out with a few hiccups, and apart from the actual ability to hack on and install systemd I also picked up a few good practices you can use while testing, and dealt with installing kdbus as I was tired of seeing those pesky warnings from systemd without it. My first assumption that things would just work if I installed over my base install proved incorrect, so avoid that ;) -- I'll cover doing this with containers instead. While I don't yet have access to edit the freedesktop wiki I figured I'd document my steps here and move that documentation later if and when granted access.

First you need the equivalent of a debootstrap a la OpenSUSE. Since OpenSUSE is now a rolling distribution this documentation will focus on using those repositories. Since OpenSUSE embraces btrfs fully, and btrfs has copy-on-write bells and whistles to help you save space, this small guide will also provide instructions on using the btrfs snapshot capability to let you use a base OpenSUSE install for further "branch" type hacking. This will let your copies of the original install share the same base blocks on the hard drive and only take up extra space once you've modified the system. If you don't want to use the btrfs snapshot feature just ignore the btrfs commands and create a directory instead. This should let you hack without using up gobs of space. Consider this a small supplement on hacking and testing systemd in a virtualized environment. As of 2014-08-05 the instructions here will create a small container that takes about 333 MiB of space.

First get your repos set up with the latest rolling distribution repo, if using btrfs might as well use the btrfs snapshot feature:

$ sudo btrfs sub create /opt/opensuse/
# If you don't want to use the snapshot just create the directory
$ sudo mkdir -p /opt/opensuse

This will let you install package binaries with zypper install:

$ sudo zypper --root /opt/opensuse/ ar repo-oss

Quite a few packages require /dev/zero to be available:

$ sudo mkdir /opt/opensuse/dev/
$ sudo mknod /opt/opensuse/dev/zero c 1 5
$ sudo chmod 666 /opt/opensuse/dev/zero

Then install a minimal set for hacking:

$ sudo zypper --root /opt/opensuse/ install rpm zypper wget vim sudo

Now get qemu-kvm and then load the kvm module:

$ sudo zypper install qemu-kvm
$ sudo modprobe kvm-intel

Next you should launch systemd-nspawn (the systemd chroot equivalent), change your root password before booting into it, and enable root login from the console.

$ sudo systemd-nspawn -D /opt/opensuse
Timezone America/New_York does not exist in container, not updating container timezone.
Directory: /root
Tue Aug 5 17:39:47 UTC 2014
opensuse:~ # passwd
New password:
Retype new password:
passwd: password updated successfully

By default OpenSUSE won't let you log in to the console as root; to enable that do:

opensuse:~ # echo console >> /etc/securetty

pam_loginuid.so does not work inside a container, so comment it out as well:

opensuse:~ # sed -i 's/^\(session\s*required\s*pam_loginuid.so\)/#\1/' /etc/pam.d/login

To make it easier to hack it'd be ideal to also enable root access without a password; that involves making some PAM changes and disabling the password for root. This still doesn't work for me, so it is incomplete for now -- ignore the next step, I leave it here in case anyone wants to continue to chug along that route and figure out the remaining steps.

opensuse:~ # sed -i 's/root:.*:\([0-9]*\)::::::/root::\1::::::/' /etc/shadow

Now you should be able to boot into it as a container. First shut down the container you were just in:

opensuse:~ # systemctl halt

Now give your new container a fresh spin with -b

$ sudo systemd-nspawn -bD /opt/opensuse 3

The -b tells systemd-nspawn to boot the container's init, and the 3 is passed on to systemd to boot the equivalent of runlevel 3 (the multi-user target); a target is a way to group up required services. You should be able to log in as root.

Eventually you'll want to list and manage any deployed containers, which includes killing them. For that you can use machinectl from your host system, not from within the container.

$ machinectl
MACHINE                          CONTAINER SERVICE        
opensuse                         container nspawn         

1 machines listed.

To kill the one you just started for example:

$ sudo machinectl terminate opensuse
$ machinectl
MACHINE                          CONTAINER SERVICE        

0 machines listed.

To start hacking, create a new snapshot based on the original. This will let us easily create new OpenSUSE containers to hack on. Kill the base container first with machinectl before doing this though.

$ sudo btrfs sub snap /opt/opensuse /opt/opensuse-hack1
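If you plan on spinning up several of these hack snapshots, the step above can be wrapped in a tiny helper. This is just a sketch of mine (the function names are made up), with the path logic split out from the privileged calls:

```shell
#!/bin/sh
# Hypothetical helper: derive the snapshot path for a named hack
# container from the /opt/opensuse base used in this guide.
snap_path() {
    echo "/opt/opensuse-$1"
}

# Snapshot the base and boot the copy (needs root and a btrfs base):
hack_on() {
    dest=$(snap_path "$1")
    sudo btrfs sub snap /opt/opensuse "$dest"
    sudo systemd-nspawn -bD "$dest" 3
}
```

Usage would then be `hack_on hack1`, `hack_on hack2`, and so on, each sharing the base blocks thanks to copy-on-write.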

And then go at it in /opt/opensuse-hack1 to hack on your stuff. You can now follow the instructions on the freedesktop wiki on hacking on systemd in a virtualized environment, but it doesn't tell you to uninstall the distribution's version of systemd -- doing so is recommended; at least I ran into issues without doing this. To do that just remove the files the rpm installs. You can do this several ways.

From within your system, targeting the new container path:

$ rpm -ql --root /opt/opensuse-hack1/ systemd | sed -e 's|\(.*\)|/opt/opensuse-hack1\1|' | xargs rm -f

Something a bit safer if you don't trust the above:

$ CONT="/opt/opensuse-hack1/"
$ for i in $(rpm -ql --root $CONT systemd); do if [[ -f $CONT/$i ]]; then sudo rm -f $CONT/$i ; fi ; done

And finally, another simple and secure way to do this from within the container; your container will become useless after this though, so you'll have to kill it from your host system with machinectl afterwards.

linux:~ # rpm -ql systemd | xargs rm -f

All you need now is to compile systemd from sources on your host system and install with DESTDIR=/opt/opensuse-hack1/, but be very sure to also pass the --with-rootprefix= option, as by default systemd will leave it blank.

$ ./
$ ./configure CFLAGS='-g -O0 -ftrapv' --enable-compat-libs --enable-kdbus --sysconfdir=/etc --localstatedir=/var --libdir=/usr/lib64 --enable-gtk-doc --with-rootprefix=/usr/ --with-rootlibdir=/lib64  
$ sudo DESTDIR=/opt/opensuse-hack1/ make install

As of 2014-08-05 systemd built from source will by default want the shiny new kdbus. Go read up on the LWN kdbus article; then, since kdbus is not yet in the kernel, you'll want to compile a fresh vanilla kernel (I don't provide instructions for that here), install it, and then compile and install kdbus as a module from the external repo:

git clone
cd kdbus
# Use a known compilable version at least if you're on v3.16.0-rc7
git reset --hard 1f63f96686f9398eedde86b4e08581d14c6e403a
sudo make install

Finally you can now give your container a spin.

$ sudo systemd-nspawn -bD /opt/opensuse-hack1

To be sure you are getting a new systemd you can check the version with systemd --version from within the container.

Friday, July 25, 2014

Colored diffs with mutt

I cannot stand reviewing patches with gmail or any GUI e-mail client; I use mutt. In my last post I explained how you can apply patches directly from within mutt onto a git tree with a few shortcuts, without leaving the terminal. This small post provides the next step to allow you to grow a mustache... I mean, to get you to enjoy your mutt experience even more when reviewing patches, by getting you colored diffs matching the same colors provided by good ol' 'git diff'. Edit your .muttrc file and add these:

# Patch syntax highlighting
color   normal  white           default
color   body    brightwhite     default         ^[[:space:]].*
color   body    brightwhite     default         ^(diff).*
color   body    green           default         ^\+.*
color   body    red             default         ^\-.*
color   body    white           default         ^\-\-\-.*
color   body    white           default         ^\+\+\+.*
color   body    brightblue      default         ^@@.*
color   body    brightwhite     default         ^(\s).*
color   body    brightwhite     default         ^(Signed-off-by).*
color   body    brightwhite     default         ^(Cc).*

Thursday, July 17, 2014

Applying patches from mutt onto a git tree easily

This post is for project maintainers using git who wish to merge patches easily into a project directly from mutt. Projects using git vary in size and there are many different ways to merge patches from contributors. What strategy you use can depend on whether you are expecting to merge hundreds of patches or just a few. If you happen to be very unfortunate and are forced to use Gerrit, a mechanism was chosen for you for review and for how patches get merged / pushed. If you're using raw git directly you can do whatever you like. For big projects git pull requests are commonly used. Small projects can instead live with manual patch application from a mailbox. Even large projects can't realistically expect folks to submit every patch as a pull request, so manual patch application also applies to large projects. How you get a patch out of your inbox and merged will vary depending on what software you use to read your mailbox. Tons of folks are using gmail these days and even there it's not that easy: you'd have to go to the right pane, go to the drop down menu and select "Show original", save that page as a text file, edit it to remove the top junk right before the From: line, and finally you can git am that file.

This doesn't scale well. A plugin could surely help but bleh, the command line is so much better. For that you can use mutt. The typical approach on mutt is to use the default hooks to save a file to disk and then go and 'git am' it. It'd be much easier if we had hooks to apply patches directly onto a git tree though. The following are configuration options and a bit of shell that will allow that. Ben Hutchings's blog post on git and mutt in 2011 described a way to extract patches into a directory from which you'd just git am them. Those instructions no longer work on newer versions of mutt, so I'll provide updated settings and also extend these hooks to allow you to apply patches without even having to drop down to another shell, while also giving you the option to inspect them manually if you wish.

Here's what I have on my .muttrc :

macro index (t ';|~/mailtogit/mail-to-mbox^M'  "Dumps tagged patches into ~/incoming/*.mbox"
macro index (a '!~/mailtogit/git-apply-incomming^M'  "git am ~/incoming/*.mbox"
macro index (g ';|~/mailtogit/git-apply^M'  "git am tagged patches"
macro index (r '!rm -f ~/incoming/*.mbox^M'  "Nukes all ~/incoming/"
macro index (l '!ls -ltr ~/incoming/^M'  "ls -l ~/incoming/"
macro index ,t '|~/mailtogit/mail-to-mbox^M'  "Dumps currently viewed patch into ~/incoming/*.mbox"
macro index ,g '|~/mailtogit/git-apply^M' "git am currently viewed patch"
macro index ,a '!~/mailtogit/git-abort^M' "git am --abort"
macro index ,r '!~/mailtogit/git-reset^M' "git reset --hard origin"

The first hook (t allows you to dump patches you tag into an ~/incoming/ directory, mutt will show you what those are. The (a will apply all the patches that you just took out into that directory. The (g hook will merge the two steps into one and just dump the tagged patches and apply them immediately. If you have to clear the ~/incoming/ directory just use the (r hook. If you'd like to review what's in that directory you can use the (l hook. With ,t you can dump the currently viewed patch into ~/incoming/, this lets you extract a patch without tagging it. The ,g hook will also skip having to tag a patch and just apply it. If you want to abort a 'git am' operation you can use ,a. Finally to reset your tree to origin, just use the ,r hook.

This all depends on 5 small scripts. The ones that change directory obviously tie these scripts to one single project, so the question arises as to how to generalize this so that mutt is aware of the project a patch was sent for and can apply it to the right tree, without having to stuff mutt with tons of project specific hooks. Two approaches come to mind: one is to have the shell script read the List-ID tag, for example List-ID: , and have a mapping of those to git trees. The other is to trust the directory the e-mail went in under mutt, which assumes you already had filters for each List-ID. The issue with both of these approaches is that at times a patch may go to multiple lists, but in Linux's case, where this does apply, it should be specific to at least one git tree you care about, unless I guess you are maintaining multiple subsystems. Another possibility that comes to mind is to have git format-patch add yet-another-tag into the e-mails it spits out for submission, perhaps Git-ID: and the tree? This also has some issues for many reasons, so for now this is what I have and use. Let me know if you come up with something more generic.
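As a sketch of the first approach, the List-ID mapping could look something like this -- note the list names, tree paths, and helper names below are all made up for illustration:

```shell
#!/bin/sh
# Hypothetical sketch: route a patch read on stdin to the right git
# tree based on its List-ID header. All mappings below are examples.

tree_for_list() {
    case "$1" in
        backports.vger.kernel.org)  echo "$HOME/backports" ;;
        xen-devel.lists.xen.org)    echo "$HOME/xen" ;;
        *) return 1 ;;
    esac
}

list_id_of() {
    # Extract the list name from a "List-ID: <name>" style header
    grep -i -m1 '^List-ID:' "$1" | tr -d '<>' | awk '{print $NF}'
}

apply_from_stdin() {
    mbox=$(mktemp)
    cat > "$mbox"
    tree=$(tree_for_list "$(list_id_of "$mbox")") || {
        echo "no tree mapped for this list" >&2
        return 1
    }
    ( cd "$tree" && git am -s "$mbox" )
}
```

The mutt macro would then pipe the message into apply_from_stdin instead of a per-project script.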


~/mailtogit/mail-to-mbox :

formail -cds ~/mailtogit/procmail -
ls -l ~/incoming/

~/mailtogit/git-apply-incomming :

cd ~/backports
git am ~/incoming/*.mbox

~/mailtogit/git-apply :

rm -f ~/incoming/*
formail -cds ~/mailtogit/procmail -
cd ~/backports
git am -s ~/incoming/*.mbox

~/mailtogit/git-abort :

cd ~/backports/
git am --abort
rm -f ~/incoming/*

~/mailtogit/git-reset :

cd ~/backports/
git reset --hard origin
rm -f ~/incoming/*

Tuesday, May 27, 2014

Building and booting vanilla Xen on vanilla Linux with systemd

If you want to do Xen development you should be working with upstream sources, and you should be sending your patches upstream ASAP, that is, before they are even in production. There simply should be no ifs or doubts about this; doing it any other way is simply detrimental in the long run. I'm new to virtualization, but from an architectural point of view I consider kvm a good reaction to virtualization evolution, with a focus on a clean new architecture that pairs up best with only the latest hardware enhancements. The decision not to support new bells and whistles for things that could be done through software, but were instead designed with hardware support, eliminates tons of support code on the software side, but it obviously relies on the assumption that folks will upgrade hardware and that the hardware was designed properly. Xen however comes with a rich history, experience, and flexibility, and as such it's important to realize that there is no easy way to claim which is the better solution right now.

One thing I'm sure of: both solutions at this point have a rich set of expertise and design goals to be learned from. The one thing I see kvm doing right is pushing Upstream First (TM) as a motto. Xen should learn from that strategy as there are markets and innovative groups who appreciate this tremendously. With the rapid pace of evolution of the Linux kernel there is simply no other way, and because of this Xen development should change to a must-be-working-upstream-only model and join the Upstream First (TM) bandwagon. In this post I will dive into the recipes required to get the latest Xen and vanilla Linux sources and get you started on the Upstream First (TM) bandwagon with Xen. I provide instructions for getting both Xen and the upstream Linux kernel configured properly. I will ignore anything not upstream on the Linux kernel side, as what we need to do with that delta is just get it upstream. Additionally, since even Debian has cast its votes on supporting systemd as a Linux init replacement, I'll also provide instructions on how to get systemd support on xen with active socket support, as it seems that's the way of the future for all Linux distributions. Both Fedora 20 and OpenSUSE 13.1 have already jumped on systemd so you'll want proper systemd support for these. As it stands right now Xen does not have service unit files as part of its upstream sources; patches are in the works though, and this post also illustrates some corner cases found while implementing support, some general systemd autotools helpers defined to make it easier for others to integrate systemd support, and an example code base which makes elaborate use of these helpers.

Please note that compiling xen with systemd support enables the binaries to be used on systems running either legacy init or systemd when using the v5 series of integration patches documented here. The systemd support patches are not yet merged upstream, but to help provide wider coverage you should enable systemd support as per the instructions below and report any issues you find to me. Since I wish for as many folks as possible to jump on the upstream bandwagon I'll cover instructions only for getting the latest xen to run on the latest stable vanilla kernel over a slew of Linux distributions; this includes the Linux kernel as well as xen, and resolving all your dependencies. I recommend building and embracing oxenstored for reasons I've stated before; after all, if you run into issues with the latest systemd series of patches you can easily revert back to cxenstored with a simple flip in the configuration file, either /etc/sysconfig/xencommons (rpm based distributions) or /etc/default/xencommons (Debian based distributions). (Note: this last part still needs to be worked on; right now this requires a bit more work for systemd.)
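That flip is roughly of this shape -- a sketch only; treat the exact variable name used by xencommons as an assumption of mine:

```
# /etc/sysconfig/xencommons (rpm based distributions)
# /etc/default/xencommons (Debian based distributions)
# Point back at the C xenstored instead of oxenstored:
XENSTORED=/usr/local/sbin/xenstored
```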

I have build-tested the instructions below on OpenSUSE Tumbleweed, Debian testing, and Fedora 20. I have only runtime-tested this on OpenSUSE Tumbleweed and Debian testing. Reports of any runtime issues on Fedora 20 and Ubuntu are appreciated. Instructions for other Linux distributions are welcome so I can extend the documentation here while the systemd support patches get baked upstream; after that I will move all the documentation to the xen wiki.

Getting an updated /sbin/installkernel 


Linux distributions shipping with grub2 will need to ensure that their /sbin/installkernel script, which has to be provided by each Linux distribution, copies the kernel configuration at custom kernel install time. The requirement for the config file comes from upstream grub2's /etc/grub.d/20_linux_xen, which will only add xen as an instance to your grub.cfg if it finds the right options in your config file. Without this a user compiling and installing their own kernel with proper support for xen, and with the xen hypervisor present, will not get their respective grub2 update script to pick up the xen hypervisor. Debian testing has proper support for this; OpenSUSE required a change upstream in mkinitrd, so OpenSUSE folks will want to get the latest /sbin/installkernel hosted on the OpenSUSE mkinitrd repository on github.

# If on OpenSUSE update your /sbin/installkernel
git clone
cd mkinitrd
sudo cp sbin/installkernel /sbin/installkernel 

Fedora might need a similar update. I welcome feedback on confirming this.
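The essential step the updated script performs can be sketched as follows; this is a simplification of mine, not the actual mkinitrd code, and the function name is made up:

```shell
#!/bin/sh
# Hypothetical sketch: alongside vmlinuz and System.map, installkernel
# should also place the build's .config into /boot so that grub2's
# /etc/grub.d/20_linux_xen can inspect it for xen support.
install_config() {
    config=$1    # path to the built kernel's .config
    version=$2   # e.g. 3.16.0
    bootdir=$3   # normally /boot
    cp "$config" "$bootdir/config-$version"
}
```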

Xen systemd build dependencies on OpenSUSE



# If you're on the latest OpenSUSE you'll note it's now a
# rolling distribution (also called Factory).
# The default instructions do not actually encourage you to
# install the source repositories, and even if you did
# install them the instructions disable them by default, so
# be sure to install and enable them, otherwise
# the command zypper source-install -d won't work.
# To enable the required repository if you already had it
# installed:
sudo zypper mr -e repo-src-oss

# Get the build dependencies for Xen
sudo zypper source-install -d xen

# Things not picked up by the build dependencies
sudo zypper install systemd-devel gettext-tools \
ocaml ocaml-compiler-libs ocaml-runtime \
ocaml-ocamldoc ocaml-findlib glibc-devel-32bit make patch

# Get build dependencies for Linux
sudo zypper source-install -d kernel-desktop

Xen systemd build dependencies on Debian testing and maybe Ubuntu


Note that these instructions are not meant to enable systemd as the init process on Debian, although some of them may help you if you wish to venture there.

sudo apt-get build-dep xen linux
sudo apt-get install git libsystemd-daemon-dev \
libpixman-1-dev texinfo

Xen systemd build dependencies on Fedora 20 


Fedora may need an update to /sbin/installkernel as OpenSUSE did for grub2 support, see the notes above for more details on that. Verification on this is appreciated.

# Get build dependencies for xen
sudo yum-builddep xen

# Things not picked up by the build dependencies
sudo yum install glibc-devel.x86_64 systemd-devel.x86_64

# Get build dependencies for Linux
sudo yum-builddep kernel 

Getting the code

Next go get Linux and Xen sources.

git clone git://
git clone git://

Configuring vanilla Linux with xen support


cd linux
patch -p1 < linux-xen-defconfig.patch
cp /boot/config-your-distro-config .config
make xendom0config
make -j $(getconf _NPROCESSORS_ONLN)
sudo make install


Configuring xen with oxenstored and systemd support


cd xen
git reset --hard 86216963fd1d89883bb8120535704fdc79fdad50
git am all-v5-series-xen-systemd.patch
./configure --with-xenstored=oxenstored --enable-systemd
make dist -j $(getconf _NPROCESSORS_ONLN)
sudo make install
sudo ldconfig

# If on systemd, that is, if you have /run/systemd/system/
sudo systemctl daemon-reload

The next step is to enable the systemd unit services you want. If you want to test the active socket stuff, just enable xenstored.socket; after reboot you can use netcat as root to tickle the socket as described below. If you just want the xenstored service already running, enable xenstored.service, which will also enable xenstored.socket as it is a dependency.

sudo systemctl enable xenstored.socket
sudo systemctl enable xenstored.service

The last step is to ensure the grub config is updated to pick up the xen hypervisor. This varies between Linux distributions. Below we cover the distributions that I have tested booting on.

Updating grub for Xen on OpenSUSE


sudo update-bootloader --refresh

Updating grub for Xen on Debian and maybe Ubuntu


sudo update-grub

Reboot and test 


That's all; reboot and make sure you pick the right grub entry. Typically grub2 will list regular kernel entries and hypervisor entries separately, with the option to go into advanced settings for each. Entering the advanced settings for the hypervisor will let you pick the exact kernel you want to boot. If you have hardware with virtualization capabilities you'll want to enable them; this is done through the BIOS / UEFI menu. Below are some pictures of enabling the features on a Thinkpad T440p, and then the flow through grub2.

Get into the virtualization menu on the system BIOS / UEFI menu.

On Intel hardware this will be labeled as Intel Virtualization Technology and Intel VT-d Feature. For AMD the name is some other flashy similar thing.

Boot into grub and you should now see an option for your distribution with the Xen hypervisor, pick that if you want to go with the defaults, but if instead you want to browse each hypervisor available pick the advanced options.

If you picked the default hypervisor option you should be booting into the Xen hypervisor, and that in turn will boot your kernel / distribution. If you picked the advanced option you'll see the options for the hypervisor as below. In my case I have only the bleeding edge unstable version of the Xen hypervisor from git.

Next it will let you pick the kernel you want to boot your hypervisor with. All of the kernels with support for Xen will be displayed.

After this you should be booting into the Xen hypervisor and this in turn will boot Linux as dom0.

After bootup


Starting xen with old init


First verify you booted into a xen hypervisor first as follows:

mcgrof@garbanzo ~ $ cat /sys/hypervisor/type

You're all set; the next step is to start Xen. On Linux distributions stuck on old init, like Debian right now, you just have to spawn the old init script. This is done as follows:

mcgrof@garbanzo ~ $ sudo /etc/init.d/xencommons start
Starting /usr/local/sbin/oxenstored...
Setting domain 0 name and domid...
Starting xenconsoled...
Starting QEMU as disk backend for dom0

mcgrof@garbanzo ~ $ echo $?

You are ready to start creating guests!

Starting xen with systemd


First thing is to ensure your dom0 is now booted on the xen hypervisor. If you have systemd you can do this easily with:

mcgrof@ergon ~ $ sudo systemd-detect-virt

Under the hood this is the same as the following:

mcgrof@garbanzo ~ $ cat /sys/hypervisor/type

If you only enabled xenstored.socket you can verify the sockets by:

mcgrof@ergon ~ $ sudo netstat -lpn | grep xen
unix  2      [ ACC ]     STREAM     LISTENING     13976  1/init              /var/run/xenstored/socket
unix  2      [ ACC ]     STREAM     LISTENING     13979  1/init              /var/run/xenstored/socket_ro

You can also use systemd:

mcgrof@ergon ~ $ sudo systemctl list-sockets| grep xen
/var/run/xenstored/socket    xenstored.socket             xenstored.service
/var/run/xenstored/socket_ro xenstored.socket             xenstored.service

You can also verify the socket unit:

mcgrof@ergon ~ $ sudo systemctl status xenstored.socket
xenstored.socket - Xen xenstored / oxenstored Activation Socket
   Loaded: loaded (/usr/local/lib/systemd/system/xenstored.socket; enabled)
   Active: active (listening) since Thu 2014-05-15 01:12:53 PDT; 16min ago
   Listen: /var/run/xenstored/socket (Stream)
           /var/run/xenstored/socket_ro (Stream)

May 15 01:12:53 ergon systemd[1]: Starting Xen xenstored / oxenstored Activation Socket.
May 15 01:12:53 ergon systemd[1]: Listening on Xen xenstored / oxenstored Activation Socket.

Next, you can check whether xenstored.service is running; it should not be if you only enabled xenstored.socket and did not enable the service:

mcgrof@ergon ~ $ sudo systemctl status xenstored.service
xenstored.service - Xenstored - daemon managing xenstore file system
   Loaded: loaded (/usr/local/lib/systemd/system/xenstored.service; disabled)
   Active: inactive (dead)

Next, to see the active socket magic trigger, you can just use netcat to tickle either of the sockets. Since the permissions only grant access to the root user, you'll need root to tickle the socket.

mcgrof@ergon ~ $ sudo nc -w 1 -U /var/run/xenstored/socket_ro
mcgrof@ergon ~ $ echo $?

Now verify the xenstored.service is loaded:

mcgrof@ergon ~ $ sudo systemctl status xenstored.service
xenstored.service - Xenstored - daemon managing xenstore file system
   Loaded: loaded (/usr/local/lib/systemd/system/xenstored.service; disabled)
   Active: active (running) since Tue 2014-05-20 04:33:09 PDT; 1 day 16h ago
 Main PID: 1621 (oxenstored)
   CGroup: /system.slice/xenstored.service
           └─1621 /usr/local/sbin/oxenstored --no-fork

May 21 21:24:24 ergon oxenstored[1621]: xenstored is ready

Why you want active sockets


Systemd has support for "active sockets", or "socket based activation", but this concept is not new: socket based activation was pioneered by Apple's Launchd, which was released under the Apache 2.0 license. That project got its first release in 2005, while systemd's initial release dates to 2010. Go and watch Dave Zarzycki's talk at Google about Launchd; there are tons of talks about systemd as well, here's an old introduction talk about systemd by Lennart Poettering, and Lennart does give Apple proper kudos there. Systemd is simply über optimized for Linux, it takes advantage of tons of special Linux kernel enhancements. Socket based activation is ideal for local services over AF_UNIX sockets, although support does exist for inet sockets as well. There are two reasons why you want active sockets:
  1. On demand auto-spawning
  2. Help with bootup parallelization
The on demand auto-spawning can only be taken advantage of by Xen if its tools are converted to try to open the unix socket when they run, but they currently don't do this, and some communication uses the kernel ring interface, not the unix domain sockets. If you use the stubdoms you also never end up using the unix domain sockets. The gains from parallelization however are always welcome: you essentially let systemd figure out how to bring things up by associating dependencies rather than trying to pile things up in a specific strict numbered order, all of which is controlled by the service unit files and the requirements specified in them. Udev lends a hand here as well; udev is now merged as part of the systemd tree, but I'll have to cover udev in another post. If you had an ecosystem that you were sure did not require the service to be spawned up all the time, and you didn't need the kernel ring interface immediately up, you could either enable only the xenstored.socket or remove this section from the xenstored.service:
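As a rough sketch (which exact section was meant is an assumption on my part), the relevant part would be the [Install] section, which is what makes `systemctl enable` hook the service into the boot process; without it, only xenstored.socket pulls the daemon in:

```ini
# Hedged sketch of the section in question, not the actual shipped file.
# Dropping this means the service can only be socket-activated.
[Install]
WantedBy=multi-user.target
```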

A few things are worth noting about daemons and systemd that I do not see covered clearly in the documentation: the exact expectations for the different service types. Systemd supports different types of daemons; for those that don't fork you should declare in your service unit file a type of:
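For example, a minimal sketch of such a unit (the daemon path here is made up; `simple` is also systemd's default when only ExecStart is set):

```ini
[Service]
Type=simple
ExecStart=/usr/local/sbin/mydaemon --no-fork
```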


For daemons that do call fork() you should use the following:
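Again a sketch with a made-up daemon path; PIDFile= helps systemd track the main process after the fork:

```ini
[Service]
Type=forking
PIDFile=/run/mydaemon.pid
ExecStart=/usr/local/sbin/mydaemon
```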


In the legacy init world, this covers most of the daemons out there. There's a bit of a caveat here though: systemd expects you to behave in a certain way if you use Type=forking. Your first parent process should be the one to call sd_notify(); you should not let child processes make the sd_notify() call. What daemons do varies, and the assumption in systemd that daemons spawn sockets on the parent rather than children means some daemons will need a bit of a change in order to work with systemd properly, as there is no way to tell systemd that a child is going to be the main process, even if you try sd_notifyf() with the process ID of the child. Arguably there's a good reason for this though: you should consider using Type=notify, and when you use this type of service you don't fork as part of your daemonizing effort, instead you just tell systemd when your service is ready with sd_notify(). There's a curious architectural design principle worth elaborating on here, which highlights a mistake typically in place in some daemons that do fork. When daemonizing and forking, killing the parent immediately is the easy and fastest way from a programmer's perspective, but it should typically not be done: a regular legacy init that spawns daemons in order will let other processes make use of the daemon under the impression that the daemon is ready, leaving a small window for a race condition to trigger. Typically this is addressed with nasty undocumented workarounds, for example retrying connections to unix domain sockets that the daemon is expected to create after initialization. Mind you, the race window is small but very possible, specially if we want to boot up fast. This is one of the races that systemd services using sd_notify() avoid by design. This is pretty cool.


funk-systemd - example complex systemd daemon 



Apart from corner cases there are also the complexities introduced by the different types of build systems / target systems, specially for projects which really want to support multiple operating systems and init systems, such as Xen. To address different build environments and targets a lot of projects use autotools, and Xen follows this practice, so integrating support for systemd into Xen required proper autotools support. Autotools support with systemd can get complicated fast -- you see, systemd does not allow variable expansion in the ExecStart setting for the binary you wish to run, which means that if your project uses configure to dynamically place the path of the binary you will also need a proper replacement of the paths at configure time. With autotools this is accomplished with the AC_CONFIG_FILES() helper, but in order to make use of some paths with AC_CONFIG_FILES() you'll want to eval them and call AC_SUBST() on them. This is not only useful for ExecStart; also consider the different placements of the socket files. If using ${prefix} for any of the paths you will need to work with the not-so-well documented $ac_default_prefix. You also have to consider the different types of build environments and the different types of target systems that a project wishes to support with a single produced binary daemon. The build environments may vary: a project may wish to force systemd to be present, some may wish to use systemd only if the development libraries are present, and others may wish to require you to specify that you want systemd explicitly. Target systems vary as well; in the worst case scenario a project may wish to support legacy init both with and without the systemd libraries present, and then the case where systemd is the init process.
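A sketch of what the configure.ac side of this looks like (file and variable names here are illustrative, not Xen's actual ones): given a template xenstored.service.in containing ExecStart=@SBINDIR@/xenstored, you expand and substitute the path at configure time:

```m4
# Hypothetical configure.ac fragment. $sbindir usually references
# ${exec_prefix}/${prefix}, so eval twice to get a concrete path;
# ${prefix} defaults to $ac_default_prefix when not given.
if test "x$prefix" = "xNONE"; then
    prefix="$ac_default_prefix"
fi
SBINDIR=`eval echo $sbindir`
SBINDIR=`eval echo $SBINDIR`
AC_SUBST([SBINDIR])
# Generate xenstored.service from xenstored.service.in,
# replacing @SBINDIR@ with the expanded path.
AC_CONFIG_FILES([xenstored.service])
```

The same trick applies to the socket paths in the .socket unit and to the code that connects to those sockets.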
In this situation, if it is desirable to support a single binary for all types of init systems, the dynamic link loader (using dlopen() and dlsym()) can be used, or an in-place replacement for sd_booted() can be implemented instead of calling the systemd helper sd_booted(). A project such as Xen that supports two daemons for the same type of service also needs to consider which route to take for supporting and maintaining service unit files for the different possible daemons. There are different strategies for this. A lot of this is not well documented, and good examples for projects with build systems as complex as Xen's are not readily available, let alone ones that cover all the cases I've described. Because of all this, and since I ended up doing the work for the systemd Xen integration, I made sure to try to generalize a solution that addresses all the types of environments described above, and I have also stuffed in a sample daemon which documents the legacy init corner case that sd_notify() explicitly addresses. You can find the sample code here; the autoconf helpers defined and documented there are also being submitted as part of the Xen systemd integration patches:

To look at an example solution for the legacy init race condition look at the usage of funk_wait_ready(), which is called on the parent process that forks. As for Xen, the legacy init daemon has a retry counter as part of its init script; we should be able to remove that code with a similar solution for the legacy socket implementation. In this tree you will also find a few helpers, which Xen's systemd integration patches make use of, if you want to get ramped up with systemd and autoconf:
  • src/m4/systemd.m4 - systemd autoconf library which enables easy build integration support for systemd. There are four build options supported
    • AX_ENABLE_SYSTEMD() - enables systemd by default and requires an explicit --disable-systemd option flag to configure if you want to disable systemd support.
    • AX_ALLOW_SYSTEMD() - systemd will be disabled by default and requires you to run configure with --enable-systemd to look for and enable systemd
    • AX_AVAILABLE_SYSTEMD() - systemd will be disabled by default but if your build system is detected to have systemd build libraries it will be enabled. You can always force disable with --disable-systemd. This is the option we have decided to use for Xen.
    • If you want to use the dynamic link loader you should use AX_AVAILABLE_SYSTEMD() but must then ensure to use -rdynamic -ldl when linking; if using automake, autotools will deal with this for you, otherwise you must ensure this is in place in your Makefile.
  • src/m4/paths.m4 - Implements AX_LOCAL_EXPAND_CONFIG() which you can use to replace meta @VAR@ variables on files defined with AC_CONFIG_FILES(). You might want to make use of this for example on systemd service unit file ExecStart, on the socket definition file, and/or the code that connects to the sockets.
  • src/funk_dynamic_helpers.c - example systemd integration implementation using the dynamic link loader -- dlopen() and dlsym() -- which can be used for one-binary-fits-all solutions. Although a solution with this strategy was tested for systemd, this is not the option we are going to support on Xen.
  • funk daemon with-autoconf implementation  - example implementation with the above helpers with autoconf support alone
  • funk daemon with-automake implementation - example implementation with the above helpers with automake support
  • README and INSTALL - read these for more details on this example


Systemd support for projects with multiple daemon replacements



Xen is a good example of a project that requires support for multiple alternative binaries that can run as the daemon. For these types of situations there are a few possible solutions; this has only been discussed briefly on the systemd-devel list. You can end up implementing:
  1. Define a service unit file for each daemon, and define one target which represents the overall service. Unit files that require the service will require the target, not the actual service unit file. The service unit files are then mutually exclusive with each other, and the system administrator has to manually select which service unit to enable. The downside to this strategy is you end up with multiple service unit files which in the worst case are identical and differ only in the ExecStart path.
  2. Define a service unit file for each daemon and define an Alias=foo.service for the general service. Services that need to depend on this service would then Require the alias, not the specific service file for each binary. The same downside is present with this solution.
  3. One service file and environment variables used by a binary launcher which will use getenv() and execve() to launch the respective preferred daemon. This option gives the flexibility to be easily compatible with legacy init daemons that typically require /etc/sysconfig/ or /etc/default/ configuration files. Although Lennart has clarified that ideally the systemd way would be to ignore /etc/sysconfig and /etc/default altogether, this solution would still enable ignoring them by requiring the default variable to be set via Environment=FOO_DEFAULT_DAEMON=/usr/local/sbin/bar. For support with legacy init systems, EnvironmentFile=-/etc/sysconfig/foodaemon and EnvironmentFile=-/etc/default/foodaemon can be used.
No example code or service unit files are provided at this point; what we end up doing for Xen remains to be decided.

Ocaml and systemd support


Xen has an OCaml implementation of the xenstore, so as you can imagine we also had to add some support for systemd with OCaml. I won't provide examples here, but just note that support has been provided using a C interface wrapper. For details please review the posted patches.

Tuesday, April 08, 2014

Open Research through collaborative development

Academia helps shape our lives, but it also helps with economics, whether privately or publicly funded, through the exploration of new markets for capital gain, general well being and progress. One aspect of both types of funded research efforts is concern over getting your ideas taken (being "scooped" seems to be the term used) and over not getting any funding at all even if your ideas are very promising. If you follow my blog posts I hope it's clear by now that I am terribly concerned over rapid evolution but am looking for solutions. As a collateral of the Internet and the efforts behind free software and open source software we have spawned new mechanisms that can help research tremendously with rapid progress, one of them obviously being collaborative development models. In this post I will explain and encourage folks to look into a few new areas of development in research and to consider a bit more seriously how they're spending their time and money.

Things have changed quite a bit since the inception of the Internet, and one example of a prominent innovative pioneer who has been very vocal about preparing us for a series of new advances is Ray Kurzweil. For example he's very vocal about expressing concerns over legacy roadmaps at currently established and well known schools such as MIT; it's in fact one of the reasons why we have the spawning of Singularity University, backed by NASA, Google and other partners. If we are going to even start considering how to address "humanity's greatest challenges", which actually was a requirement by Google to back Singularity University (6:54), we need research to be transparent, embracing and shepherding collaborative development models, and even addressing fair use of "Intellectual Property". Fortunately, at the TED talk where Ray announced Singularity University he stated (7:30) that "these projects [which started as intensive group summer sessions in 2009 to address humanity's greatest challenges] will continue past these sessions using collaborative development methods and all the Intellectual Property that is created will be online and available and developed online in a collaborative development fashion". I'm hoping that Google will live up to its promise to ensure that Singularity University lives up to its promise and that any concerns over Intellectual Property will be addressed.

Another new research effort announced recently was the Knight News Challenge. In June 2014  they will award $2.75 million, including $250,000 from the Ford Foundation, to support the most compelling ideas and projects that make the Internet better. A recent entry into the competition addresses the concerns of funding and folks taking your ideas (scooping):
"If everyone knows you were the first to propose (and actually pursue) that idea, anyone who tries to sell it as their own will risk loosing reputation"
Also there are two carrots: 1) for the casino fund (1% funding to this pool by different parties funding legacy research) contributors there are expected research studies on increasing the efficiency of funding research, and 2) for researchers there are new incentives provided by the slew of changes incurred by an open strategy, such as news coverage, public documentation, the ability to socialize ideas for more funding and of course... the gains from public collaborative development. Folks who already get established grants through legacy research could also help by contributing 1% to the funds for research that is promising but unfundable by traditional means. I'm pretty confident Bradley M. Kuhn would cringe at the idea that this research effort seems to be underselling itself by targeting only research in the category of "promising but unfundable by traditional means", as he recently posted about Open Source as a last resort. He'd be right, and the fact that tons of money and interest are pouring into Singularity University through a different approach should be proof that new research using collaborative development models should not be undersold as only for the "promising but unfundable by traditional means". With that said, it doesn't mean that they are restricting submissions only to that category... so any daring researcher with an idea to help spawn "projects that make the Internet better" who is confident or curious about the gains of collaborative development should seriously consider submitting a proposal for evaluation. Two biophysicists have signed up for the competition already, who's next?

The prospects can set great precedents; it's the type of stuff that I think we ultimately need in order to avoid the next big "race", the last one being the atomic race, the next one, in my opinion, likely being the Artificial Intelligence race or collateral from it in light of other research. Another curious thing is that there seems to be an intersection between the folks at Singularity University and the Knight News Challenge, and I'd be curious to know if they have considered... you know, collaborating together. Just a thought ;)

Monday, April 07, 2014

Summary of the gains of Xen oxenstored over cxenstored

Apart from upkeeping the ongoing FOSS projects I help maintain and push forward, one of the first things I've been asked to help with at SUSE is Xen, specifically helping address the huge delta in place against upstream. Before you give me the kvm lecture, realize that I'm very well aware of kvm now, and while architecturally I think it's beautiful, tons of folks are still investing a lot into Xen and even new industries are considering it. As an example, at the Linux Collaboration Summit there was a talk by Alex Agizim about using Xen in the automotive industry by the folks at Global Logic. They prefixed their talk with a great video, the Steeri driverless car parody; the hope is that that's not what things will be like. As we move forward with Xen my goal will also be to see what folks are doing on kvm and see if there might be anything to share or learn from. Before starting at SUSE I knew squat about Xen, so I figure as I ramp up I can help with the documentation as well. Learning about Xen has been fun as it involves tons of areas of the kernel, and the history is very rich. As I ramp up I intend to help with the documentation on its wiki. As a collateral of dealing with the delta against upstream and with documentation, at times I may look for better ways to do things, specially if it reduces our delta or can help the project, or at the very least I can socialize the ideas for a future feature enhancement. Apart from helping on the wiki, which I think is critical, I'll try to post things every now and then about parts of its architecture which perhaps don't yet belong on the wiki, or I may post things first here and then go curate them over into the wiki. I've now sent a few people my brain dump of Thomas Gazagnaire and Vincent Hanquez's paper (both at Citrix) on their OCaml implementation of a xenstore called oxenstored.
I will likely want to point more folks to this summary later, given that I'm actually also interested in alternatives and I don't expect folks to read a full paper to evaluate them. I'm not going to get into the specifics of what I hope to see in alternatives now, other than mentioning that this came about in discussions at the Linux Collaboration Summit in Napa and that it involves git. In this post I'll just cover the basics of the xenstore, a review of the first implementation, and a summary of oxenstored.

The paper: OXenstored - An Efficient Hierarchical and Transactional Database using Functional Programming with Reference Cell Comparisons

First, a general description of the xenstore and its first implementation. The xenstore is where Xen stores the information about its systems. It covers dom0 and guests, and it uses a filesystem type of layout, kind of how we keep a layout of a system in the Linux kernel's sysfs. The original xenstored, which the paper refers to as Cxenstored, was written in C. Since all information needs to be stored in a filesystem layout, any library or tool that supports designing a tree with a key <--> value store of information should suffice to upkeep the xenstore. The Xen folks decided to use the Trivial Database, tdb, which as it turns out was designed and implemented by the Samba folks for their own database. Xen then has a daemon sitting in the background which listens to read / write requests to this database; that's what you see running in the background if you 'ps -ef | grep xen' on dom0. dom0 is the first host, the rest are guests. dom0 uses Unix domain sockets to talk to the xenstore, while guests talk to it through the kernel via the xenbus. The code for opening a connection to the C version of the xenstore is in tools/xenstore/xs.c and the call is xs_open(). The code will first attempt to open the Unix domain socket with get_handle(xs_daemon_socket()) and if that fails it will try get_handle(xs_domain_dev()); the latter will vary depending on your operating system, and you can override the first by setting the environment variable XENSTORED_PATH. On Linux this is at /proc/xen/xenbus. All the xenstore is doing is brokering access to the database. The xenstore represents all data known to Xen; we build it upon bootup and can throw it out the window when shutting down, which is why we should just use a tmpfs for it (Debian does, OpenSUSE should be changed to it). The actual database for the C implementation is by default stored under the directory /var/lib/xenstored; the file that holds the database there is called tdb.
On OpenSUSE that's /var/lib/xenstored/tdb, on Debian (as of xen-utils-4.3) that's /run/xenstored/tdb. The C version of the xenstore therefore puts out a database file that can actually be used with tdb-tools (the actual package name on both Debian and SUSE). xenstored does not use libtdb, which is GPLv3+; Xen in-takes the tdb implementation, which is licensed under the LGPL, and carries a copy under tools/xenstore/tdb.c. Although you shouldn't be using tdb-tools to poke at the database you can still read from it using these tools; you can read the entire database as follows:
 tdbtool /run/xenstored/tdb dump
The biggest issue with the C implementation and its reliance on tdb is that you can live lock it if you have a guest or any entity doing short quick accesses to the xenstore. We need Xen to scale though, and the research and development behind oxenstored was an effort to help with that. What follows next is my brain dump of the paper. I don't get into the details of the implementation because, as can be expected, I don't want to read OCaml code. Keep in mind that if I look for a replacement I'm also looking for something that the Samba folks might want to consider.

OXenstored has the following observed gains:
  • 1/5th the size in terms of lines of code in comparison to the C xenstored
  • better performance as the number of guests increases; it supports 3 times the number of guests, for an upper limit of 160 guests
The performance gains come from two things:
  • how it deals with transactions through an immutable prefix tree. Each transaction is associated with a triplet (T1, T2, p) where T1 is the root of the database just before a transaction, T2 is the local copy of the database with all updates made by the transaction made up to that point, p is the path to the furthest node from the root T2 whose subtree contains all the updates made by the transaction up that point.
  • how it deals with sharing immutable subtrees and uses 'reference cell equality', a limited form of pointer equality, which compares the locations of values instead of the values themselves. Two values are shared if they have the same location. Functional programming languages enforce that multiple copies of immutable structures share the same location in memory. oxenstored takes advantage of this functional programming feature to design a trie library which enforces sharing of subtrees as much as possible. This lets them simplify how to determine and merge / coalesce concurrent transactions.
The complexity of the algorithms used by oxenstored is confined only to the
length of the path, which is rarely over 10. This gives predictable performance
regardless of the number of guests present.