The Kernel Virtual Machine (KVM) is a quick and easy way to run other operating systems inside Linux. In particular, I find it much easier to use than VMware.

Set Up KVM

KVM is already compiled into the Feisty kernel. Before running KVM, you just need to install the userspace tools and load the module that matches your processor:

  $ sudo apt-get install kvm
  $ sudo modprobe kvm-intel   # or kvm-amd on AMD processors

You should also add the appropriate module to /etc/modules so it loads at boot. If you run KVM without the module loaded, it prints a warning that kernel acceleration could not be used and then runs the guest at 1/5 speed or less. It's painful.
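Which module you need follows from your CPU's flags in /proc/cpuinfo: Intel's VT shows up as the vmx flag, AMD's AMD-V as svm. Here's a small sketch (the helper function is made up for illustration) that maps a flags line to the right module:

```shell
# Map a CPU "flags" line to the matching KVM module: vmx means
# Intel VT (kvm-intel), svm means AMD-V (kvm-amd).
pick_kvm_module() {
    case " $1 " in
        *" vmx "*) echo kvm-intel ;;
        *" svm "*) echo kvm-amd ;;
        *)         echo none ;;
    esac
}

# On a real machine, feed it your own flags line:
pick_kvm_module "$(grep -m1 '^flags' /proc/cpuinfo)"
```

If it prints none, your CPU lacks hardware virtualization and KVM's kernel acceleration won't work.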

Installation

This section describes how to create complete disk images from scratch.

Create a Blank Image

  $ qemu-img create -f raw vdisk.raw 10G

Raw files are sparse: even though it looks like a 10G file to ls, it initially takes up no room whatsoever on your disk (ls shows the logical size of the file; du shows the actual number of disk blocks it occupies). Usable formats are:

  • raw: upside: fastest, simplest, easily mountable over loopback. downside: no encryption or compression possible, very wasteful if your filesystem doesn't support holes, and wasteful to transfer over the network unless compressed first.
  • qcow: upside: supports encryption and compression, can be used on all filesystems (doesn't require holes). downside: can't be mounted on the host filesystem.
  • cow: UML COW format. Old format, does not work on Win32.
  • vmdk: VMware 3 and 4 compatible.
  • cloop: Linux compressed loop, the compressed CD-ROM image format used by e.g. Knoppix.

Working with raw images

A raw image always appears to be its full size, so the 10G image above looks like a 10 GB file. However, sectors on your physical disk aren't allocated until the virtual machine actually writes to them.

  bronson@eva:~/lb$ ls -lh vdisk.raw
  -rw-r--r-- 1 bronson bronson 10G 2007-01-30 00:07 vdisk.raw
  bronson@eva:~/lb$ du -hs vdisk.raw
  0       vdisk.raw

Or, use the qemu-img info command to describe the disk image:

  bronson@eva:~/lb$ qemu-img info vdisk.raw
  image: vdisk.raw
  file format: raw
  virtual size: 10G (10737418240 bytes)
  disk size: 0
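You can see the same sparse-file behavior without qemu-img at all; any tool that seeks past end of file leaves a hole. A quick demonstration (the file name is arbitrary, and the zero-allocation result assumes a filesystem that supports holes):

```shell
# Create a 10M file that is one big hole: seek to the 10M mark and
# write zero bytes. The filesystem records the size but allocates
# no data blocks.
dd if=/dev/zero of=sparse.img bs=1 count=0 seek=10M 2>/dev/null

ls -l sparse.img | awk '{print $5}'   # logical size: 10485760
du -k sparse.img                      # allocated blocks: 0 (or close to it)

rm -f sparse.img
```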

Converting images to qcow

As you fill up the virtual disk, of course, more and more physical storage will be allocated to the raw file. If disk space is an issue, convert the disk image to a compressed qcow file:

  $ qemu-img convert -c img.raw -O qcow img.qcow

This compresses every sector in the disk image. Note that whenever the guest writes a sector, it is rewritten uncompressed, so a compressed image gradually grows with use. To prevent writes from becoming permanent at all, run kvm with -snapshot set.

A normal Ubuntu install uses 500MB to 2GB of storage, depending on whether you install a full X environment or not. Right now the easiest way to perform a minimal install is to use the Ubuntu Server install disk.

Using your local filesystem

You can specify vvfat:dir to present a directory on your local filesystem to the guest as a virtual FAT disk. There's no need to create a disk image or dedicate a physical partition. For example:

   -hda vvfat:/tmp/dir

will cause /tmp/dir to be the root of the emulated hda disk.

TODO: how stable is vvfat? Is it read/write?

Using physical disks

It's handy to be able to boot a physical partition. qemu supports this via, e.g., -hda /dev/hda. The problem is that this gives qemu free rein over all partitions on hda, which is probably not what you want. One way around this is to specify -hda part:/dev/hda3, which makes qemu present the hda3 partition as the entire hda disk in the virtual machine. This is a work in progress, however. Another, less useful approach is to actually install a partition table at the beginning of the /dev/hda3 partition, but then the physical machine can no longer use it. So, if part:/dev/hda3 works, use it. Otherwise, give qemu access to the entire disk and pray that the guest doesn't stomp on the wrong partition.

Install from CD-ROM

This command boots the disk in the host's CD-ROM drive in the virtual environment.

  $ sudo kvm -hda vdisk.qcow -cdrom /dev/cdrom -boot d -m 256 -smp 1

NOTE: most distributions are adding a 'kvm' group. Instead of running the previous command as the superuser, try adding your user to that group.

It's probably easier just to specify an .iso file on your hard drive ("-cdrom ~/Desktop/ubuntu-6.10-server-i386.iso"). -m tells how much memory to use, in megabytes; 256 is a safe minimum. You can emulate multiple processors by raising the -smp argument.

If you install the Ubuntu Server minimal image, you can then get rid of a bunch of packages you probably won't need:

  dpkg --purge ubuntu-minimal alsa-base alsa-utils linux-sound-base libasound2 aptitude \
            dosfstools eject ethtool hfsplus hfsutils jfsutils \
            libhfsp0 libiw28 wireless-tools memtest86  mii-diag ntpdate pcmciautils \
            reiser4progs reiserfsprogs tcpd usbutils wpasupplicant xfsprogs pciutils \
            netbase

TODO: there's probably a bunch more we could purge.

Install from debootstrap

NOTE: I've gotten pretty far but I haven't gotten this to work yet. The difficulty is that you need to boot the VM to finish the install, and how are you going to boot the VM if the guest doesn't have a kernel? The answer: you need to supply your own kernel, no getting around it. Is there any way to get debootstrap to provide a kernel/initrd combo ready for -kernel and -initrd?

You may want to change the architecture (i386), the distribution (feisty), and the mirror in the following command line. Also, don't pass --foreign. TODO: figure out how to postpone package setup until we're in the virtual machine. That way these instructions will work on all architectures.

  $ mkdir deboot-i386
  $ sudo debootstrap --foreign --arch i386 feisty deboot-i386 http://us.archive.ubuntu.com/ubuntu
  $ sudo rm -rf deboot-i386/var/cache/apt/archives/

Unlike jail-style setups (Linux-VServer, OpenVZ), kvm can't boot straight into a subdirectory. Therefore, we must prepare a disk image for it.

  $ qemu-img create -f raw deboot-i386.img 200M
  $ mkfs.ext3 -F deboot-i386.img
  $ mkdir mnt
  $ sudo mount -o loop deboot-i386.img mnt
  $ sudo cp -aR deboot-i386/* mnt

Creating the filesystem grows the image file to 11M, and copying in the files grows it to 53M.

Now we need to place a kernel and a boot block onto the disk image. NOTE: the following is a failed experiment.

NOTE: to continue with these directions, don't pass --foreign to debootstrap above! That's a problem, because I want to be able to bring up foreign architectures too.

  $ sudo chroot mnt  # TODO: should use chrootuid here
  $ apt-get install linux-image-server
  $ exit

Try to launch with local kernel/initrd

And launch our newly-created disk image with the installed kernel and initrd.

  $ sudo kvm -kernel /vmlinuz -initrd /initrd.img \
         -hda deboot-amd64.img -append root=/dev/hda -snapshot -m 256

Except that it hangs when it tries to read the initrd.

  TCP reno registered
  checking if image is initramfs...it isn't (bad gzip magic numbers); looks like an initrd
  Freeing initrd memory: 7934k freed

Then it hangs. This guy ran into the same thing.

Trying the same thing without kvm extensions:

  $ sudo kvm -no-kvm -kernel /vmlinuz -initrd /initrd.img \
         -hda deboot-amd64.img -append root=/dev/hda -snapshot -m 256

Just hangs forever, eating up 100% CPU, at "Booting from Hard Disk".

Try to launch with new kernel/initrd

  $ sudo kvm -kernel deboot-amd64/vmlinuz -initrd deboot-amd64/initrd.img \
         -hda deboot-amd64.img -append root=/dev/hda -snapshot -m 256

Hey, a panic!

 unhandled vm exit: 0x8
 rax 0000000000000000 rbx 0000000000000000 rcx 0000000000000000 rdx 0000000000000000
 rsi 0000000000000000 rdi 000000000000062a rsp 0000000000000000 rbp ffff81000062b000
 r8  0000000000000001 r9  0000000000000000 r10 ffff81000062a000 r11 0000000000001000
 r12 000000000000062a r13 000000000062b000 r14 ffff81008062b000 r15 ffff810000c0e304
 rip 000000000000fff0 rflags 00000002
 Aborted (core dumped)

Maybe that's not too much of a surprise since who knows if the installed kernel package managed to prepare its initrd properly. But I think kvm should contain its surprise a little better.

Running the previous command with -no-kvm just hangs at "Booting from Hard Disk" as before.

TODO: add a kernel

  add --include=linux-image-server,grub to debootstrap?

TODO: add a whole bunch of --excludes to debootstrap.

Install Windows

Create a file for the virtual disk drive. If you intend to ever use this machine on Windows, a qcow image uses disk space efficiently on both Windows and Linux. Raw devices are slightly faster and simpler on Linux, and more interoperable, but they take up the full 6G right from the start on Windows. If there's any doubt which format is best for your needs, go ahead and use qcow. You can always convert later using "qemu-img convert" (see above).

  qemu-img create -f qcow windows.img 6G

Install Windows from CD. If you're installing Win2K, add -win2k-hack to work around a Win2K installer bug. Right now -no-acpi is required to work around qemu's ACPI implementation being only partially ported to KVM's virtual kernel. See the KVM FAQ for more recent news and a workaround if -no-acpi won't work in your case.

  kvm -no-acpi -m 512 -cdrom /dev/cdrom -hda windows.img -boot d

Mounting a Disk Image

NOTE: never ever EVER mount a disk image that is already in use. Generally there's no locking or warning and you are almost certain to corrupt it.

If your disk image is raw, and the entire thing is a filesystem (i.e. you didn't partition the disk image), you can just mount it using the loopback device:

 $ mkdir mnt
 $ sudo mount -o loop img.raw mnt
 $ cd mnt

And play with the disk image. When you're done:

 $ sudo umount mnt
 $ rmdir mnt

If you get an error about mnt still being busy, it means you haven't closed everything that was working inside the mounted volume. Usually it's a terminal you forgot to close.

If the disk image isn't raw, first convert it using "qemu-img convert" (see above).

To mount a partition within a disk image, you need to supply the offset. For the first partition, the offset will almost certainly be 63*512=32256:

  $ sudo mount -o loop,offset=32256 ha-1.2.raw mnt

If you want to mount another partition, you need to use fdisk to calculate the partition's location. NOTE: if you have the program "kpartx" available, you can just run "kpartx -a img.raw" and kpartx will populate /dev/mapper with entries for the individual partitions.

  $ sudo fdisk img.raw

Type x<return>, then p<return>. You'll get a table like:

Disk ha-1.2.raw: 255 heads, 63 sectors, 0 cylinders

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl     Start      Size ID
 1 80   1   1    0 254  63  244         63    3935862 83
 2 00   0   1  245 254  63  260    3935925     257040 05
 5 00   1   1  245 254  63  260         63     256977 82
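To turn a Start value from this table into a mount offset, multiply by the 512-byte sector size (this just mechanizes the 63*512=32256 arithmetic above):

```shell
# mount's offset= option wants bytes; fdisk reports 512-byte sectors.
sector_size=512
start=63                        # "Start" column for partition 1
echo $((start * sector_size))   # prints 32256
```

So partition 1 above would be mounted with "sudo mount -o loop,offset=32256 ha-1.2.raw mnt". Note that in this expert-mode table, logical partition 5's Start of 63 is relative to its enclosing extended partition, not to the start of the disk.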

TODO: show an example of how to do this with extended partitions.

Or, even easier, the file command will show you where the primary partitions are located:

 $ file ha-1.2.raw 
 ha-1.2.raw: x86 boot sector;
   partition 1: ID=0x83, active, starthead 1, startsector 63, 3935862 sectors;
   partition 2: ID=0x5, starthead 0, startsector 3935925, 257040 sectors, code offset 0x48

Networking

Networking is hard. KVM networking can be even harder because of the difficulties of connecting guests to the host and to the outside world. This section presents a few common networking scenarios. If it doesn't address your needs, the existing qemu documentation should be relevant to KVM as well.

The one good thing about virtual networking is that there are no problems with loose connections, kinked wires, or crossover cables.

Terminology

VLAN: a virtual network segment. Usually you can picture it as just a virtual switch. You plug virtual NICs into VLANs, and wire VLANs together, and can ultimately create a switch fabric that very closely matches real-world setups.

Just moving files between host and guest

qemu can provide a TFTP server on the host. Just launch qemu with -tftp /dir/to/export. Then, from the guest, you can tftp to 10.0.2.2 to get and put files. Make sure you switch between text and binary mode properly.

Default, Usermode Networking

If you don't specify any networking options, KVM by default creates a NIC connected to a private VLAN. On this VLAN it also emulates a single host that acts as a DHCP server and default router. Any connections initiated by the guest are routed through KVM's private stack and appear to the host computer as requests coming from sockets opened by the KVM process on 127.0.0.1.

KVM's usermode networks typically contain only two addresses: 10.0.2.2 (the virtual host) and 10.0.2.15 (the guest).

Here it is in KVM's language:

 $ kvm -net nic -net user ...

In other words: add a NIC and connect it to VLAN 0 (the default). Also add a virtual host connected to VLAN 0. The virtual host connects the VLAN via NAT to the physical host.

Because usermode networking is implemented with SLIRP (a userspace TCP/IP stack), UDP is not supported. If you need to move UDP packets to the host or the outside world, you will need to use a different technique.

Connecting VLANs to Each Other

This section describes how to connect virtual machines to each other.

Multiple NICs / VLANs

What if you want to set up a virtual machine with multiple NICs? You need to create multiple VLANs to plug the NICs into. To create more than one VLAN, assign each VLAN a unique ID. For instance, this will create two NICs and two VLANs:

  $ kvm -net nic,vlan=0 -net socket,listen=:8010,vlan=0 -net nic,vlan=1 -net user,vlan=1 ...

Socket Connections

So, I can connect my virtual machines to private VLANs. How do I connect those VLANs together? The easiest way is probably regular TCP sockets. One VLAN must be the listener:

  $ kvm -net nic -net socket,listen=:8010 ...

and the other VLAN must be the initiator:

  $ kvm -net nic -net socket,connect=127.0.0.1:8010 ...

If you don't specify an address, the listener will listen on all connected interfaces. To only listen on localhost, specify this:

  $ kvm -net nic -net socket,listen=127.0.0.1:8010 ...

This, of course, also allows you to connect VLANs running on different hosts.

Multicast Sockets

If you want to connect multiple initiators to a single listener, you must use a multicast socket.

  $ kvm -net nic -net socket,mcast=230.0.0.1:1234 ...
  $ kvm -net nic -net socket,mcast=230.0.0.1:1234 ...
  $ kvm -net nic -net socket,mcast=230.0.0.1:1234 ...

That connects 3 different VLANs at the same point. Frames sent on any VLAN will be received by all others.

VDE

Multiple VLANs can also be connected to a single VDE. VDEs are described further in Advanced Networking below. TODO: introduce VDEs here, provide examples.

Connecting VLANs to the Host

This section describes how to connect virtual machines to real networks. We can now create and wire up a huge virtual switching fabric but it's not much use if we can't connect it to the outside world.

The Tap Device

Most (all?) virtual-to-physical connections are made through a tap device. Tap devices are regular network interfaces, not any different from eth0, eth1, lo, etc. One end of the tap is connected to the VLAN, the other end is configured and routed using regular networking tools (ifconfig, route, etc).

  $ kvm -net nic -net tap ...

That command creates a new, unique tap ethernet device (tap0, tap1, etc). The /etc/qemu-ifup script is used to provision the new network device; the default /etc/qemu-ifup simply assigns the new interface the IP address 172.20.0.1. You can specify an explicit interface name using ifname=IF, and a different script to run using script=SCRIPT, like this:

  $ kvm -net nic -net tap,ifname=qtap0,script=/var/vm/vm0.ifup
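For reference, such a script does very little. Here's a dry-run sketch (the function name is made up, and echo stands in for the real ifconfig call so it can run without root):

```shell
# What a qemu-ifup-style script does: kvm passes the new tap interface
# name as the script's first argument, and the script configures the
# host side of the tap. 172.20.0.1 matches the default /etc/qemu-ifup.
ifup_tap() {
    # a real script would run:  ifconfig "$1" 172.20.0.1 up
    echo "ifconfig $1 172.20.0.1 up"
}

ifup_tap tap0   # prints: ifconfig tap0 172.20.0.1 up
```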

Taps cleanly solve the networking problem for a single virtual machine. Unfortunately, each guest requires its own tap device. As you might imagine, this gets unwieldy fast.

Advanced Networking

So, how can we run an arbitrary number of virtual machines, all able to talk to each other and the outside world? Alas, there are a huge number of different ways to solve this, all with their own benefits and drawbacks (that's why network engineers get paid the big bucks). Here are some common techniques.

Virtual NICs on VDE, VDE Tap'd to Host, Tap NATed to Outside

This allows guests to initiate connections with each other, the host, and the outside world. It also allows the host to initiate connections with any guest. It doesn't allow the outside world to initiate connections with guests, however (although you could manually proxy the connections through the host using kvm's -redir, port forwarding, or ssh -L). It's relatively unobtrusive to set up; you don't need to modify the host's network configuration at all.

These steps show how to test out this type of network, but not how to make it persistent. You will have to run these programs manually every time your machine boots.

  • $ sudo sh -c "echo 1 > /proc/sys/net/ipv4/ip_forward"
  • add "tun" to /etc/modules. Also run sudo modprobe tun.
  • $ sudo apt-get install vde dnsmasq
  • $ sudo /etc/init.d/dnsmasq stop
  • $ sudo vde_switch -tap qtap0 -daemon. Now vde_switch is listening on /tmp/vde.ctl (use -socket PATH to specify where to put the socket).
  • $ sudo ifconfig qtap0 10.111.111.254 broadcast 10.111.111.255 netmask 255.255.255.0 up
  • $ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE. Make sure to replace eth0 with the interface holding your default route. (TODO: is there any way to automate this? Is there a command that would work no matter what device contains the default route?)
  • $ sudo dnsmasq --log-queries --user=nobody --dhcp-leasefile=/var/tmp/dnsmasq-leasefile --dhcp-range=10.111.111.129,10.111.111.199,255.255.255.0,10.111.111.255,8h --interface=qtap0 --domain=qemu.lan -d (TODO: tell how to configure /etc/dnsmasq.conf to do this)
  • $ sudo vdeq kvm -hda v2.qcow -boot c -net nic -net vde -m 192
  • In the guest, put nameserver 10.111.111.254 into /etc/resolv.conf. Also, check that DHCP gave it a sane IP address.
  • In the guest, ping 10.111.111.254 should work. In the host, ping 10.111.111.140 (or whatever the guest's IP address is) should work. If so, the tap device works great.
  • Now, from the guest, try pinging an external IP address. If that works, then masquerading works. Now try pinging an external domain name, like google.com. If that works, congratulations, dnsmasq works and everything should be set up correctly.

TODO: Tell how to make these settings permanent. We need to copy the dnsmasq command line into /etc/dnsmasq.conf and launch vde when /etc/network/interfaces brings up qtap0, but how do we automate adding the MASQUERADE rule? I suppose we need to add and remove it from /etc/network/interfaces as well.

Cribbed from the excellent http://alien.slackbook.org/dokuwiki/doku.php?id=slackware:vde

Virtual NICs Bridged Directly to Outside

This technique replaces the host's default network interface with a bridged connection. When you connect guest VLANs to the bridge, they appear to the external network exactly as if they were real. It's the most reliable way to make guests appear to be actual physical machines on the network, but it's also harder to set up and somewhat intrusive.

Be careful! This technique makes your virtual interfaces visible all over the office. For instance, make very sure you give each virtual interface a valid MAC address!

TODO: convert these to Feisty:

http://compsoc.dur.ac.uk/~djw/qemu.html http://kidsquid.com/cgi-bin/moin.cgi/bridge

Performance

memtest86 gives a decent measure of memory bandwidth. Running under qemu gives ~200MB/sec, running under kvm gives ~1220MB/sec. So, kvm runs about 2X slower than the actual memory bandwidth (2320 MB/sec), and qemu is more than 10X slower.
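Checking the arithmetic on those figures:

```shell
# Slowdown relative to native memory bandwidth (2320 MB/sec), using
# the memtest86 numbers quoted above.
awk 'BEGIN {
    printf "kvm:  %.1fx slower\n", 2320 / 1220   # about 1.9x
    printf "qemu: %.1fx slower\n", 2320 / 200    # 11.6x
}'
```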

There's very little difference between 32-bit and 64-bit machines; 64-bit guests tend to be about 6.5% larger than the equivalent 32-bit machine. TODO: is there a measurable performance difference?

To Do

What's the best way to handle swap in a virtual machine? I could partition my disk images and donate part of each image to swap, but it would be a shame to have inactive swap stores scattered all over my hard drive. Better would be to set up some sort of tmpfs specifically for the guest that automatically goes away when the guest quits. Has anybody automated this?