Publishing virtual machines

We recently had a requirement to publish a virtual machine through the data repository - combining the software to process the data (Xubuntu, Python, R etc.) and the data itself. This allows all necessary libraries to be published in consistent versions so that users can just get on with using the data. All users need is a copy of VirtualBox and to install the image.

All well and good. However, the uncompressed virtual machine was 7.4 GB, which is well above the 2 GB limit that DSpace can cope with. Compressing it with zip improves the situation somewhat - creating a 4.2 GB archive. This is still "a miss is as good as a mile" territory as far as publishing the data is concerned. This post explains how we got that down to 1.7 GB with minimal changes to the useful content of the archive.

It is possible to compact a dynamically expanding VirtualBox image using the VBoxManage utility. This shrinks the image by removing any blocks that contain nothing but zeroes, which sounds like what we need. There are two options that support this:

  • VBoxManage modifyhd <disk image> --compact

  • VBoxManage clonehd <input disk image> <destination disk image>

The modifyhd --compact option will only compact VirtualBox VDI images; clonehd can operate on other image formats using the --format option.
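The choice between the two options can be sketched as a small host-side helper. This is only an illustration: the run and shrink_image names, the DRY_RUN switch and the "-compact.vdi" output naming are assumptions made for the example, not part of VirtualBox. By default it only prints the command it would run.

```shell
#!/bin/sh
# Print the command (DRY_RUN=1, the default here) or actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Shrink a dynamically allocated disk image.
# VDI images are compacted in place; anything else is cloned to a new VDI.
shrink_image() {
    image=$1
    case "$image" in
        *.vdi|*.VDI)
            run VBoxManage modifyhd "$image" --compact
            ;;
        *)
            # Hypothetical naming scheme: foo.vmdk -> foo-compact.vdi
            run VBoxManage clonehd "$image" "${image%.*}-compact.vdi" --format VDI
            ;;
    esac
}
```

Calling shrink_image disk.vdi prints the modifyhd command; setting DRY_RUN=0 first would execute it instead.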

However, remember that this only removes blocks which are all zeroes. Usually when files are modified/deleted, blocks are simply marked as unused rather than being overwritten with zeroes - so there will be unused space in the disk image which contains non-zero data. We therefore need to overwrite these unused areas with zeroes.

Fortunately, suitable utilities exist to put zeroes where we need them.

  • On Ubuntu, we can use zerofree, but need to boot the VM in recovery mode so that the virtual (boot) disk can be modified

  • On Windows, we can use sdelete

The zerofree utility is included in Parted Magic. Parted Magic is distributed as an ISO disk image for installing on a Live CD or USB drive. However, with VirtualBox we can simply mount the ISO image on the virtual machine's logical CD drive and then start the machine - this will boot into Parted Magic and allow us to run zerofree.

  • Start Parted Magic VM

  • Launch a terminal session

  • zerofree /dev/sda1 or whatever your disk devices are (ls /dev/sd* for hints)

  • Log out of Parted Magic and shut down the (virtual) computer

  • Run VBoxManage modifyhd <image file> --compact

An example VDI had an 8 GB dynamically allocated disk image which occupied 3.86 GB of hard disk space. This reduced to 3.4 GB after zerofree and VBoxManage modifyhd <image> --compact. Another virtual system had a 32 GB drive (16 GB actual) and a 16 GB drive (5.4 GB actual). Sadly the larger of these disks was still 16 GB after tidying up (which I should have expected as df showed that 15 GB was in use!), but the smaller reduced to 2.5 GB.

Zeroing unused blocks and then compacting the VDI for our real image, we got the zip file down to 2.8 GB - a third less space required!

We also managed to free another 400 MB by tidying up the OS before compacting the image - remembering to zero the freshly freed space after the tidy up! In particular, we cleared out the APT package library:

  • apt-get remove <package> - remove packages which aren't relevant (e.g. LibreOffice)

  • apt-get autoremove - clear out packages that were installed as dependencies but are no longer needed

  • apt-get autoclean - clear out cached package files that can no longer be downloaded

To check the space used by APT, use du:

  • du -sh /var/cache/apt

A quick glance at my desktop machine reveals 153 MB of packages on here, and a total of 225 MB of APT data. As I have a high-speed network connection, there's really no need to keep those files locally. Curiously, autoremove would happily "remove" 43 MB of files, but autoclean had no effect on the repository size.

The deborphan utility found 86 packages on the system that nothing else depends on, which were then removed using:

  • sudo deborphan | xargs sudo apt-get -y remove --purge

As a more serious option, we can use:

  • apt-get clean - clear out all package files from the local repository

This reduced the space used on the desktop machine from 225 MB to 452 KB. A pretty decent space saving.
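To put numbers on a clean-up like this, the cache size can be measured before and after the apt-get calls. The following is a rough sketch rather than the exact procedure used here: dir_kib, kib_freed and tidy_apt are hypothetical helper names, and tidy_apt needs root privileges when it is actually called.

```shell
#!/bin/sh
# Size of a directory tree in KiB (du -k reports 1024-byte units).
dir_kib() {
    du -sk "$1" | cut -f1
}

# KiB freed by a clean-up, given the before and after sizes in KiB.
kib_freed() {
    echo "$(( $1 - $2 ))"
}

# Hypothetical wrapper: measure the APT cache, clean it, measure again.
# Only defined here, not called; run it as root inside the guest.
tidy_apt() {
    before=$(dir_kib /var/cache/apt)
    apt-get -y autoremove
    apt-get clean
    after=$(dir_kib /var/cache/apt)
    echo "APT cache: ${before} KiB -> ${after} KiB ($(kib_freed "$before" "$after") KiB freed)"
}
```

Taking 225 MB as 230400 KiB, kib_freed 230400 452 gives 229948 KiB freed for the figures quoted above.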

Additionally, your system may contain many locale-specific options. Using du -sh /usr/share/locale reveals 160 MB of data on this desktop machine. Installing the localepurge utility allows your system to only keep locale information which is relevant for you - you specify which locales to keep and it removes other information.

localepurge: Disk space freed in /usr/share/locale: 157556 KiB
localepurge: Disk space freed in /usr/share/man: 6352 KiB
localepurge: Disk space freed in /usr/share/gnome/help: 42872 KiB
localepurge: Disk space freed in /usr/share/omf: 1800 KiB
localepurge: Disk space freed in /usr/share/doc/kde/HTML: 1180 KiB

So that's my /usr/share/locale down to less than 8MB of data plus another 50MB-ish of other locale-specific information removed. So about 200 MB of locale-specific data discarded.
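The "about 200 MB" figure can be checked by summing the localepurge lines. A minimal sketch, assuming the output format shown above (the size in KiB as the second-to-last field); sum_freed_kib is a made-up helper name:

```shell
#!/bin/sh
# Sum the "Disk space freed" figures (in KiB) from localepurge output on stdin.
sum_freed_kib() {
    awk '/Disk space freed/ { total += $(NF - 1) } END { print total + 0 }'
}

# The five lines reported above.
sum_freed_kib <<'EOF'
localepurge: Disk space freed in /usr/share/locale: 157556 KiB
localepurge: Disk space freed in /usr/share/man: 6352 KiB
localepurge: Disk space freed in /usr/share/gnome/help: 42872 KiB
localepurge: Disk space freed in /usr/share/omf: 1800 KiB
localepurge: Disk space freed in /usr/share/doc/kde/HTML: 1180 KiB
EOF
```

The total is 209760 KiB, or roughly 205 MiB.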

Tidying up the disk before zeroing the empty space, compacting the VDI and zipping it down saved a further 400 MB, getting us down to 2.4 GB. So near, but yet... still not enough.

What else can we do? Fortunately we found an answer. The standard Ubuntu zip utility is fast and does a decent job. However, it's not the most aggressive compression utility out there. A much better compression ratio can be achieved by using 7-Zip. This is more compute-intensive, but for releasing an item it's a worthwhile trade-off. (Core 7-Zip is a Windows application; the Ubuntu/Debian package for a command-line version of 7-Zip is p7zip, which can be installed using apt-get; for Mac OS X, Keka uses the 7-Zip engine.) Using 7-Zip we got the 2.4 GB archive down to 1.7 GB. Finally a file small enough to get onto DSpace!

So, the complete process involves:

  • Remove any unnecessary software and configuration data (e.g. on Linux remove package and locale-specific information using deborphan, apt-get clean, localepurge)

  • If it's Windows, then defragment the disk image

  • Zero any empty blocks (using zerofree on Linux, sdelete on Windows)

  • Compact the tidied image (using VBoxManage)

  • Compress the image (using 7-Zip)
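The last two steps run on the host and could be scripted along these lines. This is a sketch under assumptions: the guest has already been tidied and zeroed, the image is a VDI, and a 7z binary is on the path (on Debian/Ubuntu this typically comes from the p7zip-full package). The run and publish_image names and the DRY_RUN switch are invented for the example, and by default the commands are only printed.

```shell
#!/bin/sh
# Print the command (DRY_RUN=1, the default here) or actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Compact a zeroed VDI, then archive it with maximum 7-Zip compression.
# The archive name defaults to the image name with a .7z extension.
publish_image() {
    image=$1
    archive=${2:-${image%.*}.7z}
    run VBoxManage modifyhd "$image" --compact
    run 7z a -mx=9 "$archive" "$image"
}
```

For example, publish_image vm.vdi prints the compact command followed by the 7z command that would create vm.7z.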

There is a question as to whether it's actually worthwhile compacting the disk image after the free space is zeroed - zipping the file should produce a similar size archive either way. However, for an end-user a smaller footprint for the virtual machine has to be better.
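The reason the archive size should be similar either way is that long runs of zeroes compress to almost nothing, whereas leftover non-zero data does not. A small demonstration, using gzip as a stand-in for zip or 7-Zip and random bytes as a stand-in for stale data in unused blocks (compressed_bytes is a made-up helper name):

```shell
#!/bin/sh
# Compressed size in bytes of COUNT KiB read from a device such as /dev/zero.
compressed_bytes() {
    dd if="$1" bs=1024 count="$2" 2>/dev/null | gzip -c | wc -c | tr -d ' '
}

zero_size=$(compressed_bytes /dev/zero 1024)
random_size=$(compressed_bytes /dev/urandom 1024)

echo "1 MiB of zeroes      -> ${zero_size} bytes compressed"
echo "1 MiB of random data -> ${random_size} bytes compressed"
```

The zeroed megabyte should shrink to a tiny fraction of its size, while the random megabyte stays at about 1 MiB.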
