CoreOS Disk Image Rehydrator
Part of implementing https://github.com/coreos/fedora-coreos-tracker/issues/828 which is in turn part of https://github.com/openshift/enhancements/pull/201
In CoreOS we ship a lot of disk images, one for each platform. These are almost entirely the same thing, mostly just with an ignition.platform.id
stamp in each disk image, wrapped in a hypervisor/platform specific container.
Plus for bare metal, we ship both a "split out" PXE setup as well as an ISO, which are also the same thing.
Today, users access this via stream metadata as documented here: https://docs.fedoraproject.org/en-US/fedora-coreos/stream-metadata/
The goal of this project is to make our build and delivery pipeline more "container native" by creating a container image that de-duplicates all of those images, and can generate them on-demand.
Specific goals:
- Make offline mirroring much easier because an administrator can just mirror this container image along with the other containers they want
- Orient our release engineering to be more "container native" in general
- Ensure this container image is signed, which then in turn means we have a signature that covers all of our disk images.
Try it now!
An image is uploaded to quay.io/cgwalters/fcos-images:v0.1.1
(also tagged stable
). For example here we extract the OpenStack image:
$ podman run --rm -i quay.io/cgwalters/fcos-images:v0.1.1 rehydrate - --disk openstack > fedora-coreos-openstack.x86_64.qcow2
And now you can e.g. upload this image with glance.
There's more artifacts, for example use --iso
to get the metal
live ISO.
We're using -
to output to stdout, because it's more convenient than dealing with podman bind mounts. You can also use e.g. podman run --rm -i -v .:/out:Z quay.io/cgwalters/fcos-images:v0.1.1 rehydrate /out --disk openstack
to write to a directory that was bind mounted from the host.
When extracting multiple things to stdout (e.g. --iso --disk qemu
to get both the ISO and qemu.qcow2
, or --pxe
) then the output stream will be a tarball which you can extract via piping to tar xf -
.
Use --help
to see other commands. Notice in the above invocation, we chose the filename, and we also don't have the version number. This information can currently be retrieved via the print-stream-json
command, which outputs the stream metadata stored in the image.
As of right now for Fedora CoreOS stable
, the original (compressed) images total 8.34GiB
, and the container image is 1.77GiB
, so a savings of nearly 80%
which isn't bad. Most importantly, adding new platforms will only incur a space hit of ~20MiB
and not !700MiB
. But there's more we can do here - see the issues list for details on ideas.
We also aren't including all the images; e.g. vmware
is doable but needs some ova
handling. The more images we include here, the better the overall compression ratio will look.
Goal: Bit-for-bit uncompressed SHA-256 match
Our CI tests the disk images. For example, we have extensive tests that exercise the live .iso
, etc.
In order to ensure that we're re-generating what we tested, ideally the uncompressed SHA-256 checksum matches.
However, this is not currently done for all formats.
Image differences
The -openstack.qcow2
and the -qemu.qcow2
only differ in the ignition.platform.id
in the boot partition that is written by https://github.com/coreos/coreos-assembler/blob/master/src/gf-platformid
However, the way we replace this also causes e.g. filesystem metadata (extents, timestamps) to change.
Plus, on s390x we need to rerun zipl
which changes another bit of data.
Compression
We need to get this out of the way: compression gets very hard to reproduce. See https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
Our initial goal will be to generate uncompressed images, and verify the uncompressed SHA-256 matches. That's all we need to be sure we've generated the same thing - we don't need to replicate the compression exactly. We can trust our compression tools.
Hence, while we ship e.g. -qemu.qcow2.xz
(or .gz
for RHCOS currently), we will primarily generate e.g. -qemu.qcow2
and for callers that want it compressed, we may pass it to gzip -1
or so.
First approach: Add the -qemu.qcow2 image and use rsync to regenerate most images
A lowest common denominator to de-duplicate these is rsync-style rolling checksums. This approach is also used by ostree "baseline" deltas, although it can also use bsdiff.
Other approach: Reuse oscontainer content
We need to have a separate container image from machine-os-content
in the RHCOS case, because today any change to machine-os-content
will cause nodes to update on all platforms. But we don't want to force every machine to reboot just because we needed to respin the vsphere OVA!
However, what may work well is to have our image do FROM quay.io/openshift/machine-os-content@sha256:...
Today the rhcos oscontainer is an "archive" mode repo (each file individually compressed). In the future with ostree-ext containers we'll have an uncompressed repo, which will be easier to use as a deduplication source.
Either way, what may work is to e.g. generate a delta from "oscontainer stream" (tarball e.g.) to the metal/qcow2 image, and then deltas from that to other images.
Related: osmet?
https://github.com/coreos/coreos-installer/blob/master/docs/osmet.md
We could be much smarter about our deltas with an osmet-style approach. We even ship the ISO which has osmet for the metal images inside it.
So a simple approach could be to ship the ISO as the basis, and launch it in qemu to have it generate the metal image. Then we use the metal image as an rsync-style rolling basis for everything else.
See below for more on the ISO.
VMDK
AWS and vSphere are "VMDK" images which have internal compression. cosa today uses an invocation like this:
$ qemu-img convert -O vmdk -f qcow2 -o adapter_type=lsilogic,subformat=streamOptimized,compat6 fcos-qemu.qcow2 fcos-aws.vmdk
And this streamOptimized
bit seems to turn on internal compression. Further, there's a bit of random data generated during this process: https://github.com/qemu/qemu/blob/266469947161aa10b1d36843580d369d5aa38589/block/vmdk.c#L2519 (Which will be handled by the rsync-style delta)
For now, let's hardcode the qemu options for these two here again. But longer term perhaps we fork off qemu-img info
to try to gather this, or change coreos-assembler to include the qemu-img
options used to generate the image.
ISO
The structure of the ISO is mostly a wrapper for images/pxeboot/rootfs.img
which is a CPIO blob which contains a squashfs
plus the osmet glue.
Reproducing squashfs bit may require using a fork:
What is "rehydration"?
A tip of the hat to https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel) where an alien species can "dehydrate" during times of crisis to a thinned-out version that can be stored, then "rehydrate" when it's over.