Building small Docker images faster

Posted by steinuil 4 days ago

Comments

Comment by grim_io 3 days ago

I've seen so many devs who don't know that things like multi-stage builds even exist.

Multi gigabyte containers everywhere.

Comment by zerotolerance 3 days ago

I always like finding people advocating for older sage knowledge and bringing it forward for new audiences. That said, as someone who wrote a book about Docker and has lived the full container journey, I tend to skip the containerized build altogether. Docker makes for great packaging. But containerizing every step of the build process, or even just doing it in one big container, is a bit extra. Positioning it as a build scripting solution was silly.

Comment by maccard 3 days ago

I’m inclined to agree with you about not building containers. That said, I find myself going around in circles. We have an app that uses a specific toolchain version, how do we install that version on a build machine without requiring an SRE ticket to update our toolchain?

Containers nicely solve this problem. Then your builds get a little slow, so you want to cache things. Now your docker file looks like this. You want to run some tests - now it’s even more complicated. How do you debug those tests? How do those tests communicate with external systems (database/redis)? Eventually you end up back at “let’s just containerise the packaging”.

Comment by cogman10 3 days ago

You can mount the current directory into docker and run an image of your tool.

Here's an example of that from the docker maven image.

`docker run -it --rm --name my-maven-project -v "$(pwd)":/usr/src/mymaven -w /usr/src/mymaven maven:3.3-jdk-8 mvn clean install`

You can get as fancy as you like with things like your `.m2` directory, this just gives you the basics of how you'd do that.
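
As a rough sketch of the "fancier" variant, the same command can be assembled with the host's `.m2` directory mounted too, so Maven dependencies survive between runs. The helper name `maven_docker_command` is made up for illustration, the `/root/.m2` target assumes the container runs as root (the usual setup for the official maven image, but worth checking for your image), and `-it` and `--name` are dropped on the assumption that this runs unattended on a build node.

```python
import shlex
from pathlib import Path


def maven_docker_command(project_dir: Path, m2_dir: Path,
                         image: str = "maven:3.3-jdk-8") -> list:
    """Build the argv for running `mvn clean install` inside a container.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "docker", "run", "--rm",
        # Mount the source tree, same as the one-liner above.
        "-v", f"{project_dir}:/usr/src/mymaven",
        # Assumption: Maven's local repository lives at /root/.m2 in the image.
        "-v", f"{m2_dir}:/root/.m2",
        "-w", "/usr/src/mymaven",
        image, "mvn", "clean", "install",
    ]


if __name__ == "__main__":
    cmd = maven_docker_command(Path.cwd(), Path.home() / ".m2")
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a string avoids quoting problems when the project path contains spaces.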

Comment by maccard 2 days ago

Thanks - this is an interesting idea I had never considered. I do like the layer-based caching of Dockerfiles, which you give up entirely for this, but it allows for things like running containerised builds against cached SCM checkouts (our repository is 300GB…)

Comment by cogman10 2 days ago

Yeah, it's basically tradeoffs all around.

The benefit of this approach is it's a lot easier to make sure dependencies end up on the build node so you aren't redownloading and caching the same dependency for multiple artifacts. But then you don't get to take advantage of docker build caching to speed up things when something doesn't change.

That's the part about docker I don't love. I get why it's this way, but I wish there was a better way to have it reuse files between images. The best you can do is a cache mount. But that can run into size issues as time goes on which is annoying.

Comment by pstuart 3 days ago

Depending on how the container is structured, you could have the original container as a baseline default, and then have "enhanced" containers that use it as a base and overlay the caching and other errata to serve that specialized need.

Comment by maccard 2 days ago

I’ve tried this in the past, but it pushes the dependency management of the layers into whatever is orchestrating the container build, as opposed to multi-stage builds, which will parallelise!

Not dismissing, but it’s just caveats every which way. I think in an ideal world I just want Bazel or Nixos without the baggage that comes with them - docker comes so close but yet falls so short of the finish line.

Comment by yjftsjthsd-h 3 days ago

I quite strongly disagree; a Dockerfile is a fairly good way to describe builds, a uniform approach across ecosystems, and the self contained nature is especially useful for building software without cluttering the host with build dependencies or clashing with other things you want to build. I like it so much that I've started building binaries in docker even for programs that will actually run on the host!

Comment by solatic 3 days ago

It can indeed be uniform across ecosystems, but it's slow. There's a very serious difference between being on a team where CI takes ~1 minute to run, vs. being on a team where CI takes a half hour or even, gasp, longer. A large part of that is the testing story, sure, but when you're really trying to optimize CI times, then every second counts.

Comment by yjftsjthsd-h 2 days ago

If the difference is <1 minute vs >30 minutes, containers (per se) are not the problem. If I was guessing blindly, it sounds like you're not caching/reusing layers, effectively throwing out a super easy way to cache intermediate artifacts and trashing performance for no good reason. And in fact, this is also a place where I think docker - when used correctly - is quite good, because if you (re)use layers sensibly it's trivial to get build caching without having to figure out a per-(language|build system|project) caching system.

Comment by solatic 2 days ago

I'm exaggerating somewhat. But I'm familiar with Docker's multi-stage builds and how to attempt to optimize cache layers. The first problem that you run into, with ephemeral runners, is where the Docker cache is supposed to be downloaded from, and it's often not faster at all compared to re-downloading artifacts (network calls are network calls, and files are files after all). This is fundamentally different from per-language caching systems where libraries are known to be a dumb mirror of upstream, often hash-addressed for modern packaging, and thus are safe to share between builds, which means that it is safe to keep them on the CI runner and not be forced to download the cache for a build before starting it.

> without having to figure out a per-language caching system

But most companies, even large ones, tend to standardize on no more than a handful of languages. Typescript, Python, Go, Java... I don't need something that'll handle caching for PHP or Erlang or Nix (not that you can really work easily with Nix inside a container...) or OCaml or Haskell... Yeah I do think there's a lot of room for companies to say, this is the standardized supported stack, and we put in some time to optimize the shit out of it because the DX dividends are incredible.

Comment by yjftsjthsd-h 2 days ago

I really don't see how that's different at all, certainly not fundamentally. You can download flat files over the network, and you can download OCI image layers over the network. I'm pretty sure those image layers are hash-addressed and safe to share between builds, too, and you should make every effort to keep them on the CI runner and reuse them.

Comment by maccard 3 days ago

You can have fast pipelines in containers - I’ve worked in quick containerised build environments and agonisingly slow non-containerised places, the difference is whether anyone actually cares and if there’s a culture of paying attention to this stuff.

Comment by solatic 3 days ago

Agree, and I would go another step to suggest dropping Docker altogether for building the final container image. It's quite sad that Docker requires root to run, and all the other rootless solutions seem to require overcomplicated setups. Rootless is important because, unless you're providing CI as a public service and you're really concerned about malicious attackers, you will get way, way, way more value out of semi-permanent CI workers that can maintain persistent local caches compared to the overhead of VM enforced isolation. You just need an easy way to wipe the caches remotely, and a best-effort at otherwise isolating CI builds.

A lot of teams should think long and hard about just taking build artifacts, throwing them into their expected places in a directory taking the place of chroot, generating a manifest JSON, and wrapping everything in a tar, which is indeed a container.
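
For a sense of how little is involved, here is a minimal sketch that writes a single-layer image as an OCI image layout using only the Python standard library. The function names, the `amd64` architecture, and the example file are all assumptions for illustration; it skips compression, directory entries, and multi-arch indexes, so treat it as a starting point rather than a replacement for a real tool.

```python
import hashlib
import io
import json
import tarfile
import tempfile
from pathlib import Path

LAYER_TYPE = "application/vnd.oci.image.layer.v1.tar"
CONFIG_TYPE = "application/vnd.oci.image.config.v1+json"
MANIFEST_TYPE = "application/vnd.oci.image.manifest.v1+json"
INDEX_TYPE = "application/vnd.oci.image.index.v1+json"


def write_blob(layout: Path, data: bytes, media_type: str) -> dict:
    """Store data as blobs/sha256/<hex> and return a descriptor for it."""
    hexdigest = hashlib.sha256(data).hexdigest()
    blob_dir = layout / "blobs" / "sha256"
    blob_dir.mkdir(parents=True, exist_ok=True)
    (blob_dir / hexdigest).write_bytes(data)
    return {"mediaType": media_type, "digest": f"sha256:{hexdigest}", "size": len(data)}


def build_layer(files: dict) -> bytes:
    """Pack {path inside image: file contents} into an uncompressed tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(content)
            info.mode = 0o755
            info.mtime = 0  # fixed timestamp so the digest is reproducible
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def to_json(doc: dict) -> bytes:
    return json.dumps(doc, sort_keys=True).encode()


def build_image(layout: Path, files: dict, entrypoint: list) -> dict:
    """Write a single-layer image layout and return the manifest descriptor."""
    layout.mkdir(parents=True, exist_ok=True)
    layer = write_blob(layout, build_layer(files), LAYER_TYPE)
    config = write_blob(layout, to_json({
        "architecture": "amd64",  # assumption: adjust for your target
        "os": "linux",
        "config": {"Entrypoint": entrypoint},
        # The layer is uncompressed, so its diff_id equals its blob digest.
        "rootfs": {"type": "layers", "diff_ids": [layer["digest"]]},
    }), CONFIG_TYPE)
    manifest = write_blob(layout, to_json({
        "schemaVersion": 2,
        "mediaType": MANIFEST_TYPE,
        "config": config,
        "layers": [layer],
    }), MANIFEST_TYPE)
    (layout / "oci-layout").write_text(json.dumps({"imageLayoutVersion": "1.0.0"}))
    (layout / "index.json").write_bytes(to_json({
        "schemaVersion": 2,
        "mediaType": INDEX_TYPE,
        "manifests": [manifest],
    }))
    return manifest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # "app/server" stands in for a real build artifact.
        desc = build_image(Path(tmp) / "image", {"app/server": b"fake binary"}, ["/app/server"])
        print(desc["digest"])
```

Tarring up the resulting directory gives the "tar which is a container" from the paragraph above. Because timestamps and key order are fixed, the same inputs should produce the same manifest digest on every run.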

Comment by OptionOfT 2 days ago

I like to build my stuff inside of Docker because it is my moat against changes of the environment.

We have our base images, and in there we install dependencies by version (as apt seemingly doesn't have any lock file support?). That package is then the base for our code build.

In the subsequent build EVERYTHING is versioned, which allows us to establish provenance all the way up to the base image.

And next to that when we promote images from PR -> main we don't even rebuild the code. It's the same image that gets retagged. All in the name of preserving provenance.

Comment by solatic 2 days ago

You can still use a base image; you download it from the registry, extract the tar, then add your build artifacts before re-generating a manifest and re-tarring. If you specify a base image digest, you can also use a hash-addressed cache and share it between builds safely without re-downloading.
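
A hash-addressed cache along those lines can be sketched in a few lines of Python. Everything here is illustrative: `cached_blob` is a made-up name, and `fetch` stands in for whatever actually talks to the registry. The point is only that a blob stored under its own digest can be verified on every read, which is what makes it safe to keep on a semi-permanent runner.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_blob(cache_dir: Path, digest: str, fetch: Callable[[str], bytes]) -> bytes:
    """Return the blob for `digest`, calling `fetch` only on a cache miss."""
    algo, _, hexdigest = digest.partition(":")
    if algo != "sha256" or not hexdigest:
        raise ValueError(f"unsupported digest: {digest!r}")
    path = cache_dir / algo / hexdigest
    if path.is_file():
        data = path.read_bytes()
        if hashlib.sha256(data).hexdigest() == hexdigest:
            return data  # cache hit, verified against its own name
    # Miss (or corrupted entry): download and verify before storing.
    data = fetch(digest)
    if hashlib.sha256(data).hexdigest() != hexdigest:
        raise ValueError(f"fetched content does not match {digest}")
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(path)  # rename into place so readers never see a half-written blob
    return data


if __name__ == "__main__":
    base = b"pretend this is a base image layer"
    digest = "sha256:" + hashlib.sha256(base).hexdigest()
    calls = []

    def fake_registry(d: str) -> bytes:
        # Stand-in for a real registry download.
        calls.append(d)
        return base

    with tempfile.TemporaryDirectory() as tmp:
        cached_blob(Path(tmp), digest, fake_registry)
        cached_blob(Path(tmp), digest, fake_registry)
    print(f"registry fetches: {len(calls)}")
```

Wiping the cache remotely is then just deleting the directory, since every entry can be re-fetched and re-verified from its name.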

Once you have your container image, promoting it is a piece of cake: skopeo doesn't require root and often doesn't require re-pulling the full tar. Containerization is great; I'm specifically trying to point out that there are alternatives to Docker.

Comment by miladyincontrol 2 days ago

I mean, personally I find nspawn to be a pretty simple way of doing rootless containers. Replace the manifest JSON with a systemd service file and you've got a rootless container that can run on most Linux systems without any non-systemd dependencies or strange configuration required. Don't even need to extract the tarball.

Comment by dboreham 3 days ago

Agree. Using a container to build the source that is then packaged as a "binary" in the resulting container always seemed odd to me. IMHO we should have stuck with the old ways: build the product on a regular computer. That outputs some build artifacts (binaries, libraries, etc.). Docker should take those artifacts and not be hosting the compiler and whatnot.

Comment by rcxdude 3 days ago

If anything, the build being in a container is the more valuable bit, though mainly because the container is usually more repeatable by having a scripted setup itself. Though I dunno why the build and the host would be the _same_ container in the end.

(and of course, nix kinda blows both out of the water for consistency)

Comment by exe34 3 days ago

nix allows you to build docker containers with anything you can build in nix.

Comment by rapidlua 3 days ago

For Go specifically, I find ko-build handy. It builds on the host (leveraging Go cross-compilation and taking advantage of caches) and outputs a Docker image.

Comment by paulddraper 3 days ago

A Bazel option is https://github.com/bazel-contrib/rules_oci

Doesn’t even need Docker, just writes the image files with a small Python script.

Can build from scratch, or use the very small Distroless images.

Comment by lrvick 3 days ago

For even smaller images that are always deterministic/reproducible with a multi-party signed supply chain, check out https://stagex.tools

Comment by abound 3 days ago

Might want to disclose that you built it.

Also, I took a quick look and I don't understand how your tool could possibly produce "even smaller images". The article is using multi-stage builds to produce a final Docker image that is quite literally just the target binary in question (based on the scratch image), whereas your tool appears to be a whole Linux distribution.

Comment by lrvick 2 days ago

I am one of the maintainers at this point, fair.

This would be a much smaller drop-in replacement for the base images used in the post, giving full-source-bootstrapped final binaries.

You can still use `FROM scratch` for the final layer, of course, and that would be unlikely to change the size much, to your point.