Speed up Docker image builds with prebuilt base images

Posted on February 18, 2019 by wjwh


Building Docker images is an increasingly popular way to package applications into an easy-to-deploy format. The Docker tools include a handy layer cache that can speed up image building significantly. However, the layer cache does not play all that well with package managers like Bundler and Yarn. Sometimes the cache is not available at all, or only as a paid add-on, such as in certain hosted CI environments. In this post, I will describe a method we used to decrease the build time for Docker images of one of our Ruby applications from over 5 minutes to 10-20 seconds.

Package managers and Docker caching

A minimal Dockerfile for a Ruby app looks something like this:

FROM ruby:2.5.1
COPY . .
RUN bundle install
EXPOSE 9292

The above Dockerfile has a problem though: whenever any file in the app changes, the COPY layer is invalidated along with every layer after it, so all the gems are reinstalled. Especially for gems that contain large C extensions, this can take significant time since all the C code has to be compiled. A common piece of advice is to move the gem installation phase into its own layer, like this:

FROM ruby:2.5.1
COPY Gemfile Gemfile.lock ./
RUN bundle install 
COPY . .
EXPOSE 9292

This makes better use of the Docker build cache, since changes to the app code without changes to the Gemfile will not cause all the gems to be rebuilt. Any update to the Gemfile or the Gemfile.lock will still cause every single gem to be reinstalled, however, rather than just the gem that was updated. It also does not help at all if the Docker build cache is not available for some reason, such as in some CI environments. For a medium-sized Rails app, installing all the gems (and apk packages) took more than 5 minutes.

Pre-built images to the rescue!

The root problem is that the Docker build cache works on a per-command basis, on the assumption that each command does a single thing. The RUN bundle install command does not fit this assumption, since that one command can install potentially hundreds of dependencies. Changing one line in the Gemfile invalidates the entire layer, not just the gems that were actually updated. This is extremely wasteful, since Bundler will usually skip any gems that are already installed. One possible remedy is to move the gem installation phase into a separate image that then becomes the starting point of the ‘actual’ image building process:

# Dockerfile.base, tag this as myapp-base
FROM ruby:2.5.1
# If you need additional apt packages, those can be installed here as well
COPY Gemfile Gemfile.lock ./
RUN bundle install

# Dockerfile
FROM myapp-base
COPY Gemfile Gemfile.lock ./
# will only update gems that have changed since building the base image
RUN bundle install
COPY . .
EXPOSE 9292

In this way, even if you update one or more gems, Bundler only has to install the gems that were updated. If nothing has changed, the bundle install command is a no-op that will be fast even without the Docker layer cache. Building a ‘final’ image for the same medium-sized Rails app using this method only takes 10-20 seconds, depending on whether any gems were updated since the base image was built.
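
To make the two-step build concrete, here is a minimal sketch of the commands involved. It assumes the file names and the myapp-base tag from the Dockerfiles above. The myapp tag for the final image and the DOCKER variable (which only exists so the commands can be dry-run with DOCKER=echo) are placeholders for illustration, not part of our actual setup.

```shell
# Hypothetical helper functions; source this file and call them as needed.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the commands
# instead of running them.
DOCKER="${DOCKER:-docker}"

# Slow step: installs every gem from scratch. Run this rarely.
build_base() {
  "$DOCKER" build -f Dockerfile.base -t myapp-base .
}

# Fast step: starts FROM myapp-base, so bundle install only has to
# fetch gems that changed since the base image was built.
build_app() {
  "$DOCKER" build -t myapp .
}
```

In this sketch, build_base would be run occasionally and build_app on every commit. In a hosted CI environment the base image would also have to be pushed to a registry that the CI workers can pull from.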

Dealing with updates

One potential weakness of the prebuilt image approach is that the base image gets more and more out of date as gems get updated. At work, we use the Depfu service to keep our dependencies updated, and for some of our larger services we might update gems multiple times per week or even per day. If we never rebuilt the base image, or rebuilt it only rarely, it would fall behind fairly quickly and we would lose much of its benefit.

To solve this, we run a daily cron job that rebuilds the base images with the latest gems. This essentially shifts the gem installation to a time when we don’t care how long it takes (i.e. at night) and also reuses that installation for all the image builds performed that day. Some companies have opted for a company-wide base image for all their services, while we opted for a per-service base image. Both approaches have benefits and drawbacks, and you should choose whatever fits your situation best.
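
As a rough sketch of what such a nightly job could look like (not our exact script): the registry name registry.example.com/myapp-base, the BASE_IMAGE and DOCKER variables, and the push step are all assumptions for illustration. If the base image is tagged this way, the FROM line of the main Dockerfile has to reference the same name.

```shell
# Hypothetical nightly rebuild; source this file and call rebuild_base.
# DOCKER=echo turns it into a dry run that only prints the commands.
DOCKER="${DOCKER:-docker}"
# Placeholder image name: replace with your own registry and repository.
BASE_IMAGE="${BASE_IMAGE:-registry.example.com/myapp-base}"

rebuild_base() {
  # Rebuild the base image from the current Gemfile and Gemfile.lock,
  # then publish it so later builds can start FROM it.
  # The push only happens if the build succeeded.
  "$DOCKER" build -f Dockerfile.base -t "$BASE_IMAGE" . &&
    "$DOCKER" push "$BASE_IMAGE"
}
```

A crontab entry along the lines of `0 3 * * * cd /srv/myapp && . ./rebuild_base.sh && rebuild_base` would then run it every night at 03:00. Both the path and the file name in that entry are made up.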

Conclusion

The Docker layer cache is pretty great when it works, but it does not integrate well at all with package managers such as Bundler and Yarn (and presumably others, but I have no first-hand knowledge of those). There are also situations where it is not available at all, such as in hosted CI environments like Travis CI. By building a base image we can radically increase build speed at the cost of only a small increase in complexity.