Refactoring a Dockerfile for image size

There’s been a welcome focus on image size in the Docker community recently. Smaller images are being championed both by Docker and by the community. When many images clock in at several hundred MB and ship with a large Ubuntu base, that focus is badly needed. Here are the sizes of the top 10 images (latest tag) on Docker Hub today:

IMAGE NAME     SIZE
busybox        1 MB
ubuntu         188 MB
swarm          17 MB
nginx          134 MB
registry       423 MB
redis          151 MB
mysql          360 MB
mongo          317 MB
node           643 MB
debian         125 MB

A lot of the benefit can be had by simply using a small base image (Alpine Linux, BusyBox, etc.). Enough has been written about these base images, so I assume you’ve already picked a good one. After that, it’s up to the maintainer of the Dockerfile to know some best practices and keep the image small. Specifically, we’ll examine how joining multiple RUN commands onto one line affects image size, and walk through some practical apt-get best practices (i.e., removing the apt cache and using --no-install-recommends).

Remove cruft in the same Dockerfile line that you added it

Docker images are built from a layered filesystem. Each layer contains only the differences between it and the one below it. At the top, you see a unified view, but the history of how it was built is maintained. Each instruction in a Dockerfile creates a new layer on top of the existing stack.

For example, let’s start with a Dockerfile snippet that looks like this:

ADD https://storage.googleapis.com/golang/go1.5.3.src.tar.gz /tmp

# do some things with that file

RUN rm /tmp/go1.5.3.src.tar.gz

You might think you are doing a good and responsible thing by deleting the .tar.gz file when you are done. But the layer containing that file is still part of the image. The rm command hides the file from the final unified view, but the contents of the .tar.gz are still stored in the earlier layer, and will still be downloaded by everyone who runs docker pull on your image.
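
To make the accounting concrete, here is a toy model in Python. It is only a sketch of the idea, not Docker's actual storage code: the layer representation, the function names, and the 80 MB tarball size are all made up for illustration.

```python
# Toy model of image layers: each layer records the files it adds
# (path -> size in MB) and the paths it deletes.
# Illustration only; this is not how Docker stores layers internally.

def visible_files(layers):
    """Return the files seen in the unified view at the top of the stack."""
    view = {}
    for layer in layers:
        view.update(layer.get("add", {}))
        for path in layer.get("delete", []):
            view.pop(path, None)
    return view


def pulled_size(layers):
    """Return the MB a client downloads: every layer's additions, deleted later or not."""
    return sum(sum(layer.get("add", {}).values()) for layer in layers)


# Hypothetical size: an 80 MB tarball added in one layer and removed in the next.
separate = [
    {"add": {"/tmp/go.tar.gz": 80}},  # ADD <url> /tmp
    {"delete": ["/tmp/go.tar.gz"]},   # RUN rm /tmp/...
]

# Download, use, and delete inside a single RUN: the tarball is gone
# before the layer is committed, so the layer adds nothing.
combined = [
    {"add": {}},                      # RUN curl ... && ... && rm ...
]

print(visible_files(separate))  # {}
print(pulled_size(separate))    # 80
print(pulled_size(combined))    # 0
```

In both cases the file is invisible in the final view, but only the single-layer version avoids shipping it.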

It’s better to write it all on one line so it’s not committed to the image as separate layers. For example, a small rewrite of the snippet above would be:

RUN curl -o \
        /tmp/go1.5.3.src.tar.gz \
        https://storage.googleapis.com/golang/go1.5.3.src.tar.gz && \
      <do some things with the file> && \
      rm /tmp/go1.5.3.src.tar.gz

It’s not as pretty to look at, but it results in a much smaller image. If that line really annoys you, put the commands in a script, then ADD the script and RUN it from the Dockerfile.

Remove your apt/yum cache, but do it right!

Most Dockerfile authors know that you should apt-get remove any unnecessary packages. One common example is an image that installs curl and/or wget only to download files during the build. You can apt-get remove curl afterwards, but if you do it in a later instruction, the layer containing it will remain present in the final image. Remove such packages (and all auto-installed dependencies) in the same Dockerfile line you added them.

This is especially tricky for complex Dockerfiles, so let’s walk through an example.


In practice, let’s see an example

Here’s a simplified version of a typical Dockerfile that might run a Python service. Don’t worry, we will optimize this.

FROM ubuntu:14.04
RUN apt-get update
RUN apt-get install -y curl python-pip

RUN pip install requests

ADD ./my_service.py /my_service.py
ENTRYPOINT ["python", "/my_service.py"]

my_service.py is a Python script that simply contains:

#!/usr/bin/python
print 'Hello, world!'

Time to build and check the image size:

$ sudo docker build -t size .
$ sudo docker images
REPOSITORY      TAG           IMAGE ID            CREATED           VIRTUAL SIZE
size            latest        da8a9be731ac        4 seconds ago     360.5 MB
ubuntu          14.04         6cc0fc2a5ee3        2 weeks ago       187.9 MB

Yikes. The 188 MB base image makes sense from the table above, but we’ve practically doubled the image size to run a hello-world python script. What exactly is being reported in the 360.5 MB number? It’s the total of the “visible” layer (the top one, da8… in my example) and all layers that were used to create this top layer.

Adding a cleanup layer

We should probably clean up after ourselves. Let’s try a Dockerfile that looks like this:

FROM ubuntu:14.04
RUN apt-get update
RUN apt-get install -y curl python-pip

RUN pip install requests

## Clean up
RUN apt-get remove -y python-pip curl
RUN rm -rf /var/lib/apt/lists/*

ADD ./my_service.py /my_service.py
ENTRYPOINT ["python", "/my_service.py"]

Building and checking on that yields:

$ sudo docker build -t size .
$ sudo docker images
REPOSITORY      TAG           IMAGE ID            CREATED           VIRTUAL SIZE
size            latest        c6dacdd00660        2 seconds ago     361.3 MB
ubuntu          14.04         6cc0fc2a5ee3        2 weeks ago       187.9 MB

It grew larger (slightly)! Cleaning up after ourselves has backfired!

Cleaning up in the same layer

Let’s try collapsing the apt operations into a single line:

FROM ubuntu:14.04
RUN apt-get update && \
    apt-get install -y curl python-pip && \
    pip install requests && \
    apt-get remove -y python-pip curl && \
    rm -rf /var/lib/apt/lists/*

ADD ./my_service.py /my_service.py
ENTRYPOINT ["python", "/my_service.py"]

Building and running this version yields:

$ sudo docker build -t size .
$ sudo docker images
REPOSITORY      TAG           IMAGE ID            CREATED           VIRTUAL SIZE
size            latest        e531f8674f33        9 seconds ago     338 MB
ubuntu          14.04         6cc0fc2a5ee3        2 weeks ago       187.9 MB

Ok, that made it smaller. But why is it still so huge? I was expecting a lot less.

More apt-optimizations

It turns out that apt-get install brings along a handful of other “recommended” packages. Recommended packages are ones that the package maintainer expects most installations to want, but that are not strictly required. Some users will need them because of their environment or how they use the package, but many will not.

For pip on Ubuntu 14.04, it’s very easy to confirm that leaving out the recommended packages has no side effects. For your own packages, this is something you should definitely test before you ship to production. A quick scan of the official images on Docker Hub shows that redis, mysql, mongo, postgres, elasticsearch and more use this technique to make their images smaller.

Let’s try it again with --no-install-recommends in the apt-get install:

FROM ubuntu:14.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl python-pip && \
    pip install requests && \
    apt-get remove -y python-pip curl && \
    rm -rf /var/lib/apt/lists/*

ADD ./my_service.py /my_service.py
ENTRYPOINT ["python", "/my_service.py"]

Building and running this version yields:

REPOSITORY      TAG           IMAGE ID            CREATED           VIRTUAL SIZE
size            latest        fddc30aee4dc        6 seconds ago     229.2 MB
ubuntu          14.04         6cc0fc2a5ee3        2 weeks ago       187.9 MB

Ok, that just dropped about 109 MB from the image (338 MB down to 229.2 MB). This looks good.
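
Since every one of these images carries the same 187.9 MB base, it helps to look at what each Dockerfile adds on top of it. The short Python sketch below just does that subtraction using the sizes reported by docker images above; the labels and the helper name are mine, chosen for illustration.

```python
# Sizes in MB, copied from the `docker images` output of the four builds above.
BASE_MB = 187.9  # ubuntu:14.04

builds = [
    ("separate RUN lines", 360.5),
    ("cleanup in extra layers", 361.3),
    ("single RUN line", 338.0),
    ("--no-install-recommends", 229.2),
]


def added_over_base(size_mb, base_mb=BASE_MB):
    """Return the MB that the Dockerfile adds on top of the base image."""
    return round(size_mb - base_mb, 1)


for label, size_mb in builds:
    print(f"{label:26s} {size_mb:6.1f} MB total, "
          f"{added_over_base(size_mb):6.1f} MB over base")
```

Seen this way, the work done by the Dockerfile shrinks from roughly 173 MB on top of the base to roughly 41 MB, even though the base image itself has not changed.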


Create a Dockerfile strategy in your organization to control this. The Dockerfile syntax is easy to learn, but very nuanced when it comes to optimization.

Update

We open sourced an opinionated Dockerfile linter that you can integrate into your build process and try out at www.fromlatest.io.

5 thoughts on “Refactoring a Dockerfile for image size”

  1. Images for Ubuntu, including ubuntu-debootstrap, come with some cruft that bloats their size. For example, you don’t need *mkfs.ext2* in a Docker image for general purposes. Someone clearly dropped the ball here. If you like, go the extra mile and actually build a base image that is tailored to your needs.

    For example, *apt-get -y install A B curl D* and *apt-get -y install A B curl D E* might result in two compact layers, but the resulting layers still contain repetition that will need to be downloaded over and over again.

    On the other hand, trying to utilize caching by using *apt-get -q update && apt-get -y install A && apt-get clean* for every one of A, B, curl, D will result in many, many layers, up to the limit of overlayfs. (And they are cached worse than the *.deb* files themselves.)

    Anyway, here’s my base images and how I obtained them: https://hub.docker.com/r/blitznote/debootstrap-amd64/

  2. openSUSE official images are also less than 100MB, being smaller than debian, ubuntu and fedora.

    REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
    debian latest 7a01cc5f27b1 13 days ago 125.1 MB
    ubuntu latest 6cc0fc2a5ee3 2 weeks ago 187.9 MB
    alpine latest 2314ad3eeb90 2 weeks ago 4.79 MB
    busybox latest b175bcb79023 3 weeks ago 1.114 MB
    fedora latest 3fc68076e184 4 weeks ago 206.3 MB
    opensuse latest bca2ad8ee9a4 8 weeks ago 96.14 MB

    • I totally and completely agree. If you can use Alpine Linux as your base layer, do it. It’s trivially small and apk is a pretty decent package manager. We use Alpine, and Docker Hub is moving the official images over to Alpine as well.
