Refactoring a Dockerfile for Image Size

Update

Since this post, Docker has released improved support for writing complex and still maintainable Dockerfiles. Check out our blog post on multi-stage Docker builds.

Original Post

There’s been a welcome focus in the Docker community recently on image size. Smaller images are being championed by Docker and by the community. With many images clocking in at several hundred MB and shipping on a large Ubuntu base, that attention is greatly needed. Here’s a review of the sizes (latest tag) of the top 10 images on Docker Hub today:

IMAGE NAME     SIZE  
busybox        1 MB  
ubuntu         188 MB  
swarm          17 MB  
nginx          134 MB  
registry       423 MB  
redis          151 MB  
mysql          360 MB  
mongo          317 MB  
node           643 MB  
debian         125 MB  

A lot of the benefit can be had by simply using a small base image (Alpine Linux, BusyBox, etc.). Enough has been written about using these base images, so I assume you’ve already picked a good one. After that, it’s up to the maintainer of the Dockerfile to know some best practices and keep the image size small. Specifically, we’ll examine the image size implications of joining multiple RUN commands onto one line, along with some practical examples of best practices for apt-get (i.e. removing the apt-get cache and using --no-install-recommends).

Remove cruft in the same Dockerfile line that you added it

Docker images are built from a layered filesystem. Each layer contains only the differences between it and the one below it. At the top, you see a unified view, but the history of how it was built is maintained. Each line in a Dockerfile creates a new layer on top of the existing stack.

For example, let’s start with a Dockerfile snippet that looks like this:

ADD https://storage.googleapis.com/golang/go1.5.3.src.tar.gz /tmp

# do some things with that file

RUN rm /tmp/go1.5.3.src.tar.gz  

You might think you are doing a good and responsible thing by deleting the .tar.gz file when you are done. But the layer containing that file is still part of the image. You mask it from the final image with the rm command, but the contents of that .tar.gz file are still in the earlier image layer, and will still be downloaded by everyone who runs docker pull on your image.
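Here’s a rough way to picture that, as a toy Python 3 model rather than anything Docker actually runs. Each layer is a dictionary of paths and sizes, and a deletion is recorded as a marker in a later layer. The 20 MB size and the helper names (visible_files, pulled_size) are made up for illustration.

```python
# Toy model of a layered image: each layer maps a path to the size (in MB)
# it adds, or to None when the layer deletes that path.
# The 20 MB figure is invented for illustration.
layers = [
    {"/tmp/go1.5.3.src.tar.gz": 20.0},   # ADD ... /tmp
    {"/tmp/go1.5.3.src.tar.gz": None},   # RUN rm /tmp/go1.5.3.src.tar.gz
]


def visible_files(layers):
    """Return the unified view: what a running container would see."""
    view = {}
    for layer in layers:
        for path, size in layer.items():
            if size is None:
                view.pop(path, None)   # deleted here, so hidden from the view
            else:
                view[path] = size
    return view


def pulled_size(layers):
    """Return the total size of all layers, which is what gets downloaded."""
    return sum(
        size
        for layer in layers
        for size in layer.values()
        if size is not None
    )


print(visible_files(layers))   # {}
print(pulled_size(layers))     # 20.0
```

The file is gone from the unified view, yet the layer that introduced it still counts toward the download.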

It’s better to write it all on one line so it’s not committed to the image as separate layers. For example, a small rewrite of the snippet above would be:

RUN curl -o \
        /tmp/go1.5.3.src.tar.gz \
        https://storage.googleapis.com/golang/go1.5.3.src.tar.gz && \
      <do some things with the file> && \
      rm /tmp/go1.5.3.src.tar.gz

It’s not as pretty to look at, but it results in a much more efficient image size. If that line really annoys you, put the commands in a script, then ADD the script and RUN it from the Dockerfile.
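To see why the single-line form helps, here’s a minimal Python 3 sketch, assuming a layer stores only the net difference between the filesystem before and after one instruction. The paths under /usr/local/go and /bin/sh, the sizes, and the layer_diff helper are all hypothetical stand-ins.

```python
def layer_diff(before, after):
    """Return what a layer would store: paths added or changed by one instruction."""
    return {path: size for path, size in after.items() if before.get(path) != size}


# Hypothetical sizes in MB, for illustration only.
base = {"/bin/sh": 1.0}

# One RUN instruction: download, use, then delete the tarball.
fs = dict(base)
fs["/tmp/go1.5.3.src.tar.gz"] = 20.0     # curl -o ...
fs["/usr/local/go"] = 80.0               # stand-in for "do some things with the file"
del fs["/tmp/go1.5.3.src.tar.gz"]        # rm ...

# The snapshot is taken once, after the whole command has finished.
layer = layer_diff(base, fs)
print(layer)                             # {'/usr/local/go': 80.0}
print(sum(layer.values()))               # 80.0
```

Because the tarball is created and removed before the layer is committed, it never shows up in any layer.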

Remove your apt/yum cache, but do it right!

Most Dockerfile authors know that you should apt-get remove any unnecessary packages. One common example is an image that’s built with curl and/or wget to download files. You can apt-get remove curl afterwards, but the layer containing it will remain present in the final image. Remove such packages (and all auto-installed dependencies) in the same Dockerfile line where you added them.

This is especially tricky for complex Dockerfiles, so let’s walk through an example.


In practice, let’s see an example

Here’s a simplified version of a typical Dockerfile that might run a Python service. Don’t worry, we will optimize this.

FROM ubuntu:14.04  
RUN apt-get update  
RUN apt-get install -y curl python-pip

RUN pip install requests

ADD ./my_service.py /my_service.py  
ENTRYPOINT ["python", "/my_service.py"]  

my_service.py is a Python script that simply contains:

#!/usr/bin/python
print 'Hello, world!'  

Time to build and check the image size:

$ sudo docker build -t size .
$ sudo docker images
REPOSITORY      TAG           IMAGE ID            CREATED           VIRTUAL SIZE  
size            latest        da8a9be731ac        4 seconds ago     360.5 MB  
ubuntu          14.04         6cc0fc2a5ee3        2 weeks ago       187.9 MB  

Yikes. The 188 MB base image makes sense from the table above, but we’ve practically doubled the image size to run a hello-world python script. What exactly is being reported in the 360.5 MB number? It’s the total of the “visible” layer (the top one, da8… in my example) and all layers that were used to create this top layer.

Adding a cleanup layer

We should probably clean up after ourselves. Let’s try a Dockerfile that looks like this:

FROM ubuntu:14.04  
RUN apt-get update  
RUN apt-get install -y curl python-pip

RUN pip install requests

## Clean up
RUN apt-get remove -y python-pip curl  
RUN rm -rf /var/lib/apt/lists/*

ADD ./my_service.py /my_service.py  
ENTRYPOINT ["python", "/my_service.py"]  

Building and checking on that yields:

$ sudo docker build -t size .
$ sudo docker images
REPOSITORY      TAG           IMAGE ID            CREATED           VIRTUAL SIZE  
size            latest        c6dacdd00660        2 seconds ago     361.3 MB  
ubuntu          14.04         6cc0fc2a5ee3        2 weeks ago       187.9 MB  

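It grew larger (slightly)! Cleaning up after ourselves has backfired! The two cleanup commands are new layers of their own: they record the deletions and the updated package metadata on top of the earlier layers, which are still shipped in full.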

Cleaning up in the same layer

Let’s try collapsing the apt operations into a single line:

FROM ubuntu:14.04  
RUN apt-get update && \  
    apt-get install -y curl python-pip && \
    pip install requests && \
    apt-get remove -y python-pip curl && \
    rm -rf /var/lib/apt/lists/*

ADD ./my_service.py /my_service.py  
ENTRYPOINT ["python", "/my_service.py"]  

Building and checking this version yields:

$ sudo docker build -t size .
$ sudo docker images
REPOSITORY      TAG           IMAGE ID            CREATED           VIRTUAL SIZE  
size            latest        e531f8674f33        9 seconds ago     338 MB  
ubuntu          14.04         6cc0fc2a5ee3        2 weeks ago       187.9 MB  

Ok, that made it smaller. But why is it still so huge? I was expecting a lot less.

More apt-optimizations

It turns out that apt-get install brings along a handful of other “recommended” packages. Recommended packages for apt are simply dependencies that may or may not be required. Some users will require them because of their environment or how they use the package, but it’s not always a requirement.

For pip on Ubuntu 14.04, it’s very easy to confirm that leaving out the recommended packages has no side effects for this installation. This is something you should definitely test before you ship an image off to production. A quick scan of the official images on Docker Hub shows that redis, mysql, mongo, postgres, elasticsearch and more use this technique to make their images smaller.

Let’s try it again with --no-install-recommends added to the apt-get install command:

FROM ubuntu:14.04  
RUN apt-get update && \  
    apt-get install -y --no-install-recommends curl python-pip && \
    pip install requests && \
    apt-get remove -y python-pip curl && \
    rm -rf /var/lib/apt/lists/*

ADD ./my_service.py /my_service.py  
ENTRYPOINT ["python", "/my_service.py"]  

Building and checking this version yields:

REPOSITORY      TAG           IMAGE ID            CREATED           VIRTUAL SIZE  
size            latest        fddc30aee4dc        6 seconds ago     229.2 MB  
ubuntu          14.04         6cc0fc2a5ee3        2 weeks ago       187.9 MB  

Ok, that just dropped about 109 MB from the previous build (338 MB down to 229.2 MB), leaving roughly 41 MB on top of the base image. This looks good.
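To put the four builds side by side, here’s a small Python 3 sketch that subtracts the base image size from each reported total. The sizes are the ones from the docker images output above; the build labels and the overhead helper are my own shorthand.

```python
# Sizes in MB as reported by `docker images` for the builds above.
base = 187.9
builds = [
    ("separate RUN lines", 360.5),
    ("separate cleanup layers", 361.3),
    ("single RUN with cleanup", 338.0),
    ("single RUN, --no-install-recommends", 229.2),
]


def overhead(total, base):
    """MB added on top of the base image, rounded to one decimal place."""
    return round(total - base, 1)


for name, total in builds:
    added = overhead(total, base)
    print(f"{name:36s} {total:6.1f} MB total, {added:6.1f} MB over base")
```

By this arithmetic the first three builds add between about 150 MB and 173 MB to the base image, while the last one adds about 41 MB.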


Create a Dockerfile strategy in your organization to control this. The Dockerfile syntax is easy to learn, but very nuanced when it comes to optimization.
