Let's go easy on PyPI, OK?

[Post updated Sept 30, 2024 based on lots of community conversations.]

What do you think of when you visualize the Python Package Index and someone using it? A developer or data scientist building something new and creative? That’s cute (and yes of course it’s amazing and it does happen).

Closer to reality

For every individual periodically installing packages with pip and friends, there are literally thousands, or even hundreds of thousands, of automated systems building Docker images (or containers for Kubernetes or other infrastructure).

Maybe it should look more like this in your mind, an absolute hive of traffic and busy-ness, with few humans in the loop at all:

This is because containers are supposed to be isolated. So of course they have to download flask-sqlalchemy 100 times a day, right?

Or do they?

Containers also support caching and layering. That is, each line of the file that creates a container is a separate instruction that builds on the ones that came before. And if nothing that came before has changed, why run it again?

We can use this layering to make things much easier and cheaper for Python and PyPI.

Did you know that PyPI served more than 284 billion downloads across its half million projects in 2023? That’s $50k-$100k of traffic per month. Yikes.

What if we could get faster Docker builds and dramatically limit the demand on PyPI? I’ll show you a technique that we use at Talk Python (I’m sure there are variations).

Towards a gentler Dockerfile

Consider this Dockerfile (don’t worry if you don’t know Docker; I’ll explain, and what you need to know is simple). This is the most abusive version, yet it’s also the most straightforward and common one:

FROM ubuntu:latest

# Set up the path for our tooling and point uv's pip commands at the venv
ENV PATH=/venv/bin:$PATH
ENV PATH=/root/.cargo/bin:$PATH
ENV VIRTUAL_ENV=/venv

# Install curl (not included in the base Ubuntu image), then uv (because its caching is awesome)
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates
RUN curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a venv with Python 3.12.6, installing Python along the way
RUN uv venv --python 3.12.6 /venv

# Work out of /app so relative paths like requirements.txt resolve
WORKDIR /app

# Copy the full source code to be run
COPY src/yourapp /app

# Install the requirements via uv & requirements.txt
RUN uv pip install -r requirements.txt

# Run your app in a web server via the entry point
ENTRYPOINT [  \
    "/venv/bin/..." \
    ]

First, for those of you uninitiated in Docker, the steps are:

  1. Base a container on Ubuntu
  2. Set up some paths to keep the tooling simple later
  3. Use uv rather than plain pip to install Python and create a venv
  4. Copy your source code from a local src folder into /app
  5. Install the requirements (via uv) into your venv
  6. Run the app via some web server or something on the entry point

Note: If you’re wondering what’s up with all this uv stuff, I wrote about it a week or two ago.

Why is this abusive?

If any source file whatsoever changes, that will trigger the line below

# Copy the full source code to be run

to run again, which invalidates Docker’s cache for every layer after it and reinstalls all of the requirements.

Over at Talk Python Training, our web app has 205 packages in use (not just top-level ones, but the full transitive closure of them). 205! That is a massive load on pypi.org. Massive. And yet, most of the time, no packages have changed. This is especially true if you pin your dependencies as we do with uv pip compile. But with this Docker setup, we would still reinstall all of them, pulling fresh copies from PyPI through Fastly every time.

That is downright abusive and wasteful.
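(For reference, the pinning step mentioned above looks roughly like this; a requirements.in file holding just our top-level packages is a hypothetical name here, not something prescribed by this post.)

# Resolve the top-level packages into a fully pinned requirements.txt
uv pip compile requirements.in -o requirements.txt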

A simple change for the better

One very minor change (though not the final one) we can make is to copy just the requirements.txt file and install from it first, then overwrite it with the full app.

See the NEW section:

# ...

# NEW: Copy just the requirements.txt and install them
COPY src/yourapp/requirements.txt /app
RUN uv pip install -r requirements.txt

# Copy the full app to be run
COPY src/yourapp /app

# ...

Seems silly, right? If we are going to copy all the files anyway, why copy just the requirements.txt one first?

Because of Docker’s cached layers.

This change means you only install the requirements if the requirements.txt file itself has changed (or a line before it has, such as a new Ubuntu base image).

Way better! Now, most source changes don’t pull a new copy of requirements from PyPI and the container builds faster: win-win.

But…

If any requirement changes, we pull all of them by rerunning uv pip install -r requirements.txt. Ugh. It’s better, but not great.

One more layer

Let’s add one more change. Suppose your app is built on Flask and uses Beanie for an ODM against MongoDB. So your core requirements are flask, beanie, and that’s it. And let’s imagine your web server is gunicorn to boot.

We can add this new-new section:

# ...

# NEW-NEW: Install the latest of all main dependencies.
# Do NOT pin these versions, we want the latest here
RUN uv pip install flask beanie gunicorn

# Copy just the requirements.txt and install them
COPY src/yourapp/requirements.txt /app
RUN uv pip install -r requirements.txt

# Copy the full app to be run
COPY src/yourapp /app

# ...

That new line makes all the difference:

uv pip install flask beanie gunicorn

This layer will never change, never. Yet it pulls in essentially every dependency, because listing your top-level packages brings along their whole transitive closure. But you may be thinking:

Great, but it’ll get out of sync over time…

Yes, but think 205 dependencies, not 3. At first build time, your container (well, uv) will see that you already have all the dependencies when you run

uv pip install -r requirements.txt

It won’t hit PyPI again on subsequent builds, and it’ll run lightning fast because of uv’s caching and Docker’s caching on top of that.

Over time, they will drift. Maybe a new version of pydantic ships and it gets picked up when the requirements.txt line runs. The older, cached one gets uninstalled when you install the pinned requirements.txt (or pyproject.toml, or whatever).

Fine. But this only pulls the new pydantic from PyPI, not all 205 of the dependencies!

My experience is that most packages stay static for months if not years, and only a few change frequently. So why are we pulling all 205 from PyPI every time something minor changes? We shouldn’t be, right?

Eventually something will trigger a full rebuild (such as a new Ubuntu base image shipping). Then everything will update, get resynced, and realign.

And the cycle continues.
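If you ever want to force that full resync yourself rather than wait for a trigger, a no-cache build will do it (the image tag below is just a placeholder):

# Rebuild every layer from scratch, re-resolving and re-downloading everything
docker build --no-cache -t yourapp .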

And potentially some help from Docker

[This section added after lots of great feedback from readers.]

So far, we have cached our packages in the uv cache inside the container. Docker’s layering is great and will reuse that cache most of the time, as mentioned above. However, for full rebuilds, we will pull fresh copies of everything again. That happens when we explicitly ask for a full rebuild, but also when the base image or any lines prior to ours change.

Docker can optionally use a host-OS folder for a subsection of the image’s file system during build via the mount option. We can leverage this to keep the uv cache permanently on the host OS, so even full rebuilds mostly hit the local cache. Notice the addition of --mount=type=cache,target=/root/.cache:

# ...

# Copy just the requirements.txt and install them
COPY src/yourapp/requirements.txt /app
RUN --mount=type=cache,target=/root/.cache uv pip install -r requirements.txt

# Copy the full app to be run
COPY src/yourapp /app

# ...

This creates a cache folder managed by Docker that is mapped to /root/.cache inside the container at build time, redirecting uv’s cache (/root/.cache/uv) as well as pip’s cache (/root/.cache/pip) to a persistent local host folder.

In this setup, you can skip the ahead-of-time RUN uv pip install flask beanie gunicorn section, because after the first run everything will already be cached on the host OS. That’s a big plus.

However, with this setup the host OS becomes the source of caching for pip/uv. The packages will not be cached in the Docker image itself. So if you ship the container to other locations that build upon it, they’ll be starting from scratch. That’s the case if someone is docker pull-ing your image as their base image. So consider the use case before reaching for --mount.
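Putting the pieces together, here’s a sketch of what the cache-mount variant of the full Dockerfile might look like, based on the examples above (same placeholder paths and elided entry point; adjust it for your own app):

FROM ubuntu:latest

# Set up the path for our tooling and point uv's pip commands at the venv
ENV PATH=/venv/bin:$PATH
ENV PATH=/root/.cargo/bin:$PATH
ENV VIRTUAL_ENV=/venv

# Install curl (not included in the base Ubuntu image), then uv
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates
RUN curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a venv with Python 3.12.6, installing Python along the way
RUN uv venv --python 3.12.6 /venv

# Work out of /app so relative paths like requirements.txt resolve
WORKDIR /app

# Copy just the requirements.txt and install them,
# reusing uv's cache on the host OS across full rebuilds
COPY src/yourapp/requirements.txt /app
RUN --mount=type=cache,target=/root/.cache uv pip install -r requirements.txt

# Copy the full app to be run
COPY src/yourapp /app

# Run your app in a web server via the entry point
ENTRYPOINT [  \
    "/venv/bin/..." \
    ]

Note that the ahead-of-time RUN uv pip install flask beanie gunicorn line is omitted here, since the host-side cache covers that job.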

And for CI on GitHub

One of the largest consumers of pypi.org traffic has to be CI. And among CI providers, GitHub is probably the biggest. CI is pretty resistant to any of the caching help we’ve discussed so far. Luckily, GitHub Actions has a cache option just for this. Here’s the pip example from GitHub. If you’re using uv instead, cache uv’s directory (/root/.cache/uv) rather than pip’s.
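For what it’s worth, adapting GitHub’s pip example to uv might look roughly like this. It’s a sketch assuming actions/cache@v4 and uv’s default cache location under the runner’s home directory (the /root/.cache/uv path above applies when the job runs as root):

- name: Cache uv downloads
  uses: actions/cache@v4
  with:
    # uv's default cache directory for the runner user
    path: ~/.cache/uv
    key: ${{ runner.os }}-uv-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-uv-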

The take-away

Adding just a few small and simple sections to your Dockerfile could reduce your load on PyPI by over 100x (for real, no hyperbole here).

First, we pre-cache and install all of the core dependencies of our app like this:

# Install all of the main dependencies
# Do NOT pin these versions, we want the latest here
RUN uv pip install flask beanie gunicorn

Second, we only update these if the requirements.txt file (or pyproject.toml, or whatever) changes:

# Copy just the requirements.txt and install them
COPY src/yourapp/requirements.txt /app
RUN uv pip install -r requirements.txt

It’ll make your containers build way, way faster once everything is cached, it’ll take a massive load off of PyPI, and it’ll do the PSF and the whole Python community a favor by making PyPI much cheaper to operate for all of us.

You can find a working example in a previous post, Docker images using uv’s python. That one covered some of these ideas but focused on making Docker builds fast for Python apps, rather than the focus of this essay, which is all about being kind to the PSF and PyPI.