Disclaimer
Before reading on: you should only need this trick if you cannot use Plug and Play (PnP) from either yarn or pnpm. PnP does not have this issue because it loads modules that are stored as zip files. A zipped module is a single file, versus the tens if not hundreds of thousands of files that can come from your dependencies, their dependencies, and so on.
Secondly, your mileage may vary with this trick. As anecdotal evidence, in one project that I worked on, this change saved 5-10 minutes overall in CI. Monolithic repos will see a win, but for tiny Node apps the impact is likely to be negligible.
Problem
If you use Docker and are stuck with a large (e.g. 1GB+) project shipping with a “classic” node_modules directory then, unless you’re extremely disciplined about dependencies, you’ll run into this problem:
When a node_modules folder is installed via a Dockerfile, it produces a single Docker layer. That layer has to be extracted to disk on a docker pull, and it also has to be exported from BuildKit after the build phase has completed. Having to negotiate with a disk for lots of tiny files hurts I/O. This in turn slows down your Docker image creation and retrieval commands. In CI systems, a cache bust will also slow down the rest of your build chain, e.g. slower deployment steps due to a longer pull/extraction time.
A fudged solution
It turns out that a good chunk of the assets in a node_modules directory are redundant for running your Node application. For example, if we consider a vanilla JavaScript project, all of the editor-related tooling (e.g. linting config, editorconfig, Flow types, TypeScript definitions) is not required in CI. I’ve even spotted CI config being shipped in some libraries.
If we take the most basic Dockerfile to illustrate this:
FROM node:20-slim
COPY package.json package-lock.json ./
RUN npm i --omit=dev
...
And change it to this:
FROM node:20-slim
COPY package.json package-lock.json ./
# Adapt this to your needs. For example, I don't use TypeScript so I don't need .ts or .d.ts files,
# but if you build a TypeScript app you'll need them. I'm also not worried about source map files from third-party libraries.
# Note that "*.md" and "*.txt" also match license files such as LICENSE.md or LICENSE.txt.
RUN npm i --omit=dev && find ./node_modules -type f \( -iname "*.md" -o -iname "*.yaml" -o -iname "*.yml" -o -iname "*.txt" -o -iname ".nycrc" -o -iname "*.d.*" -o -iname "*.flow" -o -iname "*.ts" -o -iname "*.map" -o -iname "*.eslintrc" -o -iname "*.npmignore" -o -iname "*.editorconfig" \) -delete
...
We can cut down the number of files that are included in the layer generated by the RUN instruction. The examples above omit dev dependencies, but this trick is particularly worth doing for pipeline phases that need to include dev dependencies.
Note that the removal step has to be in the same RUN instruction as the npm i. Otherwise, the install and the removal end up in separate Docker layers. The deleted files then still exist in the earlier layer, so the bloat is still in the image history and you won’t get any of the performance gains.
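Before wiring a -delete into a Dockerfile, it can help to do a dry run locally. The sketch below is one way to do that, assuming a shell with `local` support such as bash. The function names list_prunable and prune are hypothetical, and the pattern list is only a subset chosen for illustration, not a recommended set.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: the names and the pattern list are illustrative only.

# Print every file under the given directory that the prune step would remove.
# Note: "*.md" also matches files such as LICENSE.md.
list_prunable() {
  find "$1" -type f \( \
    -iname "*.md" -o -iname "*.map" -o -iname "*.yml" -o -iname "*.yaml" \
    -o -iname ".editorconfig" -o -iname ".eslintrc" -o -iname ".npmignore" \
  \) -print
}

# Delete the same set of files and report how many were removed.
prune() {
  local before after
  before=$(find "$1" -type f | wc -l)
  list_prunable "$1" | while IFS= read -r f; do rm -f -- "$f"; done
  after=$(find "$1" -type f | wc -l)
  echo "removed $((before - after)) of $((before)) files"
}
```

Running list_prunable ./node_modules first lets you review the matches; prune ./node_modules then removes them. File names containing newlines are not handled, which should be rare in practice.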
Commands for inspecting your node modules folder
To poke around your directory and figure out what you can/cannot delete, use these two bash snippets:
# find total number of files in the current directory
find . -type f | wc -l
# find all files, extract each file's extension with sed (files without an extension are skipped),
# count the number of occurrences per extension, then sort numerically by that count
find . -type f | sed -n 's/.*\.\([^./]*\)$/\1/p' | sort | uniq -c | sort -n
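The per-extension count shows which file types dominate. A complementary view is which dependencies contribute the most files. Here is a minimal sketch, assuming it is pointed at a node_modules directory; the function name files_per_package is hypothetical, and scoped packages end up grouped under their @scope directory.

```shell
# Hypothetical helper: count files per top-level entry of a node_modules directory.
files_per_package() {
  # Run in a subshell so paths look like ./<package>/...; field 2 is the package.
  # -path './*/*' skips files sitting directly in the top-level directory.
  (cd "$1" && find . -type f -path './*/*' | cut -d/ -f2 | sort | uniq -c | sort -n)
}
```

Calling files_per_package ./node_modules lists the heaviest packages last, which is a reasonable place to start looking for things to prune.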
It’s frustrating that I have to do this in the first place just to try and mitigate dependency bloat. For personal projects I’ve moved away from Node to languages that ship as a single binary to avoid this problem altogether.
Other options
I’m just going to focus on Node here. Single executable applications are on the horizon, but the feature is still experimental. I’m also unsure whether it will even work with third-party dependencies or just with modules built into the Node runtime.
Caveat - The legal stuff
There are actually even more files that could be deleted. In theory, you’d have one LICENSE file per dependency.
I’m not a lawyer, but I do know there’s at least one legal thing worth calling out. Most if not all third-party libraries will require you to ship the LICENSE file with your distributed application. For example, the MIT License has a statement along these lines in it:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Therefore, for projects that are published in some shape or form I would just ship the license agreements.
Alternatively, use a separate build phase (e.g. another Dockerfile) to retrieve all of the license agreements. That way you can still provide them as part of a licenses section/page on your site.
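A separate license-gathering step could look something like the sketch below. It copies every file whose name starts with LICENSE or LICENCE out of node_modules into another directory, keeping the package path so you can tell which license belongs to which dependency. The function name collect_licenses and the output layout are assumptions for illustration. Packages can also keep license text in other files, such as COPYING or NOTICE, so treat the result as a starting point rather than a complete inventory.

```shell
# Hypothetical helper: copy license files from <src> into <dest>, preserving paths.
collect_licenses() {
  local src=$1 dest=$2 file rel
  # List matching files relative to <src>, then copy them one by one.
  (cd "$src" && find . -type f \( -iname "LICENSE*" -o -iname "LICENCE*" \)) |
  while IFS= read -r file; do
    rel=${file#./}                      # strip the leading "./"
    mkdir -p "$dest/$(dirname "$rel")"  # recreate the package directory
    cp "$src/$rel" "$dest/$rel"
  done
}
```

For example, collect_licenses ./node_modules ./licenses would need to run before any pruning step that deletes *.md or *.txt files, since those patterns also match LICENSE.md and LICENSE.txt.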