Benchmarking GPU sharing strategies in Kubernetes
This writeup is the conclusion of my previous post. If you don’t know what MIG, MPS and Time Slicing do, I’d suggest reading that one first.
Before talking about the results, there’s one thing worth calling out in the PyTorch notes on CUDA:
By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later.
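In practice this means that if you time a GPU workload naively with wall-clock time, you can end up measuring how long the work took to queue rather than how long it took to run. Here’s a minimal sketch of how you might time a CUDA operation in PyTorch with an explicit sync; the matrix multiply is just a stand-in workload, not the actual benchmark:

```python
import torch

# Stand-in workload; in the real benchmarks this would be the model under test.
device = torch.device("cuda")
x = torch.randn(4096, 4096, device=device)

# CUDA events are recorded on the GPU stream itself, so the measured interval
# covers the actual kernel execution rather than just the time taken to enqueue it.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x @ x  # the operation being timed
end.record()

# Block the CPU until all queued work (including the recorded events) has finished.
torch.cuda.synchronize()

print(f"elapsed: {start.elapsed_time(end):.2f} ms")
```

Without that final synchronize (or a CUDA event-based timer like the one above), the timing loop can return before the GPU has actually done the work.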
Understanding GPU sharing strategies in Kubernetes
[UPDATE]
I’ve now posted my benchmarks, so if you just want to see the stats, have a read of benchmarking-gpu-sharing-strategies-in-kubernetes.
These notes are aimed at anyone who wants to set up Nvidia GPU sharing strategies within k8s without having to trawl through a lot of cryptic and dense Nvidia documentation. I’m also aiming for a high-level ELI5, based on the knowledge I’ve gained so far, of the three sharing strategies: MIG, MPS and Time Slicing.
A trick to speed up and slim down bloated Node images
Disclaimer
Before reading: you should only need to do this if you cannot use Plug’n’Play (PnP) from either yarn or pnpm. PnP doesn’t have this issue, because it loads modules that are stored as zip files. A zipped module is a single file, versus the potentially tens or even hundreds of thousands of files coming from your dependencies of dependencies of dependencies, and so on.