As development on the go-livepeer updates for the AI video subnet progresses, I wanted to share a benchmarking script that may be of interest to the community for experimenting with running text-to-image, image-to-image and image-to-video jobs on GPUs and gathering metrics.
Note: Only Nvidia GPUs are supported right now.
Getting started
git clone https://github.com/livepeer/ai-worker.git
cd ai-worker/runner
The README in the runner directory contains instructions for running the benchmarking script.
The dl_checkpoints.sh script contains the current list of models that have been tested.
Example benchmark run on an RTX 3090
Let’s benchmark the image-to-video pipeline using the stabilityai/stable-video-diffusion-img2vid-xt (SVD) model. By default, the script will run inference with the pipeline once.
docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt
Output (extra logs omitted):
----AGGREGATE METRICS----
pipeline load time: 2.661s
pipeline load max GPU memory allocated: 4.231GiB
pipeline load max GPU memory reserved: 4.441GiB
avg inference time: 95.533s
avg inference time per output: 95.533s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 21.695GiB
The script output shows the time it took to load the pipeline/model into VRAM, the max VRAM consumed while loading the pipeline/model, the average inference time for the pipeline/model and the average max VRAM consumed during inference.
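For anyone curious how numbers like these can be collected, below is a minimal Python sketch of the general approach (timing the load and reading PyTorch's peak-memory counters). It illustrates the idea rather than bench.py's exact implementation; the model ID is simply the one used above.

# Minimal sketch: time a pipeline load and read PyTorch's peak-VRAM counters.
# This mirrors the idea behind the metrics above, not bench.py's exact code.
import time

import torch
from diffusers import StableVideoDiffusionPipeline

GiB = 1024 ** 3

torch.cuda.reset_peak_memory_stats()
start = time.time()
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
print(f"pipeline load time: {time.time() - start:.3f}s")
print(f"pipeline load max GPU memory allocated: {torch.cuda.max_memory_allocated() / GiB:.3f}GiB")
print(f"pipeline load max GPU memory reserved: {torch.cuda.max_memory_reserved() / GiB:.3f}GiB")

# The same pattern (reset the peak stats, time the call, read the counters)
# wrapped around the inference call yields the "avg inference" numbers.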
We can also benchmark the same pipeline and model with optimizations enabled to observe the difference in performance and resource consumption. At the moment, stable-fast is the only supported optimization, so let's enable it.
docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt
Output (extra logs omitted):
----AGGREGATE METRICS----
pipeline load time: 2.361s
pipeline load max GPU memory allocated: 4.286GiB
pipeline load max GPU memory reserved: 4.559GiB
avg warmup inference time: 97.994s
avg warmup inference time per output: 97.994s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 23.078GiB
avg inference time: 70.525s
avg inference time per output: 70.525s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 23.078GiB
The first few inference runs for a model using stable-fast will be slower because the model is dynamically compiled, so the benchmarking script tracks the metrics for “warmup” inference separately.
We can see that when stable-fast is enabled, the image-to-video pipeline with SVD is ~26% faster after the warmup runs, while reserving additional VRAM (~1.4 GiB).
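For reference, enabling stable-fast on a diffusers pipeline in plain Python looks roughly like the sketch below, based on stable-fast's documented usage; the exact wiring behind the SFAST=true flag inside the runner may differ.

# Rough sketch of enabling stable-fast on a diffusers pipeline, based on the
# library's documented usage; the runner's SFAST=true wiring may differ.
import torch
from diffusers import StableVideoDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig, compile

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

config = CompilationConfig.Default()
config.enable_xformers = True   # optional; requires xformers
config.enable_triton = True     # optional; requires triton
config.enable_cuda_graph = True
pipe = compile(pipe, config)    # the first few calls are the slow "warmup" runs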
Opportunities
A few (non-exhaustive) opportunities that the community might be interested in independently exploring:
- Experiment with --batch_size. This parameter controls the number of outputs in a batch generated by a diffusion pipeline (see this for more information). Using higher batch sizes for inference usually has the potential to increase throughput at the cost of additional VRAM, but interestingly the tradeoff didn't seem worth it in my early tests discussed here. Does this hold true for all diffusion models? Is this inherent to diffusion models or due to some quirk of the diffusers library? (See the sketch after this list for what a larger batch means at the pipeline level.)
- Are there any other optimizations available that are either additive to or better than stable-fast (maybe DeepCache?)? Try them out and share the benchmarks!
- Compare benchmark metrics for GPUs with existing data on running diffusion models with different GPUs.
- Share the benchmark metrics for GPUs that you have access to.
- Fork/improve the benchmarking script with any other relevant metrics.
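To make the batch size tradeoff concrete, here is a minimal sketch of what a larger batch means at the diffusers level, using a text-to-image pipeline and an illustrative model; the script's --batch_size flag controls this same kind of knob.

# Minimal sketch of generating a batch of outputs in one pipeline call with
# diffusers. The model ID is illustrative only.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16
).to("cuda")

# num_images_per_prompt produces several outputs in a single batched denoising
# pass: potentially higher throughput, at the cost of extra VRAM.
images = pipe(
    "a photo of an astronaut riding a horse",
    num_images_per_prompt=4,
    num_inference_steps=1,
    guidance_scale=0.0,
).images
print(len(images))  # 4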
What else could be useful/interesting? Share below!