A benchmarking script for AI video subnet jobs

As development on the go-livepeer updates for the AI video subnet progresses, I wanted to share a benchmarking script that may be of interest to the community for starting to experiment with running text-to-image, image-to-image and image-to-video jobs on GPUs and gathering metrics.

Note: Only Nvidia GPUs are supported right now.

Getting started

git clone https://github.com/livepeer/ai-worker.git
cd ai-worker/runner

The README in the runner directory contains instructions for running the benchmarking script.

The dl_checkpoints.sh script contains the current list of models that have been tested.
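
If you haven't pulled the runner image or downloaded any checkpoints yet, something along the following lines should get you going. This is only a sketch: defer to the README for the exact steps, since the image tag and the directory that dl_checkpoints.sh writes to are assumptions here.

# Pull the prebuilt runner image used in the commands below
docker pull livepeer/ai-runner:latest

# Download the tested checkpoints (assumed here to land in ./models, which is
# the host directory mounted into the container via -v ./models:/models)
./dl_checkpoints.sh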

Example benchmark run on an RTX 3090

Let’s benchmark the image-to-video pipeline using the stabilityai/stable-video-diffusion-img2vid-xt (SVD) model. By default, the script will run inference with the pipeline once.

docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

Output (extra logs omitted):

----AGGREGATE METRICS----


pipeline load time: 2.661s
pipeline load max GPU memory allocated: 4.231GiB
pipeline load max GPU memory reserved: 4.441GiB
avg inference time: 95.533s
avg inference time per output: 95.533s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 21.695GiB

The script output shows the time it took to load the pipeline/model into VRAM, the max VRAM consumed while loading the pipeline/model, the average inference time for the pipeline/model, and the average max VRAM consumed during inference.
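
A single run can be noisy, so before comparing numbers it may be worth averaging over several runs. The --runs flag (used in some of the benchmarks later in this thread) appears to control this; for example:

# Average the inference metrics over 3 runs instead of the default single run
docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest \
    python bench.py --pipeline image-to-video \
    --model_id stabilityai/stable-video-diffusion-img2vid-xt --runs 3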

We can also benchmark the same pipeline and model with optimizations enabled to observe the difference in performance and resource consumption. At the moment, the stable-fast optimization is supported, so let's enable that.

docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

Output (extra logs omitted):

----AGGREGATE METRICS----


pipeline load time: 2.361s
pipeline load max GPU memory allocated: 4.286GiB
pipeline load max GPU memory reserved: 4.559GiB
avg warmup inference time: 97.994s
avg warmup inference time per output: 97.994s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 23.078GiB
avg inference time: 70.525s
avg inference time per output: 70.525s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 23.078GiB

The first few inference runs for a model using stable-fast will be slower because the model is dynamically compiled, so the benchmarking script tracks the metrics for "warmup" inference separately.

We can see that when stable-fast is enabled, the image-to-video pipeline with SVD is ~26% faster after the warmup runs (95.533s → 70.525s average inference time) and also reserves somewhat more VRAM during inference (~1.4 GiB more reserved in this run).

Opportunities

A few (non-exhaustive) opportunities that the community might be interested in independently exploring:

  • Experiment with --batch_size. This parameter controls the number of outputs in a batch generated by a diffusion pipeline (see this for more information). Using higher batch sizes for inference usually has the potential to increase throughput at the cost of more VRAM, but interestingly the tradeoff didn't seem worth it in my early tests discussed here. Does this hold true for all diffusion models? Is this inherent to diffusion models or due to some quirk of the diffusers library? (A starting-point sweep is sketched after this list.)
  • Are there any other optimizations available that are either additive to or better than stable-fast (maybe DeepCache?)? Try them out and share the benchmarks!
  • Compare these benchmark metrics with existing data on running diffusion models on different GPUs.
  • Share the benchmark metrics for GPUs that you have access to.
  • Fork/improve the benchmarking script with any other relevant metrics.
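
As a starting point for the --batch_size experiments mentioned in the first bullet, a sweep along these lines could be used. The batch sizes and the pipeline/model pairing are illustrative only, and larger batches will hit VRAM limits on smaller cards:

# Hypothetical sweep over batch sizes for the text-to-image pipeline with sd-turbo
for bs in 1 2 4 8; do
    docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest \
        python bench.py --pipeline text-to-image \
        --model_id stabilityai/sd-turbo --runs 3 --batch_size "$bs"
done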

What else could be useful/interesting? Share below!

3 Likes

Did my first benchmark with my RTX 4070 (laptop), AMD Ryzen 9 7940HS, 16.0 GB RAM, NVMe SSD, Windows 11.

docker run --gpus 0 -v C:\local\models\models:/models livepeer/ai-runner:latest python bench.py --pipeline text-to-image --model_id stabilityai/sd-turbo --runs 3

Results:

----AGGREGATE METRICS----


pipeline load time: 134.418s
pipeline load max GPU memory allocated: 2.419GiB
pipeline load max GPU memory reserved: 2.465GiB
avg inference time: 0.763s
avg inference time per output: 0.763s
avg inference max GPU memory allocated: 3.023GiB
avg inference max GPU memory reserved: 3.609GiB

Will mess around with the other commands and models as well.
@yondon do you know what is considered a good benchmark? I see mine is quite different than yours. Or are we still in the exploring phase of finding out what’s possible?
Good work so far! :partying_face:

1 Like

Just ran my benchmark, GPU was running HOT HOT HOT :fire:

docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

----AGGREGATE METRICS----


pipeline load time: 1.589s
pipeline load max GPU memory allocated: 2.420GiB
pipeline load max GPU memory reserved: 2.480GiB
avg inference time: 0.266s
avg inference time per output: 0.266s
avg inference max GPU memory allocated: 3.024GiB
avg inference max GPU memory reserved: 3.625GiB

`docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt`

pipeline load time: 1.997s
pipeline load max GPU memory allocated: 4.283GiB
pipeline load max GPU memory reserved: 4.543GiB
avg warmup inference time: 61.645s
avg warmup inference time per output: 61.645s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 15.219GiB
avg inference time: 47.712s
avg inference time per output: 47.712s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 15.219GiB


+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4080        Off | 00000000:0B:00.0 Off |                  N/A |
| 33%   56C    P2             294W / 320W |  15373MiB / 16376MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
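
For reference, the utilization trace below looks like nvidia-smi dmon output (the exact columns vary by driver version). To capture something similar while the benchmark runs, start this in a second terminal:

# Stream per-second samples of GPU power, temperature, utilization and clocks
nvidia-smi dmon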

# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
    0    285     56      -    100     53      0      0      0      0  10802   2595
    0    295     52      -    100     41      0      0      0      0  10802   2475
    0    284     53      -     99     50      0      0      0      0  10802   2595
    0    237     48      -    100     72      0      0      0      0  10802   2745
    0    245     46      -    100     68      0      0      0      0  10802   2760
    0    236     49      -    100     47      0      0      0      0  10802   2745
    0    217     44      -     83     64      0      0      0      0  10802   2760
    0     70     49      -     32     14      0      0      0      0  10802   2745
    0    257     56      -    100     37      0      0      0      0  10802   2505
    0    288     54      -    100     56      0      0      0      0  10802   2625
    0    294     54      -    100     43      0      0      0      0  10802   2595
    0    282     57      -    100     46      0      0      0      0  10802   2640
    0    295     55      -    100     53      0      0      0      0  10802   2565
    0    284     56      -    100     56      0      0      0      0  10802   2520
    0    287     54      -    100     56      0      0      0      0  10802   2745
    0    296     54      -    100     37      0      0      0      0  10802   2445
    0    284     55      -    100     48      0      0      0      0  10802   2730
    0    301     53      -    100     52      0      0      0      0  10802   2700
    0    287     57      -    100     54      0      0      0      0  10802   2475
    0    294     58      -    100     54      0      0      0      0  10802   2565
    0    293     56      -    100     39      0      0      0      0  10802   2520
    0    289     55      -    100     51      0      0      0      0  10802   2550
    0    294     55      -    100     54      0      0      0      0  10802   2475
    0    283     55      -    100     52      0      0      0      0  10802   2595
    0    295     54      -    100     48      0      0      0      0  10802   2745
    0    291     57      -    100     38      0      0      0      0  10802   2445
    0    284     52      -    100     50      0      0      0      0  10802   2745
    0    299     57      -    100     41      0      0      0      0  10802   2550
    0    286     57      -    100     52      0      0      0      0  10802   2670
    0    295     57      -    100     52      0      0      0      0  10802   2640
    0    287     58      -    100     58      0      0      0      0  10802   2535
    0    291     57      -    100     55      0      0      0      0  10802   2640
    0    297     55      -    100     36      0      0      0      0  10802   2475
    0    283     56      -    100     44      0      0      0      0  10802   2745
    0    297     55      -    100     50      0      0      0      0  10802   2745
    0    284     56      -    100     54      0      0      0      0  10802   2505
    0    286     53      -    100     55      0      0      0      0  10802   2550
    0    293     57      -    100     37      0      0      0      0  10802   2475
    0    286     55      -    100     45      0      0      0      0  10802   2595
    0    294     57      -    100     52      0      0      0      0  10802   2520
    0    282     57      -    100     52      0      0      0      0  10802   2625
    0    289     54      -    100     50      0      0      0      0  10802   2745
    0    289     54      -    100     42      0      0      0      0  10802   2385
    0    288     52      -    100     46      0      0      0      0  10802   2745

I see mine is quite different than yours.

Your benchmark is for the text-to-image pipeline with the stabilityai/sd-turbo model, while my benchmark in the OP is for the image-to-video pipeline with the stabilityai/stable-video-diffusion-img2vid-xt model, which explains the significant difference in metrics. Generally, video models will be slower and will consume more VRAM than image models.

do you know what is considered a good benchmark?

Still collecting data at this point.

It would be helpful for others if the command used (including the configuration, which indicates the pipeline and model ID) were shared as well!

1 Like

Benchmark with a 4070 Ti Super 16 GB, 32 GB RAM, Ubuntu 22.04, AMD Ryzen 7

docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

----AGGREGATE METRICS----


pipeline load time: 1.289s
pipeline load max GPU memory allocated: 4.283GiB
pipeline load max GPU memory reserved: 4.543GiB
avg warmup inference time: 72.813s
avg warmup inference time per output: 72.813s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 15.219GiB
avg inference time: 58.279s
avg inference time per output: 58.279s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 15.219GiB
sudo docker run -e SFAST=true --gpus all -v /models:/models livepeer/ai-runner:latest python bench.py --pipeline text-to-image --model_id stabilityai/sd-turbo --runs 3

----AGGREGATE METRICS----


pipeline load time: 1.351s
pipeline load max GPU memory allocated: 2.476GiB
pipeline load max GPU memory reserved: 2.588GiB
avg warmup inference time: 2.440s
avg warmup inference time per output: 2.440s
avg warmup inference max GPU memory allocated: 2.869GiB
avg warmup inference max GPU memory reserved: 3.064GiB
avg inference time: 0.051s
avg inference time per output: 0.051s
avg inference max GPU memory allocated: 2.869GiB
avg inference max GPU memory reserved: 3.064GiB

I initially got the same results using Docker Desktop on Windows 11.

Everything became normal using Docker in Ubuntu outside of a VM.

Benchmark with a 4090, 64 GB RAM, Ubuntu 22.04, Intel 14900

docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

----AGGREGATE METRICS----

pipeline load time: 1.145s
pipeline load max GPU memory allocated: 4.286GiB
pipeline load max GPU memory reserved: 4.523GiB
avg warmup inference time: 41.376s
avg warmup inference time per output: 41.376s
avg warmup inference max GPU memory allocated: 13.325GiB
avg warmup inference max GPU memory reserved: 20.818GiB
avg inference time: 32.068s
avg inference time per output: 32.068s
avg inference max GPU memory allocated: 13.325GiB
avg inference max GPU memory reserved: 20.818GiB

docker run -e SFAST=true --gpus 0 -v /models:/models livepeer/ai-runner:latest python bench.py --pipeline text-to-image --model_id stabilityai/sd-turbo --runs 3

----AGGREGATE METRICS----

pipeline load time: 1.363s
pipeline load max GPU memory allocated: 2.475GiB
pipeline load max GPU memory reserved: 2.590GiB
avg warmup inference time: 1.772s
avg warmup inference time per output: 1.772s
avg warmup inference max GPU memory allocated: 2.868GiB
avg warmup inference max GPU memory reserved: 3.068GiB
avg inference time: 0.032s
avg inference time per output: 0.032s
avg inference max GPU memory allocated: 2.868GiB
avg inference max GPU memory reserved: 3.068GiB

docker run -e SFAST=true --gpus 0 -v /models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-image --model_id stabilityai/sd-turbo --runs 3

----AGGREGATE METRICS----

pipeline load time: 0.913s
pipeline load max GPU memory allocated: 2.476GiB
pipeline load max GPU memory reserved: 2.588GiB
avg warmup inference time: 2.540s
avg warmup inference time per output: 2.540s
avg warmup inference max GPU memory allocated: 4.194GiB
avg warmup inference max GPU memory reserved: 5.598GiB
avg inference time: 0.151s
avg inference time per output: 0.151s
avg inference max GPU memory allocated: 4.194GiB
avg inference max GPU memory reserved: 5.598GiB

4060 Ti 16 GB
Intel i7-10700, 32 GB RAM (one core is pinned at 100% most of the time inference is running)
Ubuntu 20.04

docker run -e SFAST=true --gpus '"device=1"' -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

----AGGREGATE METRICS----

pipeline load time: 2.691s
pipeline load max GPU memory allocated: 4.286GiB
pipeline load max GPU memory reserved: 4.559GiB
avg warmup inference time: 136.927s
avg warmup inference time per output: 136.927s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 15.203GiB
avg inference time: 113.575s
avg inference time per output: 113.575s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 15.203GiB

sudo docker run -e SFAST=true --gpus '"device=1"' -v /models:/models livepeer/ai-runner:latest python bench.py --pipeline text-to-image --model_id stabilityai/sd-turbo --runs 3

Increasing the batch size up to 20 on sdxl-turbo provides about 7s per output.

----AGGREGATE METRICS----

pipeline load time: 46.385s
pipeline load max GPU memory allocated: 2.476GiB
pipeline load max GPU memory reserved: 2.588GiB
avg warmup inference time: 3.643s
avg warmup inference time per output: 3.643s
avg warmup inference max GPU memory allocated: 2.870GiB
avg warmup inference max GPU memory reserved: 3.064GiB
avg inference time: 0.089s
avg inference time per output: 0.089s
avg inference max GPU memory allocated: 2.870GiB
avg inference max GPU memory reserved: 3.064GiB

I see many people posting results from runs of this script against one or two of the models. This is certainly helpful anecdotally, but the question remains: how can we make the output of these benchmarks useful to current or aspiring Orchestrators?

My thought is that in the future O's will be making hardware investment decisions based upon their ability to perform inference well enough to retain work on the network. And for this, they'll want glanceable answers showing inference times and memory usage across different model, card, and VRAM combinations.

I suspect a Livepeer Open Network grant would certainly be available if anyone wants to extend this benchmarking script to do the following.

  1. In one simple run, produce benchmarks for ALL of the supported models, rather than requiring the user to just run it arbitrarily for one or two models and leave out the rest.
  2. Produce the output in a parseable format, such as CSV, so that scripts can be written to analyze the data. (Consider including some standardization of the cards + VRAM configs in the output so similar results can be compared. Consider an easy optional "submission" of this output to the community, such as automatically posting the CSV output to some collector endpoint or even just emailing it to yourself as the aggregator.) A rough sketch covering items 1 and 2 follows this list.
  3. Organize some easily visible table on a wiki somewhere that shows the average benchmarks for each model under each card/VRAM combo. Maintain this and update it with some frequency.
  4. Generally improve the benchmarking script beyond the initial implementation wherever it can better or more optimally use the hardware, so it stays representative of how O's would actually set up to run inference as the Livepeer implementation improves.
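
As a rough starting point for items 1 and 2, a wrapper along these lines could loop over the tested models and pull the headline aggregate metrics out of bench.py's output into a CSV. Everything here is a sketch: the model list is limited to the IDs that appear in this thread, and the grep patterns assume the ----AGGREGATE METRICS---- log format shown above, which may change.

#!/bin/bash
# Hypothetical driver: run bench.py for several pipeline/model pairs and collect
# the key aggregate metrics into results.csv.
set -e

OUT=results.csv
echo "pipeline,model,avg_inference_time_s,avg_inference_max_gpu_mem_alloc_gib" > "$OUT"

bench () {
    local pipeline="$1" model="$2"
    local log t mem
    log=$(docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest \
        python bench.py --pipeline "$pipeline" --model_id "$model" --runs 3 2>&1)
    # Pull the two headline numbers out of the aggregate metrics block
    t=$(echo "$log" | grep -m1 "avg inference time:" | grep -oE "[0-9]+\.[0-9]+")
    mem=$(echo "$log" | grep -m1 "avg inference max GPU memory allocated:" | grep -oE "[0-9]+\.[0-9]+")
    echo "$pipeline,$model,$t,$mem" >> "$OUT"
}

# Only the fully-specified model IDs from this thread are listed; add the rest of
# the tested models (see dl_checkpoints.sh) and re-run with -e SFAST=true for the
# stable-fast variants.
bench text-to-image stabilityai/sd-turbo
bench image-to-video stabilityai/stable-video-diffusion-img2vid-xt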

I think this organized benchmark resource would be actionable and valuable to O’s in the community. Anyone agree or disagree, or want to take this on?

4 Likes

Hi everyone :wave:

So I’ve created a spreadsheet where we can keep track of our benchmarking.

Just head to the two tabs at the bottom and fill in your results for avg inference time and avg inference max GPU memory allocated.

This benchmarking sheet includes all 6 models, for a total of 22 benchmarks, with each model having one run with the SFAST flag and one without.

  • sd-turbo
  • sdxl-turbo
  • stable-diffusion-v1-5
  • stable-diffusion-xl-base-1.0
  • openjourney-v4
  • stable-video-diffusion-img2vid-xt

Here is a copy of the benchmark scripts to run. Just replace the path in the -v flag with your local storage volume that has the models in it.

It is quite a bit of work to do this manually. I may be able to write a script to execute all of these and automatically put the results into the spreadsheet format, but we can start with this for now.

Hopefully we can get some solid data to work on :+1:

1 Like

Hi @yondon, is there a flag to adjust the batch size?
My 4070 laptop is running out of memory on image-to-video inference.

Yep, there is a --batch_size flag.
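
It's passed the same way as the other bench.py flags in this thread; for example (the value here is just an illustration):

docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest \
    python bench.py --pipeline image-to-video \
    --model_id stabilityai/stable-video-diffusion-img2vid-xt --batch_size 1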

Also just fyi, if you don’t have a big enough power supply I believe the benchmarking will shut down your machine. Pretty heavy stuff!