A benchmarking script for AI video subnet jobs

As development on the go-livepeer updates for the AI video subnet progresses, I wanted to share a benchmarking script that could be of interest to the community for starting to experiment with running text-to-image, image-to-image, and image-to-video jobs on GPUs and gathering metrics.

Note: Only Nvidia GPUs are supported right now.

Getting started

git clone https://github.com/livepeer/ai-worker.git
cd ai-worker/runner

The README in the runner directory contains instructions for running the benchmarking script.

The dl_checkpoints.sh script contains the current list of models that have been tested.

Example benchmark run on an RTX 3090

Let’s benchmark the image-to-video pipeline using the stabilityai/stable-video-diffusion-img2vid-xt (SVD) model. By default, the script will run inference with the pipeline once.

docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

Output (extra logs omitted):

----AGGREGATE METRICS----


pipeline load time: 2.661s
pipeline load max GPU memory allocated: 4.231GiB
pipeline load max GPU memory reserved: 4.441GiB
avg inference time: 95.533s
avg inference time per output: 95.533s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 21.695GiB

The script output shows metrics on the time it took to load the pipeline/model into VRAM, the max VRAM consumed by loading the pipeline/model, the average inference time for the pipeline/model, and the average max VRAM consumed during inference.
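
For anyone curious where these numbers come from, they can be reproduced with PyTorch's CUDA memory tracking APIs. Here is a minimal sketch of the measurement pattern (not necessarily the exact code in bench.py):

import time
import torch
from diffusers import DiffusionPipeline

GiB = 1024 ** 3

torch.cuda.reset_peak_memory_stats()
start = time.time()
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
print(f"pipeline load time: {time.time() - start:.3f}s")
print(f"pipeline load max GPU memory allocated: {torch.cuda.max_memory_allocated() / GiB:.3f}GiB")
print(f"pipeline load max GPU memory reserved: {torch.cuda.max_memory_reserved() / GiB:.3f}GiB")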

We can also benchmark the same pipeline and model with optimizations enabled to observe the difference in performance and resource consumption. At the moment, the stable-fast optimization is supported, so let's enable it.

docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

Output (extra logs omitted):

----AGGREGATE METRICS----


pipeline load time: 2.361s
pipeline load max GPU memory allocated: 4.286GiB
pipeline load max GPU memory reserved: 4.559GiB
avg warmup inference time: 97.994s
avg warmup inference time per output: 97.994s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 23.078GiB
avg inference time: 70.525s
avg inference time per output: 70.525s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 23.078GiB

The first few inference runs for a model using stable-fast will be slower because the model is dynamically compiled, so the benchmarking script tracks the metrics for “warmup” inference separately.
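
Conceptually, separating warmup from steady-state runs looks something like the sketch below (a simplified illustration, not the actual bench.py code; the warmup run count shown is an assumption):

import time

def benchmark(run_inference, runs, warmup_runs=1):
    # warmup_runs=1 is an illustrative default; bench.py may use a different count.
    warmup_times, inference_times = [], []
    for i in range(warmup_runs + runs):
        start = time.time()
        run_inference()
        elapsed = time.time() - start
        (warmup_times if i < warmup_runs else inference_times).append(elapsed)
    return {
        "avg warmup inference time": sum(warmup_times) / max(len(warmup_times), 1),
        "avg inference time": sum(inference_times) / len(inference_times),
    }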

We can see that when stable-fast is enabled, the image-to-video pipeline with SVD is ~26% faster after the warmup runs, while also reserving additional VRAM (~1.4 GiB more reserved in this run).

Opportunities

A few (non-exhaustive) opportunities that the community might be interested in independently exploring:

  • Experiment with --batch_size. This parameter controls the number of outputs in a batch generated by a diffusion pipeline (see this for more information; a rough illustration of the underlying diffusers call is sketched after this list). Using higher batch sizes for inference usually has the potential to increase throughput at the cost of higher VRAM consumption, but interestingly the tradeoff didn't seem worth it in my early tests discussed here. Does this hold true for all diffusion models? Is this inherent to diffusion models or due to some quirk of the diffusers library?
  • Are there any other optimizations available that are either additive to or better than stable-fast (maybe DeepCache?)? Try them out and share the benchmarks!
  • Compare benchmark metrics for GPUs with existing data on running diffusion models with different GPUs.
  • Share the benchmark metrics for GPUs that you have access to.
  • Fork/improve the benchmarking script with any other relevant metrics.
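
On the --batch_size point above: my understanding is that it maps onto the batching support in diffusers pipelines, likely via something like num_images_per_prompt (the exact parameter bench.py passes through is an assumption on my part). A rough illustration with sd-turbo:

import torch
from diffusers import AutoPipelineForText2Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16
).to("cuda")

# One batched call producing 4 outputs; peak VRAM grows with the batch size.
images = pipeline(
    prompt="a photo of an astronaut riding a horse",
    num_images_per_prompt=4,
    num_inference_steps=1,  # sd-turbo is designed for single-step inference
    guidance_scale=0.0,
).images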

What else could be useful/interesting? Share below!

4 Likes

Did my first benchmark with my RTX 4070 (laptop), AMD Ryzen 9 7940HS, 16.0 GB RAM, NVME SSD, Windows 11.

docker run --gpus 0 -v C:\local\models\models:/models livepeer/ai-runner:latest python bench.py --pipeline text-to-image --model_id stabilityai/sd-turbo --runs 3

Results:

----AGGREGATE METRICS----


pipeline load time: 134.418s
pipeline load max GPU memory allocated: 2.419GiB
pipeline load max GPU memory reserved: 2.465GiB
avg inference time: 0.763s
avg inference time per output: 0.763s
avg inference max GPU memory allocated: 3.023GiB
avg inference max GPU memory reserved: 3.609GiB

Will mess around with the other commands and models as well.
@yondon do you know what is considered a good benchmark? I see mine is quite different than yours. Or are we still in the exploring phase of finding out what’s possible?
Good work so far! :partying_face:

2 Likes

Just ran my benchmark, GPU was running HOT HOT HOT :fire:

docker run --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

----AGGREGATE METRICS----


pipeline load time: 1.589s
pipeline load max GPU memory allocated: 2.420GiB
pipeline load max GPU memory reserved: 2.480GiB
avg inference time: 0.266s
avg inference time per output: 0.266s
avg inference max GPU memory allocated: 3.024GiB
avg inference max GPU memory reserved: 3.625GiB

`docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt`

pipeline load time: 1.997s
pipeline load max GPU memory allocated: 4.283GiB
pipeline load max GPU memory reserved: 4.543GiB
avg warmup inference time: 61.645s
avg warmup inference time per output: 61.645s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 15.219GiB
avg inference time: 47.712s
avg inference time per output: 47.712s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 15.219GiB


+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4080        Off | 00000000:0B:00.0 Off |                  N/A |
| 33%   56C    P2             294W / 320W |  15373MiB / 16376MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
# Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
    0    285     56      -    100     53      0      0      0      0  10802   2595
    0    295     52      -    100     41      0      0      0      0  10802   2475
    0    284     53      -     99     50      0      0      0      0  10802   2595
    0    237     48      -    100     72      0      0      0      0  10802   2745
    0    245     46      -    100     68      0      0      0      0  10802   2760
    0    236     49      -    100     47      0      0      0      0  10802   2745
    0    217     44      -     83     64      0      0      0      0  10802   2760
    0     70     49      -     32     14      0      0      0      0  10802   2745
    0    257     56      -    100     37      0      0      0      0  10802   2505
    0    288     54      -    100     56      0      0      0      0  10802   2625
    0    294     54      -    100     43      0      0      0      0  10802   2595
    0    282     57      -    100     46      0      0      0      0  10802   2640
    0    295     55      -    100     53      0      0      0      0  10802   2565
    0    284     56      -    100     56      0      0      0      0  10802   2520
    0    287     54      -    100     56      0      0      0      0  10802   2745
    0    296     54      -    100     37      0      0      0      0  10802   2445
    0    284     55      -    100     48      0      0      0      0  10802   2730
    0    301     53      -    100     52      0      0      0      0  10802   2700
    0    287     57      -    100     54      0      0      0      0  10802   2475
    0    294     58      -    100     54      0      0      0      0  10802   2565
    0    293     56      -    100     39      0      0      0      0  10802   2520
    0    289     55      -    100     51      0      0      0      0  10802   2550
    0    294     55      -    100     54      0      0      0      0  10802   2475
    0    283     55      -    100     52      0      0      0      0  10802   2595
    0    295     54      -    100     48      0      0      0      0  10802   2745
    0    291     57      -    100     38      0      0      0      0  10802   2445
    0    284     52      -    100     50      0      0      0      0  10802   2745
    0    299     57      -    100     41      0      0      0      0  10802   2550
    0    286     57      -    100     52      0      0      0      0  10802   2670
    0    295     57      -    100     52      0      0      0      0  10802   2640
    0    287     58      -    100     58      0      0      0      0  10802   2535
    0    291     57      -    100     55      0      0      0      0  10802   2640
    0    297     55      -    100     36      0      0      0      0  10802   2475
    0    283     56      -    100     44      0      0      0      0  10802   2745
    0    297     55      -    100     50      0      0      0      0  10802   2745
    0    284     56      -    100     54      0      0      0      0  10802   2505
    0    286     53      -    100     55      0      0      0      0  10802   2550
    0    293     57      -    100     37      0      0      0      0  10802   2475
    0    286     55      -    100     45      0      0      0      0  10802   2595
    0    294     57      -    100     52      0      0      0      0  10802   2520
    0    282     57      -    100     52      0      0      0      0  10802   2625
    0    289     54      -    100     50      0      0      0      0  10802   2745
    0    289     54      -    100     42      0      0      0      0  10802   2385
    0    288     52      -    100     46      0      0      0      0  10802   2745

> I see mine is quite different than yours.

Your benchmark is for the text-to-image pipeline with the stabilityai/sd-turbo model, while my benchmark in the OP is for the image-to-video pipeline with the stabilityai/stable-video-diffusion-img2vid-xt model, which explains the significant difference in metrics. Generally, video models will be slower and will consume more VRAM than image models.

> do you know what is considered a good benchmark?

Still collecting data at this point.

It would be helpful for others if the command used (including the configuration, which indicates the pipeline and model ID) were shared as well!

1 Like

Benchmark with 4070 Ti Super 16GB, 32GB RAM, Ubuntu 22.04, AMD Ryzen 7

docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

----AGGREGATE METRICS----


pipeline load time: 1.289s
pipeline load max GPU memory allocated: 4.283GiB
pipeline load max GPU memory reserved: 4.543GiB
avg warmup inference time: 72.813s
avg warmup inference time per output: 72.813s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 15.219GiB
avg inference time: 58.279s
avg inference time per output: 58.279s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 15.219GiB
sudo docker run -e SFAST=true --gpus all -v /models:/models livepeer/ai-runner:latest python bench.py --pipeline text-to-image --model_id stabilityai/sd-turbo --runs 3

----AGGREGATE METRICS----


pipeline load time: 1.351s
pipeline load max GPU memory allocated: 2.476GiB
pipeline load max GPU memory reserved: 2.588GiB
avg warmup inference time: 2.440s
avg warmup inference time per output: 2.440s
avg warmup inference max GPU memory allocated: 2.869GiB
avg warmup inference max GPU memory reserved: 3.064GiB
avg inference time: 0.051s
avg inference time per output: 0.051s
avg inference max GPU memory allocated: 2.869GiB
avg inference max GPU memory reserved: 3.064GiB

I initially got the same results using Docker Desktop on Windows 11.

Everything became normal using Docker in Ubuntu outside of a VM.

Benchmark with 4090, 64GB RAM, Ubuntu 22.04, Intel 14900

docker run -e SFAST=true --gpus 0 -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

----AGGREGATE METRICS----

pipeline load time: 1.145s
pipeline load max GPU memory allocated: 4.286GiB
pipeline load max GPU memory reserved: 4.523GiB
avg warmup inference time: 41.376s
avg warmup inference time per output: 41.376s
avg warmup inference max GPU memory allocated: 13.325GiB
avg warmup inference max GPU memory reserved: 20.818GiB
avg inference time: 32.068s
avg inference time per output: 32.068s
avg inference max GPU memory allocated: 13.325GiB
avg inference max GPU memory reserved: 20.818GiB

docker run -e SFAST=true --gpus 0 -v /models:/models livepeer/ai-runner:latest python bench.py --pipeline text-to-image --model_id stabilityai/sd-turbo --runs 3

----AGGREGATE METRICS----

pipeline load time: 1.363s
pipeline load max GPU memory allocated: 2.475GiB
pipeline load max GPU memory reserved: 2.590GiB
avg warmup inference time: 1.772s
avg warmup inference time per output: 1.772s
avg warmup inference max GPU memory allocated: 2.868GiB
avg warmup inference max GPU memory reserved: 3.068GiB
avg inference time: 0.032s
avg inference time per output: 0.032s
avg inference max GPU memory allocated: 2.868GiB
avg inference max GPU memory reserved: 3.068GiB

docker run -e SFAST=true --gpus 0 -v /models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-image --model_id stabilityai/sd-turbo --runs 3

----AGGREGATE METRICS----

pipeline load time: 0.913s
pipeline load max GPU memory allocated: 2.476GiB
pipeline load max GPU memory reserved: 2.588GiB
avg warmup inference time: 2.540s
avg warmup inference time per output: 2.540s
avg warmup inference max GPU memory allocated: 4.194GiB
avg warmup inference max GPU memory reserved: 5.598GiB
avg inference time: 0.151s
avg inference time per output: 0.151s
avg inference max GPU memory allocated: 4.194GiB
avg inference max GPU memory reserved: 5.598GiB

4060 Ti 16GB
Intel i7-10700, 32GB RAM - one core is pinned at 100% most of the time inference is running
Ubuntu 20.04

docker run -e SFAST=true --gpus '"device=1"' -v ./models:/models livepeer/ai-runner:latest python bench.py --pipeline image-to-video --model_id stabilityai/stable-video-diffusion-img2vid-xt

----AGGREGATE METRICS----

pipeline load time: 2.691s
pipeline load max GPU memory allocated: 4.286GiB
pipeline load max GPU memory reserved: 4.559GiB
avg warmup inference time: 136.927s
avg warmup inference time per output: 136.927s
avg warmup inference max GPU memory allocated: 13.324GiB
avg warmup inference max GPU memory reserved: 15.203GiB
avg inference time: 113.575s
avg inference time per output: 113.575s
avg inference max GPU memory allocated: 13.324GiB
avg inference max GPU memory reserved: 15.203GiB

sudo docker run -e SFAST=true --gpus '"device=1"' -v /models:/models livepeer/ai-runner:latest python bench.py --pipeline text-to-image --model_id stabilityai/sd-turbo --runs 3

Increasing batch size up to 20 on the sdxl-turbo provides about 7s per output.

----AGGREGATE METRICS----

pipeline load time: 46.385s
pipeline load max GPU memory allocated: 2.476GiB
pipeline load max GPU memory reserved: 2.588GiB
avg warmup inference time: 3.643s
avg warmup inference time per output: 3.643s
avg warmup inference max GPU memory allocated: 2.870GiB
avg warmup inference max GPU memory reserved: 3.064GiB
avg inference time: 0.089s
avg inference time per output: 0.089s
avg inference max GPU memory allocated: 2.870GiB
avg inference max GPU memory reserved: 3.064GiB

I see many people are posting results of some runs of this script against one or two of the models. This is certainly helpful anecdotally, but the question remains - how can we make the output of these benchmarks useful to current or aspiring Orchestrators?

My thought is that in the future O's will be making hardware investment decisions based upon their ability to perform inference well enough to retain work on the network. And for this, they'll want glanceable answers showing inference times and memory usage across different models, cards, and VRAM combinations.

I also think a Livepeer Open Network grant would certainly be available if anyone wants to extend this benchmarking script to do the following.

  1. In one simple run, produce benchmarks for ALL of the supported models, rather than requiring the user to just run it arbitrarily for one or two models and leave out the rest.
  2. Produce the output in a parseable format, such as CSV, so that scripts could be written to analyze the data; a rough sketch of such a wrapper is included at the end of this post. (Consider including some standardization of the cards + VRAM configs in the output so similar results can be compared. Consider an easy optional “submission” of this output to the community, such as automatically posting the CSV output to some collector endpoint or even just emailing it to yourself as the aggregator.)
  3. Organize some easily visible table on a wiki somewhere that shows the average benchmarks for each model under each card/VRAM combo. Maintain this and update it with some frequency.
  4. Generally improve the benchmarking script beyond the initial implementation wherever it can be better or make more optimized use of the hardware, so it remains representative of how O's would set up to run inference as the Livepeer implementation improves.

I think this organized benchmark resource would be actionable and valuable to O’s in the community. Anyone agree or disagree, or want to take this on?
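
To make points 1 and 2 concrete, a wrapper along these lines could loop over models and dump a CSV. This is only a rough sketch: the model list is abbreviated, and the parsing assumes the aggregate metrics land on stdout in the format shown earlier in the thread.

import csv
import re
import subprocess

# Pipeline/model pairs to cover; extend this with the full list from dl_checkpoints.sh.
BENCHMARKS = [
    ("text-to-image", "stabilityai/sd-turbo"),
    ("image-to-video", "stabilityai/stable-video-diffusion-img2vid-xt"),
]

METRIC_RE = re.compile(
    r"^(pipeline load time|avg inference time|avg inference max GPU memory allocated): ([\d.]+)(s|GiB)$"
)

with open("benchmarks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["pipeline", "model_id", "metric", "value", "unit"])
    for pipeline, model_id in BENCHMARKS:
        cmd = [
            "docker", "run", "--gpus", "0", "-v", "./models:/models",
            "livepeer/ai-runner:latest", "python", "bench.py",
            "--pipeline", pipeline, "--model_id", model_id,
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        for line in result.stdout.splitlines():
            match = METRIC_RE.match(line.strip())
            if match:
                writer.writerow([pipeline, model_id, *match.groups()])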

5 Likes

Hi everyone :wave:

So I’ve created a spreadsheet where we can keep track of our benchmarking.

Just head to the two tabs on the bottom and fill in your results for avg inference time and avg inference max GPU memory allocated.

This benchmarking sheet includes all 6 models, for a total of 22 benchmarks, with each model benchmarked once with the SFAST flag and once without.

  • sd-turbo
  • sdxl-turbo
  • stable-diffusion-v1-5
  • stable-diffusion-xl-base-1.0
  • openjourney-v4
  • stable-video-diffusion-img2vid-xt

Here is a copy of the benchmark scripts to run. Just replace the path in the -v flag with the local storage volume that contains your models.

It is quite a bit of work to do this. I may be able to write a script to execute all of these and automatically put them into the spreadsheet format, but we can just start with this.

Hopefully we can get some solid data to work on :+1:

1 Like

Hi @yondon, is there a flag to adjust batch size?
My 4070 laptop is running out of memory on image-to-video inference.

Yep, there is a --batch_size flag.

Also just fyi, if you don’t have a big enough power supply I believe the benchmarking will shut down your machine. Pretty heavy stuff!

The spreadsheet is really useful… Maybe we can do something similar if/when AV1 transcoding becomes a thing. I've added results for the RTX 4080, with some 4060 Ti 16GB cards arriving later in the week to run some tests on.

At the last water cooler, people mentioned keeping models warm in VRAM. It seems that 16GB cards might have some issues with keeping 2 models loaded simultaneously. The 3090, with its 24GB of VRAM, might be a really effective choice in that sense, as it's relatively cheap secondhand and has that extra headroom.

1 Like

Hi all,

Just some initial takeaways about GPU RAM.

It looks like if your card maxes out its dedicated RAM, it will pull from your shared GPU RAM, which is the RAM on your system.

Once it pulls RAM from your system, it slows down the results dramatically, between 14x and 102x.

For example, DrewTTT's RTX 3080 (10GB) card was 67x slower than Papa Bear's RTX 4090 (24GB) on image-to-video because it borrowed 1.9GB of allocated memory and 6.32GB of reserved memory from the system.

Also, if the system cannot share enough RAM to meet the reserved GPU RAM, it just throws an out-of-memory error.

My prediction is that this increase in rendering time will be too slow for our network (unless people don't mind if things take 1 week to render?).

Is there a way we can restrict the docker image or model to using dedicated memory only?

Alright a couple more updates.

I've altered the text-to-image prompt to the same text as Sora's example, which I think is about 77 weights, up from 2(ish?).

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

The VRAM allocation jumped from 3.02GB to 5.78GB. —> Up 194%


And I increased the size of the image-to-image initial test image by 1.62x (1.36MB to 2.2MB).

The VRAM allocation jumped from 4.80GB to 12.85GB —> Up 267%

And with a very large image, 34.9MB, the allocation went up to 26.79GB. —> Up 558%


Based on these findings, it looks like user input is a major factor in whether a GPU can handle the task. And as mentioned earlier, if the GPU starts utilizing shared GPU RAM then the process takes a long time (14x - 102x longer).

Next I'll look into the recommended memory management options, such as max_split_size_mb, to see if we can better allocate resources once the model has determined how much VRAM needs to be allocated.
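
For reference, max_split_size_mb is an option of PyTorch's caching allocator rather than a function call: it is set through an environment variable before any CUDA memory is allocated. A minimal sketch (the 512 value is just a starting point to experiment with):

import os

# Must be set before the first CUDA allocation; alternatively pass it with docker run -e ...
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after setting the env var so the allocator picks it up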

Wish me luck! :smile:

5 Likes

Ok another update…

So it looks like the increased RAM usage was not due to the change in prompts, but rather a coincidence that something else changed the same day I was testing different prompts.

I cannot find what changed, but when testing the default prompts back to back against the new, larger prompts, the RAM requirements are exactly the same.

The text-to-image 194% increase and the image-to-image 267% increase seem to be the new default.

In conclusion (for now)

  1. User input does not affect RAM usage, which takes a lot of guessing out of benchmarking and job capabilities (phew!)
  2. But RAM requirements just doubled… Maybe the models changed? Maybe a setting changed in a dependent package? Still looking for it.

So yes, many GPUs below 16GB just got wiped out of most jobs. And even @papa_bear's 4090 cannot do image-to-video inference on its own with these default settings.

I'm currently investigating two things.

  1. Can we use the Windows shared GPU memory mechanism to add a significant amount of RAM to GPUs so they have the ability to do these tasks (even if they take a very long time)?
  2. Can we combine GPUs in parallel to complete these tasks?

In addition, I am trying to find the pain points for end users.
The typical trilemma likely exists here: Speed vs Quality vs Cost
Feel free to vote in the X poll and give your feedback.
This will help guide efforts on how to go about optimizing GPU configurations.

EDIT: MARCH 12 2024

In current testing, it looks like @rickstaa has also identified a memory issue where models may still be stuck in GPU memory while running a new test.

I ran a full reboot of the H100 and included calls to torch.cuda.empty_cache(), torch.cuda.reset_max_memory_allocated(), and torch.cuda.reset_peak_memory_stats(), with no luck. I still cannot determine the cause of the memory increase that persists between tests.
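
For context on those calls, the usual pattern for getting a clean peak-memory reading between tests looks roughly like the sketch below. Note that none of these calls free memory held by objects that are still referenced (e.g. a pipeline that was never deleted), which may be what is happening here:

import gc
import torch

def reset_gpu_memory_tracking():
    gc.collect()                          # drop unreachable Python objects first
    torch.cuda.empty_cache()              # return cached, unused blocks to the driver
    torch.cuda.reset_peak_memory_stats()  # zero the max_memory_allocated/reserved counters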

In other news, I have identified two functions that lower memory usage for low-VRAM GPUs: we are able to get image-to-video down to 4GB of VRAM with enable_sequential_cpu_offload() and unet.enable_forward_chunking() enabled, and image-to-image down to under 8GB of VRAM with enable_sequential_cpu_offload() alone.
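
For anyone who wants to reproduce this, both options are standard diffusers calls. A minimal sketch for the SVD image-to-video pipeline (the input image and the decode_chunk_size value are placeholders):

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)

# Stream weights between CPU and GPU instead of keeping the whole model in VRAM.
pipeline.enable_sequential_cpu_offload()
# Chunk the UNet's feed-forward layers to reduce peak activation memory.
pipeline.unet.enable_forward_chunking()

image = load_image("input.png")  # placeholder conditioning image
frames = pipeline(image, decode_chunk_size=2).frames[0]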

The only downside is inference time went from 24.342 seconds to 137.617 seconds, a 5.7x increase in time.

But this would allow basically any GPU on the Livepeer network today to be able to handle image-to-video tasks.

I will look at incorporating automatic LOW-VRAM logic to allow the script to run seamlessly across all devices. However, we still need to fix these random spikes of VRAM that could affect the benchmarking.

3 Likes

Being able to include existing consumer GPUs on the network in AI work is so important that I cannot put it into words. Speed might become important later for some people, but I really don't think it will be for a very long time. What matters far more is that at least a decent number of Os can accept inference jobs.
This is the general concept of Livepeer that I repeat to everyone around me: “We have the GPUs. They are practically free. You don't need to pay so much to Web2 clouds.” If nodes cannot provide the necessary hardware, then there will be no free market and no competition on price. The entire logic will not apply, and so ultimately Livepeer will not even be cheaper than Web2 clouds for AI jobs.
So, thank you!!