AI Video Compute Technical Update 3/18/24

Summary

The focus since the last update has been on completing the engineering milestones required for an e2e paid AI inference job execution workflow in which a broadcaster sends a request along with a payment ticket to an orchestrator, and the orchestrator processes the payment (redeeming any winning tickets it receives) and executes the job.

Highlights include:

  • Implemented an updated selection workflow that takes into account the models advertised by orchestrators per capability and whether the orchestrator has the model weights “warm” (i.e. loaded into GPU VRAM).
  • Implemented a payment workflow for the text-to-image, image-to-image and image-to-video capabilities, which is demoed here.

Updates

Capability and model aware selection

In the last update, we noted that orchestrators are able to advertise supported models with capability constraints. However, at the time, a broadcaster only filtered orchestrators based on whether they supported a model and whether the model was warm - it did not use this information to prioritize orchestrators during selection. Furthermore, the broadcaster previously used a naive round robin strategy to decide which orchestrators to send a request to.

As of this update, the following improvements have been made to the broadcaster implementation:

  • The broadcaster will first select from the pool of orchestrators that have the model warm and only then select from the pool of orchestrators that do not have the model warm.
  • The broadcaster will use the same selection strategy used for transcoding, which considers stake, price and latency, when selecting orchestrators [1].

[1] The selection strategy has been a topic of debate within the community, so for now the intent is to leave it unchanged as it pertains to AI capabilities and to address improvements to the strategy (whether tweaks to weights, algorithms or even the introduction of a more modular system) separately.
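
For illustration, here is a minimal sketch of the two-tier (warm pool first, then cold pool) selection described above. The names used (orch, selectBest, selectForModel) are hypothetical and selectBest is just a placeholder for the existing stake/price/latency strategy; the actual go-livepeer implementation differs.

package main

import "fmt"

// orch is a hypothetical view of an orchestrator's advertised capabilities;
// the real go-livepeer structs differ.
type orch struct {
    Addr    string
    Warm    bool    // model weights already loaded into GPU VRAM
    Stake   float64 // stand-ins for the stake/price/latency selection inputs
    Price   float64
    Latency float64
}

// selectBest stands in for the existing stake/price/latency weighted
// selection strategy used for transcoding; details omitted here.
func selectBest(pool []orch) *orch {
    if len(pool) == 0 {
        return nil
    }
    return &pool[0] // placeholder: the real strategy weighs stake, price and latency
}

// selectForModel prefers orchestrators that have the model warm and only
// falls back to the cold pool if the warm pool is empty.
func selectForModel(orchs []orch) *orch {
    var warm, cold []orch
    for _, o := range orchs {
        if o.Warm {
            warm = append(warm, o)
        } else {
            cold = append(cold, o)
        }
    }
    if o := selectBest(warm); o != nil {
        return o
    }
    return selectBest(cold)
}

func main() {
    pool := []orch{{Addr: "https://orch-a:8935", Warm: false}, {Addr: "https://orch-b:8935", Warm: true}}
    fmt.Println(selectForModel(pool).Addr) // https://orch-b:8935 (warm pool wins)
}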

AI capability pricing

As of this update, the ai-video branch of go-livepeer allows an orchestrator to advertise a price per pixel for each capability + model ID pair - for example, it might charge X for text-to-image with stabilityai/sdxl-turbo and Y for image-to-video with stabilityai/stable-video-diffusion-img2vid-xt. The prices are set in a config file (the same file used to specify the supported models) passed via the -aiModels flag, which looks like this:

[
  {
    "pipeline": "image-to-video",
    "model_id": "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    "price_per_unit": 3390842
  },
  {
    "pipeline": "text-to-image",
    "model_id": "stabilityai/sdxl-turbo",
    "price_per_unit": 4768371
  },
  {
    "pipeline": "image-to-image",
    "model_id": "stabilityai/sdxl-turbo",
    "price_per_unit": 4768371
  }
]
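
For reference, a minimal sketch of how a config like the one above could be loaded in Go. The struct and function names here (AIModelConfig, loadAIModels) and the file path are hypothetical; the actual go-livepeer types may differ.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
)

// AIModelConfig mirrors one entry of the JSON file shown above.
type AIModelConfig struct {
    Pipeline     string `json:"pipeline"`
    ModelID      string `json:"model_id"`
    PricePerUnit int64  `json:"price_per_unit"`
}

// loadAIModels reads and parses the file passed via -aiModels.
func loadAIModels(path string) ([]AIModelConfig, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var configs []AIModelConfig
    if err := json.Unmarshal(data, &configs); err != nil {
        return nil, err
    }
    return configs, nil
}

func main() {
    configs, err := loadAIModels("aiModels.json") // hypothetical path
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%+v\n", configs)
}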

The price varies per capability because the compute cost of generating an image can differ from that of generating a video. The price also varies per model ID because the compute cost of using one model can differ from the cost of using another for the same capability (see SD1.5 vs. SDXL).

The compute cost of a capability + model ID can also be influenced by request parameters such as the output resolution. The current implementation accounts for the output resolution by calculating the payment required for a request based on the following formula:

output_pixels = output_height * output_width * output_frames
payment = output_pixels * price_per_pixel

Generally, if a request generates more pixels, the fee for the request increases. For example, a text-to-image request for a 1024x1024 image will cost more than a text-to-image request for a 512x512 image. And an image-to-video request for a 576x1024 video with 25 frames will cost more than an image-to-video request for a 576x1024 video with 14 frames.
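
As a rough worked example of the formula above, using the price_per_unit values from the sample config and assuming (as with transcoding pricing) that the unit is wei per pixel:

package main

import "fmt"

// fee applies payment = output_pixels * price_per_pixel.
func fee(width, height, frames, pricePerPixel int64) int64 {
    return width * height * frames * pricePerPixel
}

func main() {
    // text-to-image with stabilityai/sdxl-turbo at 4768371 wei/pixel
    fmt.Println(fee(1024, 1024, 1, 4768371)) // 1024x1024 image
    fmt.Println(fee(512, 512, 1, 4768371))   // 512x512 image costs 4x less
    // image-to-video with stable-video-diffusion at 3390842 wei/pixel
    fmt.Println(fee(1024, 576, 25, 3390842)) // 576x1024 video, 25 frames
    fmt.Println(fee(1024, 576, 14, 3390842)) // 576x1024 video, 14 frames
}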

There may be other request parameters that could influence the compute cost of a request. At the moment, these parameters are not yet factored into the pricing of requests, and many of them are not yet adjustable by users. The current implementation is just a starting point and the intent is for it to evolve over time to more accurately capture the costs incurred for using a model.

The demo also references a pretty rough pricing worksheet that was used to derive a price per pixel to charge per capability + model ID. The methodology used in the worksheet was:

  • Get the price per request for text-to-image and image-to-video charged by SaaS APIs (specifically Together.AI and Stability.AI)
  • Assume a specific resolution (and # of frames for video) for the output and a # of inference/denoising steps (note: the # of inference steps is not factored into pricing right now and is whatever the diffusers library sets as the default)
  • Divide the price per request by the number of output pixels in order to get a reference price per pixel
  • Use that price per pixel for a capability + model ID pair
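
As a purely hypothetical worked example of that methodology (the dollar figures and ETH conversion rate below are made up for illustration and are not taken from the worksheet):

package main

import "fmt"

func main() {
    // Hypothetical SaaS reference price for one 1024x1024 text-to-image request.
    pricePerRequestUSD := 0.01
    outputPixels := float64(1024 * 1024)

    // Reference price per pixel in USD.
    usdPerPixel := pricePerRequestUSD / outputPixels

    // Hypothetical ETH price used to convert into wei per pixel
    // (1 ETH = 1e18 wei), matching the units of price_per_unit above.
    usdPerETH := 3000.0
    weiPerPixel := usdPerPixel / usdPerETH * 1e18

    fmt.Printf("%.0f wei per pixel\n", weiPerPixel) // ~3.2e6 wei/pixel
}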

As mentioned earlier, the pricing implementation will need to be iterated on over time, and the methodology orchestrators use to determine how to price a capability + model ID could use more thought too! The community is welcome not only to play around with the worksheet and improve on it, but also to use it as a jumping off point for improvements to how pricing could work overall.

AI capability payments

As of this update, a broadcaster will use the pricing implementation described in the previous section to create a payment with N tickets such that the cumulative ticket EV (expected value) covers the fee for a request. The payment is then processed by the orchestrator in the same way that transcoding payments are processed today - the implementation re-uses the existing probabilistic micropayment system.
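
A minimal sketch of that ticket-count calculation, assuming a ticket's EV is its face value times its win probability; the values and function below are hypothetical and the actual go-livepeer ticket parameter handling is more involved.

package main

import (
    "fmt"
    "math/big"
)

// numTickets returns how many tickets are needed so that the cumulative
// expected value (faceValue * winProb per ticket) covers the request fee.
func numTickets(fee, faceValue *big.Int, winProb *big.Rat) int64 {
    ev := new(big.Rat).Mul(new(big.Rat).SetInt(faceValue), winProb)
    n := new(big.Rat).Quo(new(big.Rat).SetInt(fee), ev)
    // Round up: a fractional ticket still has to be sent as a whole ticket.
    tickets := new(big.Int).Div(n.Num(), n.Denom())
    if new(big.Int).Mod(n.Num(), n.Denom()).Sign() != 0 {
        tickets.Add(tickets, big.NewInt(1))
    }
    return tickets.Int64()
}

func main() {
    fee := big.NewInt(5_000_000_000_000)       // ~fee for a 1024x1024 text-to-image request at the sample price
    faceValue := big.NewInt(1_000_000_000_000) // hypothetical ticket face value in wei
    winProb := big.NewRat(1, 10)               // hypothetical 10% win probability
    fmt.Println(numTickets(fee, faceValue, winProb)) // 50 tickets
}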

Next Up

  • Design a solution for mainnet orchestrators to advertise a separate service URI that can be used on the subnet
  • Design a solution for collecting metrics on the subnet
  • Testing with the most recent version of the ai-video branch of go-livepeer now that there are basic implementations of all the required components to complete an e2e workflow

Great job @yondon!
This is huge progress. I’m glad to see the re-use of current systems (tickets/payments, selection algo, etc.) to speed up the deployment of AI jobs.

The one thing I would like to figure out is the acceptable time to return jobs in the selection algo. Currently the transcoding algo enforces pretty strict results: segments need to be returned in real time (or close to real time for VOD).

My question arises from my testing with sequential CPU offloading. If orchs want to utilize low-VRAM cards, what is the cutoff for switching between orchs?

Based on my testing an 8GB card can do image-to-video with a 5-10x longer timeframe.

As a concrete example, @papa_bear’s 4090 can do image-to-video in 50 seconds, while my 4060 (8GB) card can do it in 517 seconds.

I know this is a hard question to answer, so maybe a good place to start is with @huangkuan and what he thinks an acceptable UX time would be for the Grove app?

Happy to hear any thoughts.

For our use case, the shorter the wait time, the better. I don’t think people are going to wait 500+ seconds for a meme gif/mp4 video. Ideally, the wait time I would like to see is sub 30 seconds. We won’t be able to find the actual threshold of people’s tolerance until after the launch of the product.

Good to know! Do you know how you would handle timing out an Orch or monitoring whether the Orch is actively doing the work? I know this is more of a question for @yondon, but I’m just thinking through the process of how GPUs would be configured.

Here are a few scenarios.

  1. You want results to be returned in 30 seconds but no orchs are actually able to return results that fast. Do you send out the request, wait for an orch to do 50% of the work and then give up on them after 30 seconds just to switch to a different orch that once again can’t do it in 30 seconds, leaving the job in an endless loop of work not getting done? Would we keep extending the timeout until an orch can do it? By that point it’s been much longer than 30 seconds.

  2. You want the work done in 30 seconds but the orch that receives your job takes 500 seconds. How do we keep track of which orchs perform within our time horizon for subsequent jobs?

  3. A GPU is already on another job. Are we going to wait until the GPU has finished the current job before accepting a new one?

  4. Time is no factor for the UX of a particular app. Would we then offer a lower price for slower jobs and allow that app to tap into super cheap compute?

I know these are rapid fire questions and a frontend app shouldn’t have these concerns - they should be abstracted away by the network. But there may be a price factor involved with getting results faster, or a need for the app to explicitly tell the network how much VRAM a job requires and have that factored into the selection algo.


I agree that there’s an opportunity on the network for nodes to advertise some of the key info, such as available VRAM, and to price differently based on those requirements. I’d imagine some apps with low-latency generation requirements would pay a lot more for fast response times, while apps that don’t have that requirement would rather pay less for lower-end cards to do the work async.

Your questions about the O timeouts, failovers, and the ultimate potential designs for bringing this to the one unified Livepeer network (after learning on a more experimental subnetwork first) are probably all future work. And I know @rickstaa will be heavily involved in experimenting with that stuff, even on the subnet.


Agreed, we are going to need some experiments before we dig into deep questions like this.

As for the subnet testing, we should likely allow only GPUs with 24GB of VRAM or higher, just to get a sense of what an “ideal” network would look like. Adding low-VRAM GPUs will be a future effort.

I’ll echo this comment, but here are also some personal initial reactions on things to explore after starting subnet testing:

  1. The current go-livepeer implementation involves a simple request-response between the B & O, which means it has the problem you described: a B risks waiting too long for an O that is slow to complete a job. To address this, I think an improvement worth exploring is an interactive protocol between the B & O in which the O sends an update message (or an intermediate output, which could be useful for showing “preview” images/videos in a UI before the final output is ready) back to the B every N inference steps, and the B pays the O for every N inference steps. The B could then measure the speed of the O in terms of # of inference steps/sec and incorporate that metric into its selection decisions (a rough sketch of this follows after this list).

  2. See above.

  3. The current go-livepeer implementation only allows 1 job on a GPU at a time, so the current job needs to complete before the next one can run. There could be improvements worth exploring here, such as job batching, i.e. waiting up to N seconds to generate M images in a single run on the GPU (although some light testing previously seemed to indicate that batching with diffusion models has limited throughput gains - though it’s always possible that something was missing from those tests).

  4. I think that would be interesting to explore.
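
To make item 1 a bit more concrete, here is a rough sketch of the kind of update message an O could stream back and how a B might derive an inference steps/sec metric from it. The type and function names are hypothetical and not part of any real go-livepeer API.

package main

import (
    "fmt"
    "time"
)

// InferenceUpdate is a hypothetical message an O would send back to the B
// every N inference steps, optionally carrying an intermediate "preview".
type InferenceUpdate struct {
    StepsCompleted int
    TotalSteps     int
    PreviewJPEG    []byte // optional intermediate output for UI previews
    SentAt         time.Time
}

// stepsPerSec estimates an O's denoising speed from two consecutive updates,
// a metric the B could feed into its selection decisions.
func stepsPerSec(prev, cur InferenceUpdate) float64 {
    elapsed := cur.SentAt.Sub(prev.SentAt).Seconds()
    if elapsed <= 0 {
        return 0
    }
    return float64(cur.StepsCompleted-prev.StepsCompleted) / elapsed
}

func main() {
    prev := InferenceUpdate{StepsCompleted: 5, TotalSteps: 25, SentAt: time.Now()}
    cur := InferenceUpdate{StepsCompleted: 10, TotalSteps: 25, SentAt: prev.SentAt.Add(2 * time.Second)}
    fmt.Printf("%.1f steps/sec\n", stepsPerSec(prev, cur)) // 2.5 steps/sec
}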
