Summary
The focus since the last update has been on finishing research prototyping in order to begin development on the subnet deliverables for the AI Video subnet MVP proposed by the AI Video SPE.
Highlights include:
- Prototyped a container for running SVD and frame interpolation to generate ~2s 24fps videos matching the capability of the Stability AI API.
- Reduced disk space requirements for containerized job execution by building a single Docker image with a “runner” app that loads models from shared volumes and can be used to run containers for different model pipelines.
- Created API specs to make it easier to set up the communication flow between broadcaster, orchestrator and containers using codegen tools.
More updates below.
Updates
AI Job Types
The AI Video subnet MVP will support the following jobs:
- text-to-image with a predetermined set of base and custom SD models and LoRAs
- image-to-image with a predetermined set of base and custom SD models and LoRAs
- image-to-video with the SVD models and frame interpolation using FILM
- video-to-video upscaling with ESRGAN
The set of base and custom SD models supported for text-to-image and image-to-image will be determined based on 1) the needs of the consumer app that the AI Video SPE is building and 2) the development effort required.
We intend to support limited customization for text-to-image and image-to-image by supporting a few custom (i.e. finetuned using Dreambooth) SD models and LoRAs so the consumer app can have some flexibility with using different aesthetic styles for generation. In the future, it could be valuable to add flexible support for additional models, but we are choosing to restrict the set of models to keep the scope of work for this milestone manageable. Additionally, at the moment, further customization with techniques like ControlNet and inpainting is not in scope, but can be considered later on depending on how progress on higher priority tasks is coming along.
Disk Space and Model Management
The containerized job execution demo from the last update used Cog to bundle a model, inference code and a REST API server into an image that could then be run as a container. A major downside of that approach was that we ended up with a large 20GB+ image per model and each image included the same large libraries (e.g. PyTorch, CUDA libraries, etc.). As a result, if you wanted to support inference for 5 models you could end up with 100GB+ of images - some of this space would be taken up by model weights, but much of it would be taken up by duplicated library code!
An alternative approach that has worked well is outlined below:
- A single Docker image with a runner app that bundles all inference code dependencies and that can be configured to set up a specific pipeline (e.g. text-to-image, image-to-video) based on the job that an orchestrator needs done.
- The image can be used to run multiple containers each with different pipelines running on separate GPUs.
- The container can read/write to a mounted volume acting as shared storage with the orchestrator so all model weights can be stored in a single place instead of being bundled into images.
The end result is a single Docker image that is ~5-10GB (size subject to change as development continues). The model weights still take up space which is unavoidable, but this setup should still be much better than having multiple Docker images with duplicated library dependencies.
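To make the runner idea more concrete, here is a minimal sketch of how one image could select a pipeline at container start and read weights from the shared volume. The environment variable names (`PIPELINE`, `MODEL_DIR`), the `/models` mount path and the loader functions are all hypothetical placeholders, not the actual runner app; a real loader would build a diffusion pipeline instead of returning a string.

```python
import os
from pathlib import Path
from typing import Callable, Dict

# Hypothetical loaders: stand-ins for code that would load real model
# weights from the shared volume and return a ready-to-use pipeline.
def load_text_to_image(model_dir: Path) -> str:
    return f"text-to-image pipeline using weights in {model_dir}"

def load_image_to_video(model_dir: Path) -> str:
    return f"image-to-video pipeline using weights in {model_dir}"

# One registry shared by every container started from the same image.
PIPELINES: Dict[str, Callable[[Path], str]] = {
    "text-to-image": load_text_to_image,
    "image-to-video": load_image_to_video,
}

def build_pipeline(env: Dict[str, str]) -> str:
    """Pick a pipeline from configuration (assumed env var names)."""
    name = env.get("PIPELINE", "")
    if name not in PIPELINES:
        raise ValueError(f"unknown pipeline: {name!r}")
    # Weights live on the mounted volume, not inside the image.
    model_dir = Path(env.get("MODEL_DIR", "/models"))
    return PIPELINES[name](model_dir)

if __name__ == "__main__":
    config = dict(os.environ)
    config.setdefault("PIPELINE", "text-to-image")  # default for the demo only
    print(build_pipeline(config))
```

Under these assumptions, an orchestrator would start one container per GPU from the same image, giving each a different `PIPELINE` value and mounting the same model directory into all of them.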
Sequential vs. Batched Inference
A common way to increase the inference throughput of a single GPU is to batch multiple requests so that they can be executed in parallel on the GPU. In practice, this typically involves creating a queue and pulling the next N requests from the queue every M seconds so the N requests can be sent to the GPU in a batch. The tradeoff here is that the latency of an individual request can end up increasing if the request needs to sit on a queue for a while before it is included in a batch.
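As a rough illustration of the queueing pattern described above, the sketch below collects up to N requests while waiting at most M seconds. This is a slight variant of "pull N every M seconds" (it returns early once the batch is full), and the function and parameter names are made up for illustration.

```python
import queue
import time
from typing import Any, List

def next_batch(requests: "queue.Queue[Any]", max_size: int, max_wait: float) -> List[Any]:
    """Collect up to max_size requests, waiting at most max_wait seconds in total."""
    batch: List[Any] = []
    deadline = time.monotonic() + max_wait
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time budget spent: send whatever we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

The latency cost is visible in `max_wait`: a request that arrives alone can sit in the queue for up to that long before it reaches the GPU.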
According to this blog post, the tradeoff between throughput and latency when using the HuggingFace Diffusers library for running diffusion pipelines (i.e. for SD and SVD models) is pretty bad, and the author opted to just stick with sequential requests, i.e. only processing a request on the GPU after the previous one completes. I have not had a chance to do a more thorough investigation here, so for now the plan is to use sequential inference as the default and to revisit batching later on.
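One simple way to get sequential behavior is to guard the GPU call with a lock so that a request only starts once the previous one has finished. The sketch below is an assumption about how this could be done, not the planned implementation; `SequentialExecutor` is a hypothetical name.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

class SequentialExecutor:
    """Runs jobs one at a time, even when called from many threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def run(self, job: Callable[[], T]) -> T:
        # Callers block here until the previous job has completed.
        with self._lock:
            return job()
```

A request handler would wrap its inference call in `executor.run(...)`, which keeps the API server concurrent while the GPU only ever sees one request at a time.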
Code
No code ready to be run yet, but for anyone that is interested in following along, development will be happening in the following repos:
We’ll also start creating issues that are good for open source contribution in the next week or two, so anyone interested in those should stay tuned.
Timeline
- Create a demo of a hosted API processing text-to-image, image-to-image and image-to-video requests using a single unpaid orchestrator
  - Target date: 1/22
- Release a benchmarking tool that orchestrators can use to start experimenting with the workloads for the job types for the subnet MVP
  - Target date: 1/29
- Create a demo of a hosted API processing text-to-image, image-to-image and image-to-video requests using multiple unpaid orchestrators
  - Target date: 2/5
Note: These demos are only meant to demonstrate software functionality. Community participation in running the software is a separate matter.
Additional milestones on the timeline to be added in future updates.