AI Video Compute Technical Update 4/08/24


Hey all :wave:,

I’m excited to share our latest technical update with you! After a month of part-time onboarding to familiarize myself with our project’s structure and codebase, I’m now teaming up with Eli and our skilled team to spearhead our core AI development initiatives. This transition occurs as Yondon shifts to an advisory role.

From here on out, I’ll be your go-to for tech updates. Please feel free to reach out with any questions, either here on Discord or during our weekly Water-cooler calls, where I’ll be available for the first 30 minutes to dive deeper into any discussions.

I’m excited about the fantastic achievements we will realize as a community!

Summary

Since the last update, we have focused on transitioning Yondon’s initial AI subnet implementation from off-chain and test networks to the Arbitrum Ethereum mainnet. Our efforts have centred on deployment and extensive data gathering to identify potential enhancements and ensure the AI subnet’s foundational integrity and performance.

Highlights

  • Log and Metrics Collection: Implemented a system for collecting logs and metrics, enhancing our monitoring capabilities and issue identification within the AI subnet.

  • AI Subnet Gateway Deployment: Successfully deployed the Livepeer AI subnet Gateway on-chain, enabling AI inference request processing.

  • AI Subnet Onboarding and Documentation: Designed an onboarding pipeline and drafted documentation to facilitate the addition of AI node operators to our network.

  • Onboarding of AI Node Operators: Welcomed the first 6 AI node operators in our pre-alpha test group, followed by successful on-chain ticket redemptions.

  • Fallback Solution: Introduced a fallback mechanism to sustain demand for the dApp during the subnet’s nascent stages.

  • Load Testing: Conducted initial load tests to pinpoint potential bottlenecks and areas for improvement.

  • Community dApp Launch: Successfully ran the Community GIF Competition on the community dApp, providing an engaging community event and valuable data on subnet performance.

  • Continuous Onboarding and Improvements: We added another batch of nine AI node operators and enhanced the livepeer-payouts-bot to improve community visibility of AI ticket redemptions directly in the Discord server.

Currently, our network comprises 15 AI node operators, has processed 1.34k text-to-image and 510 image-to-video jobs, and has issued 9.18k tickets, with 57 winning tickets redeemed on-chain, totaling 0.068 ETH :rocket:.

Terminology Brief

For clarity, here’s a quick glossary of key terms used in this update:

  • Mainnet: The primary version of the Livepeer protocol, utilizing smart contracts on the Arbitrum One blockchain.

  • Mainnet transcoding network: Denotes the set of broadcasters and orchestrators using the mainnet to coordinate the execution of transcoding jobs.

  • Mainnet AI subnet: A set of broadcasters and orchestrators using the mainnet to coordinate the execution of AI inference jobs.

  • AI Subnet Orchestrator: A specialized node tasked with carrying out AI inference operations within the subnet.

  • AI Subnet Gateway: A designated node that routes AI tasks to the correct AI Orchestrators for processing.

  • Pre-alpha and Alpha Phases: Stages in the AI subnet onboarding process for testing in controlled and more open environments, respectively.

  • Inference Pipelines: The text-to-image and image-to-video AI capabilities within the subnet.

Notable Updates

Metrics Collection and Monitoring

We’ve rolled out an opt-in promtail agent for Orchestrators to supply logs and metrics, aiding early issue detection and network monitoring. Active monitoring has already unearthed several improvement opportunities.
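
On the node side, metrics exposure conceptually looks like the hedged Go sketch below: counters registered with a Prometheus client and served on a /metrics endpoint that the monitoring stack can scrape, while Promtail separately ships the logs. The metric names, labels, and port are illustrative, not the actual ones go-livepeer emits.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// aiJobsProcessed counts completed AI inference jobs per pipeline.
// The metric name and labels are illustrative, not the ones go-livepeer emits.
var aiJobsProcessed = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "ai_jobs_processed_total",
		Help: "Number of AI inference jobs processed, labeled by pipeline and outcome.",
	},
	[]string{"pipeline", "outcome"},
)

func main() {
	prometheus.MustRegister(aiJobsProcessed)

	// Simulate a couple of job results so the endpoint has data to show.
	aiJobsProcessed.WithLabelValues("text-to-image", "success").Inc()
	aiJobsProcessed.WithLabelValues("image-to-video", "error").Inc()

	// Expose /metrics for a Prometheus scraper; logs would be shipped
	// separately by the opt-in promtail agent mentioned above.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":7935", nil) // placeholder port
}
```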

AI Subnet Onboarding: Comprehensive Update

In the past few weeks, our team has concentrated on significant advancements with the Livepeer AI subnet, particularly the deployment of the AI subnet Gateway on-chain and the commencement of AI node operator onboarding. This structured process is divided into pre-alpha and alpha phases, each meticulously designed to incrementally enhance our network’s capacity and collate crucial performance data.

Deployment of the AI Subnet Gateway

We’ve made a significant leap by deploying the Livepeer AI subnet Gateway on-chain (0x012345dE92B630C065dFc0caBE4eB34f74f7FC85). This Gateway is now operational, processing AI inference requests and channelling them to the AI node operators within the subnet, attaching payment tickets along the way. We do not yet provide documentation on deploying your own AI subnet Gateway, but we will publish it in the future.
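
For dApp developers curious how a job reaches the Gateway, here is a minimal Go sketch of submitting a text-to-image request. The gateway URL, endpoint path, and payload fields are assumptions for illustration, not the definitive API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical gateway URL and endpoint; the real paths and payload
	// fields may differ from this sketch.
	gateway := "https://<ai-subnet-gateway>/text-to-image"

	payload, _ := json.Marshal(map[string]any{
		"model_id": "<model-id>",
		"prompt":   "a rocket launching over a neon skyline",
	})

	resp, err := http.Post(gateway, "application/json", bytes.NewReader(payload))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	// The Gateway selects an AI Orchestrator, forwards the job, and attaches
	// payment tickets along the way; the caller only sees the final result
	// (or an error such as insufficient capacity).
	fmt.Println("status:", resp.Status)
}
```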

Orchestrator Onboarding Achievements

Pre-alpha Phase Successes:

  • Onboarded the first six AI node operators from our pre-alpha test group and confirmed the first successful on-chain ticket redemptions.

Alpha Phase Progress:

  • Following a successful pre-alpha, we’ve onboarded an additional cohort of nine AI node operators from the alpha test group, reaching our milestone of 15 AI node operators performing AI inference jobs and providing metrics.

As we continue to evolve and expand, those unable to join during the alpha phase should not be disheartened. We are developing an automatic discovery algorithm that will enable any of the top 100 orchestrators to seamlessly integrate into the AI subnet, ensuring wide-scale participation and network enhancement.

Fallback Solution Design

We’ve crafted and deployed a fallback solution to guarantee sufficient dApp demand. This solution ensures network sustainability during the early stages, combining cloud functions and orchestrator nodes for seamless demand management. We will scale this solution down as the network matures.
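
Conceptually, the fallback routing can be pictured with the hedged Go sketch below: try the AI subnet Gateway first and, if the request errors or times out, resubmit the same job to a cloud-hosted endpoint running the same pipeline. The URLs, timeout, and payload are placeholders, not our production configuration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// submit posts a job to the given endpoint and treats any non-2xx status as
// a failure. Endpoints and payloads are placeholders for this sketch.
func submit(url string, body []byte, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return errors.New("job failed: " + resp.Status)
	}
	return nil
}

func main() {
	job := []byte(`{"prompt": "example"}`)

	// Try the AI subnet gateway first; fall back to a cloud function that
	// runs the same pipeline if the subnet cannot serve the request in time.
	if err := submit("https://<ai-subnet-gateway>/text-to-image", job, 90*time.Second); err != nil {
		fmt.Println("subnet failed, using fallback:", err)
		if err := submit("https://<cloud-function>/text-to-image", job, 90*time.Second); err != nil {
			fmt.Println("fallback also failed:", err)
			return
		}
	}
	fmt.Println("job served")
}
```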

Load Testing Overview

Following the successful onboarding of the pre-alpha participants and verification that the ticket redemption process worked as intended, we collaborated with @CJ on the AI subnet’s inaugural load tests. These tests assessed the subnet’s ability to manage the anticipated dApp workload. We structured the testing into a low simulated load test and a high real-time load test during the Community GIF Competition to evaluate the AI subnet’s performance under varying stress levels.

Simulated Load Test Details

For the simulated tests, we deployed specific models on the six pre-alpha AI node operators.

The text-to-image tasks achieved a 100% success rate, demonstrating robust performance. However, the image-to-video tasks encountered a 69% failure rate, attributable to the intensive processing requirements (23-60 seconds per request) that exceeded our current capacity. This insight led to activating a fallback solution for the Community GIF Competition, aiming for uninterrupted service during peak demands. We’re working on model optimizations to reduce processing times and expand our operator network for enhanced capacity.
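
For context on how such success rates are measured, here is a rough Go sketch of a load-test client in the same spirit: it fires a batch of concurrent requests at an image-to-video endpoint and tallies the success percentage. The endpoint, request shape, and numbers are placeholders; this is not our actual test harness.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const requests = 50
	endpoint := "https://<ai-subnet-gateway>/image-to-video" // placeholder URL

	var ok, failed int64
	var wg sync.WaitGroup

	// Image-to-video jobs took 23-60 seconds each, so the client timeout
	// needs to be generous.
	client := &http.Client{Timeout: 120 * time.Second}

	start := time.Now()
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A real test would post an input image; this sketch only checks
			// whether the gateway accepts and completes the job in time.
			resp, err := client.Get(endpoint)
			if err != nil {
				atomic.AddInt64(&failed, 1)
				return
			}
			resp.Body.Close()
			if resp.StatusCode >= 300 {
				atomic.AddInt64(&failed, 1)
				return
			}
			atomic.AddInt64(&ok, 1)
		}()
	}
	wg.Wait()

	total := ok + failed
	fmt.Printf("completed %d requests in %s: %.1f%% success\n",
		total, time.Since(start).Round(time.Second),
		100*float64(ok)/float64(total))
}
```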

Community dApp Launch and Results

The Community GIF Competition marked a practical test of our load preparations, engaging the community and generating valuable performance data. We added 3 Tensordock orchestrators to bolster our capabilities and implemented a dynamic scaling solution via a cloud function. This event featured a total of 9 AI node operators, with models pre-loaded or dynamically assigned as needed.

Notable metrics from the event (held from 11:30 AM – 12:30 AM EDT on October 7th) included:

  • Text-to-Image Jobs: 168 by AI subnet, 24 by fallback.

  • Image-to-Video Jobs: 82 by AI subnet, 37 by fallback.

  • On-Chain Payments: 316 payments were created, and one winning ticket was redeemed, valued at 0.0012 ETH.

Despite lower-than-anticipated ticket redemptions, primarily due to adjusted job pricing for deposit sustainability, our network successfully handled 87.5% of text-to-image jobs (168 of 192) and 68.9% of image-to-video jobs (82 of 119). This underlines our capacity to manage significant portions of dApp demand, bolstered by ongoing model optimizations and network expansion.

Moving forward, we’re confident in our ability to support full dApp load, thanks to forthcoming enhancements and broader orchestrator engagement.

Known limitations

The AI subnet is currently in its early stages, and some known limitations will be addressed in the upcoming weeks:

  • The AI subnet can only support one container per capability (i.e., pipeline) per orchestrator. This means an orchestrator can only run one text-to-image container and one image-to-video container simultaneously, even if it has multiple GPUs available (a small sketch of this constraint follows this list).

  • The AI subnet can currently only support one AI inference container per GPU. This means we cannot fully utilize the VRAM of higher-memory server GPUs connected to the AI subnet.
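
To make the current constraint concrete, the toy Go sketch below models the 1:1 mappings described above: one container per GPU and one container per pipeline per orchestrator. The types are purely illustrative and not the actual go-livepeer structures.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative types only; these are not the actual go-livepeer structures.
type Pipeline string

const (
	TextToImage  Pipeline = "text-to-image"
	ImageToVideo Pipeline = "image-to-video"
)

// gpuAssignments captures the current limitation: at most one container per
// pipeline per orchestrator, and at most one container per GPU, regardless
// of how much VRAM each GPU has.
type gpuAssignments map[int]Pipeline // GPU index -> the single pipeline it serves

func assign(a gpuAssignments, gpu int, p Pipeline) error {
	if _, busy := a[gpu]; busy {
		return errors.New("GPU already runs a container")
	}
	for _, existing := range a {
		if existing == p {
			return errors.New("orchestrator already runs a container for this pipeline")
		}
	}
	a[gpu] = p
	return nil
}

func main() {
	a := gpuAssignments{}
	fmt.Println(assign(a, 0, TextToImage))  // <nil>
	fmt.Println(assign(a, 1, ImageToVideo)) // <nil>
	fmt.Println(assign(a, 2, TextToImage))  // error: a second text-to-image container is not allowed yet
}
```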

Identified bottlenecks and areas for improvement

During our load tests, among some more minor issues, the following bottlenecks and areas for improvement were identified:

  • Selection Algorithm Needs Refinement: The selection process the Gateway uses to distribute AI jobs among orchestrators requires optimization. Currently, it tends to repeatedly choose the same orchestrators for new AI inference jobs, even when they’re already busy. This stems from a strategic choice to deploy the subnet quickly by reusing the Mainnet Transcoding Network’s selection algorithm for a different purpose. Furthermore, the system returns an insufficient capacity error if it cycles through the list of available orchestrators four times without finding a free one. Introducing a queuing mechanism at the broadcaster level and improving the selection algorithm would significantly improve this behaviour (a toy sketch of the current selection loop follows this list).

  • Orchestrators get paid for failed jobs: We currently pay orchestrators upfront, even when they lack the capacity to process the job. As a result, orchestrators are also paid for jobs on which they simply returned an insufficient capacity error.

  • Problems with the AI subnet on multi-GPU machines: Some orchestrators reported that, on multi-GPU machines, containers kept reloading even though they were supposed to be kept warm.

  • VRAM is not released when the container is stopped: VRAM was not released when an AI orchestrator was shut down. This is tolerable because containers are cleaned up on startup, but explicitly releasing VRAM on shutdown would be cleaner.

  • Container Capacity Management Issue: We’ve identified a challenge affecting inference container VRAM capacity management due to specific request types. We’re addressing it to prevent potential issues and ensure robust network performance.

  • Silent container crashes: We identified several container crashes during startup that were never communicated back to the orchestrator, causing the orchestrator to shut down after a timeout period.

  • Broadcasters sometimes waited on requests without any feedback: As our app developers pointed out, a feedback protocol for long-running requests would be very beneficial.
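
As a companion to the first bullet above, the toy Go sketch below mimics the current selection behaviour: cycle over the known orchestrators a fixed number of times and give up with an insufficient capacity error if none has a free slot. A broadcaster-side queue would hold the job instead of erroring. Everything here is simplified and illustrative, not the actual go-livepeer selection code.

```go
package main

import (
	"errors"
	"fmt"
)

// orch is a toy stand-in for an AI Orchestrator; only its free capacity matters here.
type orch struct {
	addr string
	free int // free inference slots
}

var errInsufficientCapacity = errors.New("insufficient capacity")

// selectOrch mimics the behaviour described above: cycle over the known
// orchestrators up to maxPasses times and return the first one with a free
// slot, otherwise give up with an insufficient capacity error. A queuing
// mechanism at the broadcaster/Gateway level would hold the job instead.
func selectOrch(orchs []*orch, maxPasses int) (*orch, error) {
	for pass := 0; pass < maxPasses; pass++ {
		for _, o := range orchs {
			if o.free > 0 {
				o.free--
				return o, nil
			}
		}
		// This is also where better weighting/shuffling would avoid
		// repeatedly hitting the same busy orchestrators first.
	}
	return nil, errInsufficientCapacity
}

func main() {
	orchs := []*orch{
		{addr: "0xAAA...", free: 0},
		{addr: "0xBBB...", free: 1},
	}

	o, err := selectOrch(orchs, 4)
	fmt.Println(o.addr, err) // 0xBBB... <nil>

	_, err = selectOrch(orchs, 4)
	fmt.Println(err) // insufficient capacity: all slots busy
}
```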

Upcoming milestones

In the coming two weeks, we will focus on fixing the identified bottlenecks and areas for improvement. Additionally, we will be working on the following milestones:

  • Implement an automatic discovery algorithm: We will implement an automatic discovery algorithm that allows any of the top 100 orchestrators to join the AI subnet (a rough sketch of this idea follows this list).

  • Improve selection algorithm: We plan to implement a queuing solution at the broadcaster level to give the network more time before throwing an insufficient capacity error.

  • Allow multiple containers per capability per orchestrator: We plan to allow multiple containers per capability per orchestrator to utilize the full capacity of orchestrators with multiple GPUs.
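
As a rough illustration of the discovery milestone, the hedged Go sketch below takes the top 100 orchestrators by stake and keeps those that advertise AI capabilities. The data types and the way capabilities are probed are assumptions for this sketch, not the final design.

```go
package main

import (
	"fmt"
	"sort"
)

// onChainOrch is a simplified view of an orchestrator registered on mainnet.
// Fields and the capability probe are illustrative, not the actual protocol.
type onChainOrch struct {
	addr       string
	stake      float64 // delegated stake, used for the top-100 cutoff
	supportsAI bool    // would be learned by probing the orchestrator's service URI
}

// discoverAIOrchs keeps the top n orchestrators by stake and returns the
// subset that advertise AI capabilities, which the Gateway could then use
// for selection without manual onboarding.
func discoverAIOrchs(all []onChainOrch, n int) []onChainOrch {
	sort.Slice(all, func(i, j int) bool { return all[i].stake > all[j].stake })
	if len(all) > n {
		all = all[:n]
	}
	var ai []onChainOrch
	for _, o := range all {
		if o.supportsAI {
			ai = append(ai, o)
		}
	}
	return ai
}

func main() {
	candidates := []onChainOrch{
		{addr: "0xAAA...", stake: 1_200_000, supportsAI: true},
		{addr: "0xBBB...", stake: 900_000, supportsAI: false},
		{addr: "0xCCC...", stake: 450_000, supportsAI: true},
	}
	for _, o := range discoverAIOrchs(candidates, 100) {
		fmt.Println("AI-capable orchestrator:", o.addr)
	}
}
```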
