Metrics and SLA Foundations for NaaP

Thank you to everyone who reviewed the earlier pre-proposal and shared detailed feedback in the forum and during the Watercooler. The concerns raised around scope, cost, architectural risk, and MVP clarity were well-founded and directly informed this revision.

This updated pre-proposal reflects a deliberate reset toward a smaller, clearer Network-as-a-Product MVP. The scope has been significantly narrowed, the budget reduced, and the architecture simplified to prioritize time-to-value, reuse of existing Livepeer infrastructure, and immediate usefulness to gateways, orchestrators, and ecosystem teams.

Below is the revised pre-proposal. We welcome the community’s review and feedback on the updated scope, design, and framing. We will be present at this coming Monday’s Water Cooler for discussion.


Cloud SPE Pre-Proposal: Network-as-a-Product (NaaP) MVP – SLA Metrics, Analytics, and Public Infrastructure


Abstract

This pre-proposal seeks treasury funding for the Livepeer Cloud Special Purpose Entity (SPE) to design, build, and operate a focused Network-as-a-Product (NaaP) MVP for SLA metrics, analytics, and public visibility.

The objective of this work is to make the Livepeer network measurable, comparable, and trustworthy at a network level by delivering a small but complete set of standardized performance, reliability, and demand metrics. These metrics will be publicly observable and designed to support gateway providers, orchestrators, and ecosystem builders evaluating Livepeer as production infrastructure.

This MVP intentionally prioritizes time-to-value, architectural simplicity, and reuse of existing Livepeer infrastructure, while establishing a durable foundation for future SLA-aware routing, scaling, and productization efforts led by Livepeer Inc, the Livepeer Foundation, and the community.


Rationale

As Livepeer advances toward the Network-as-a-Product vision, predictable service characteristics and transparent performance signals become essential. While the network supports real workloads today, participants lack a shared, network-wide view of performance, reliability, and demand that can be used to assess suitability for production use.

Community discussions around earlier drafts of this initiative strongly aligned on the problem, while raising important concerns around scope, cost, architectural risk, and MVP clarity. This pre-proposal reflects that feedback by narrowing focus to a practical MVP that:

  • Demonstrates clear value with minimal complexity
  • Leverages existing data sources and pipelines wherever possible
  • Avoids protocol changes, enforcement mechanisms, or premature decentralization
  • Produces immediately usable outputs for real network participants

Key challenges addressed by this proposal include:

  • Fragmented metrics: Existing performance and reliability data is dispersed across systems and difficult for non-core teams to consume.
  • Limited network-level visibility: Gateway providers and orchestrators cannot easily compare performance across regions, workloads, or peers.
  • Adoption friction: Without transparent, shared metrics, external developers and partners struggle to evaluate Livepeer for serious workloads.
  • Missing foundation for NaaP evolution: Future SLA-aware routing, scaling, and automation require a trusted measurement layer first.

The Cloud SPE is well positioned to deliver this work as neutral, public infrastructure, building on its prior experience operating gateways, test tooling, dashboards, and analytics for the Livepeer network.

Importantly, this proposal does not attempt to enforce SLAs, modify protocol incentives, or introduce new routing logic. Its purpose is to establish shared measurement and learning infrastructure as a prerequisite for those future decisions.


Deliverables

The NaaP MVP will deliver a constrained, end-to-end metrics system focused on observability and learning, inspired by the NaaP product MVP and the Foundation roadmap.

1. Core SLA Metrics (MVP Scope)

  • A standardized set of network, performance, and reliability metrics sufficient to evaluate orchestrator and GPU behavior across workflows.
  • Metrics sourced primarily from the job tester gateway and orchestrator-emitted telemetry, with targeted additions only when other gateways opt in.

2. Network Test & Verification Signals

  • Operation of one or more reference load-test gateways to generate consistent, reproducible performance signals for live AI video pipelines.
  • Public test scenarios (i.e., test datasets) designed to reflect real workloads while remaining transparent and community-verifiable. These will be captured in GitHub.
  • Test results contributed into the same analytics layer as organic network traffic to enable comparison (when other Gateways participate).

3. Analytics & Aggregation Layer

  • Lightweight ETL and aggregation pipelines to transform raw metrics into network-level views.
  • Computation of a small number of derived indicators, as outlined in the Metrics Catalog.
  • Data structured for efficient querying without requiring dashboards to load raw job data.
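To make the intent of this layer concrete, the kind of roll-up it performs can be sketched in a few lines. This is purely illustrative: the production pipeline described in later updates uses Apache Flink and ClickHouse, and the field names below are hypothetical.

```python
# Illustrative aggregation of raw job telemetry into network-level views.
# Field names and values are hypothetical, not the real schema.
raw_jobs = [
    {"orchestrator": "0xabc", "region": "FRA", "latency_ms": 420, "success": True},
    {"orchestrator": "0xabc", "region": "FRA", "latency_ms": 510, "success": True},
    {"orchestrator": "0xabc", "region": "FRA", "latency_ms": 905, "success": False},
    {"orchestrator": "0xdef", "region": "MDW", "latency_ms": 330, "success": True},
]

def aggregate(jobs):
    """Roll raw job rows up into per-orchestrator derived indicators."""
    buckets = {}
    for job in jobs:
        b = buckets.setdefault(job["orchestrator"], {"n": 0, "ok": 0, "latencies": []})
        b["n"] += 1
        b["ok"] += int(job["success"])
        b["latencies"].append(job["latency_ms"])
    return {
        orch: {
            "jobs": b["n"],
            "success_rate": b["ok"] / b["n"],
            "median_latency_ms": sorted(b["latencies"])[len(b["latencies"]) // 2],
        }
        for orch, b in buckets.items()
    }

summary = aggregate(raw_jobs)
print(summary["0xabc"])
```

Dashboards then query these pre-aggregated views rather than the raw job rows, which keeps queries cheap as job volume grows.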

4. Public Dashboard & APIs

  • A standalone public dashboard presenting live and historical metrics.
  • Public, read-only APIs for aggregate SLA scores and hardware data.
  • Clear paths for gateways and ecosystem teams to consume the data directly or mirror it into their own analytics systems.
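As a sketch of the intended consumption path, an ecosystem team could fetch an aggregate endpoint and rank orchestrators client-side. Note that the endpoint path and response shape shown here are assumptions for illustration, not the published API.

```python
import json

# Hypothetical response body from a read-only aggregate SLA endpoint,
# e.g. GET /api/v1/sla/scores (path and fields are assumptions).
payload = json.loads("""
{
  "scores": [
    {"orchestrator": "0xabc", "sla_score": 0.97},
    {"orchestrator": "0xdef", "sla_score": 0.88},
    {"orchestrator": "0x123", "sla_score": 0.93}
  ]
}
""")

def rank_orchestrators(body, minimum=0.9):
    """Return orchestrators meeting a score floor, best first."""
    eligible = [s for s in body["scores"] if s["sla_score"] >= minimum]
    return sorted(eligible, key=lambda s: s["sla_score"], reverse=True)

ranked = rank_orchestrators(payload)
print([s["orchestrator"] for s in ranked])
```

Because the APIs are read-only and aggregate, a gateway can mirror this data into its own analytics system without touching raw job records.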

5. Operations & Stewardship

  • Ongoing operation of testing, analytics, and dashboard infrastructure.
  • Maintenance, monitoring, and community support for the MVP for 1 year.

Any work not outlined here is not part of the deliverables and is out of scope for this proposal.


Key Milestones

Milestone 1 – Metrics Collection & Aggregation

  • Define and implement the minimal metrics set
  • Aggregate existing telemetry into a unified analytics layer
  • A basic dashboard showing sample data flowing end to end

Milestone 2 – Test Signals & Derived Analytics

  • Deploy reference load-test gateways
  • Launch a public dashboard with core views
  • APIs for ecosystem consumption

Milestone 3 – Stabilization & Review

  • Harden infrastructure for reliability and cost efficiency
  • Document metrics, assumptions, and known gaps
  • Review outcomes with the community to determine next steps

Timeline

Delivery is anticipated to take approximately six months (work has been underway since November 2025). This timeline depends on the team’s development velocity and is subject to change. Preliminary design and validation work has begun to reduce delivery risk.

  • November 2025 – Work began on the original proposal and the discovery process
  • February 2026 – Milestone 1: Metrics Collection & Aggregation
  • March 2026 – Milestone 2: Test Signals & Derived Analytics
  • April 2026 – Milestone 3: Stabilization & Review

Budget

Total Requested Budget: $90,000

This budget supports:

  • Engineering work to aggregate, validate, and expose SLA-relevant metrics
  • Development of Load Testing Gateway (AI Job Tester + Gateway enhancements) and Network Data Scraper
  • Development of minimal analytics and public-facing dashboards
  • Development of DevOps infrastructure and automation
  • Operation of testing, analytics, and storage infrastructure for approximately one year
  • Ongoing maintenance, documentation, and community support

The budget is intentionally sized for a thin but complete MVP, designed to validate assumptions, inform future investment, and avoid long-term commitments before value is demonstrated.


Closing Note

This pre-proposal reflects extensive community and Livepeer Inc feedback and represents a deliberate step toward a simpler, clearer, and more actionable NaaP MVP.

By focusing on shared measurement rather than enforcement or protocol change, this work aims to give the Livepeer ecosystem a common understanding of network behavior today — and a solid foundation for deciding what to build next.


Hello @speedybird.

We think that this proposal considerably improves the value offered to the protocol via:

a) reduced architectural complexity (and future debugging time)
b) implementation of data aggregation instead of building metric solutions ex nihilo
c) a research phase (Milestone 1) in which the team assesses the needs of the ecosystem

To clarify feedback from the last Watercooler: protocol stakeholders (particularly orchestrators) are structurally aligned with Livepeer’s long-term success and are often incentivized to contribute under economics that prioritize network growth over short-term premium pricing. Maintaining this alignment is important, as consistently higher cost structures for core infrastructure work can limit the protocol’s ability to onboard and incentivize the next wave of startups needed to drive sustainable fee growth. The revised pricing structure strongly articulates this position and represents a near game-theoretic optimum for aligning incentives across protocol stakeholders.

Some questions that might help you during phase 1 of the deliverable:

  • What do you need to know as a new builder who has just entered the community and wants to engineer some workloads? Would it be helpful to have a dashboard with the GPU and system specs that each orchestrator supports?

  • How quickly can the proposed solution integrate metrics for new workloads? Will it need feature- or workload-specific engineering, or is it integratable in a plug-and-play manner?

  • How should we aggregate data from maintainers of workloads? Should there be a standardized process whereby any new builder can set up an aggregation endpoint that the Cloud SPE can consume?

Honestly, great work with implementing all the feedback from the community. We would be very glad to support your effort.


Kudos to @MikeZupper and @speedybird for the rigorous effort put into shaping this proposal.

From my side, I can confirm that the proposal is strongly aligned with the Make Network Data More Observable roadmap item. It covers all of the must-haves and should-haves, and addresses most of the nice-to-haves outlined in that brief.

It is also worth emphasizing that the original roadmap item was intentionally high-level, meant to spark exploration and direction rather than define an exhaustive specification. Significant work has since been done by both the Cloud SPE team and the community to refine this into a concrete proposal, grounded in explicit needs identified for the network as a product initiative.

Overall, I believe this is strong work. The team is well-positioned for success, and any minor adjustments can be handled pragmatically during execution.


Great to see this next iteration. Getting the right scope here has meant dealing with a lot of complexity. Huge shout out to @MikeZupper and @speedybird for integrating feedback quickly and effectively.

Moving forward, I would encourage regular collaboration with Inc, the Foundation and a few other core contributors who see this piece of work as critical to the Livepeer network being a go-to, transparent network for video / AI compute. I know that @Mehrdad will be a great thought partner here for how we can keep the work transparent and accountable to the milestones.

Awesome job. Excited to see the proposal onchain. :fire:


Being the key stakeholder for this proposal (Milestone 1) from Inc, I really appreciate the Cloud SPE team taking the initiative to respond to community review comments with both speed and open-minded flexibility.

The new proposal is in line with what Inc’s Daydream project is looking for in terms of fundamental network observability. It is a very important step for both Inc and the community to have a measurable network foundation, so that gateway providers can bring SLAs to their users. All of this starts with a transparent, systematic, and extensible way to collect and consume the key AI job metrics, which is exactly what this second draft of the proposal tries to accomplish.

To ensure the accountability and credibility of all parties involved, the Foundation has helped us initiate a weekly developer chat to inform the community about what is happening, and we also hold a bi-weekly sprint review to ensure the progress is clearly demoable. The milestone key deliverables are outlined very clearly by Cloud SPE in the proposal.

On behalf of Inc, I endorse this proposal and am committed to working with the Cloud SPE to deliver it.


Thank you everyone for your feedback and support! After the discussion in the Water Cooler and the support expressed in the forum, we intend to put this up for a vote at the end of today or early tomorrow. We look forward to the next phase of this project and to continued feedback from the community.


As an update, this pre-proposal has been promoted to a full Treasury proposal. Please participate before January 25th by casting your vote.

We are looking forward to the next steps for Livepeer and making a meaningful contribution to the roadmap!


Cloud SPE — Update #1

Period: January 1, 2026 – January 31, 2026

Status: On track

Summary:

Following the successful passage of our Treasury Proposal, we have established our core data infrastructure and completed a comprehensive inventory and quality validation of existing network data. These foundational steps ensure the accuracy of our analytics pipeline as we approach the completion of Milestone 1.

Completed Deliverables:

Milestone 1: Initial Infrastructure & Data Validation (partial delivery; due February 6)

Built the initial infrastructure to support data ingestion, processing, and deployment to the analytics stack, resulting in an initial functional end-to-end flow into a basic Grafana dashboard.

ETA for Next Update: March 1, 2026

Planned by Next Update:

  • End-to-End Dashboard: Deployment of a public Grafana dashboard displaying several key live analytics metrics.
  • Data Integrity: Completion of all data quality validation and automated capture processes.
  • ETL Pipeline: Significant progress on the Extract, Transform, Load (ETL) processing pipeline to handle complex data transformations.
  • Infrastructure Stress Test: Validation of the current infrastructure’s stability and scalability to handle increased data loads.

Cloud SPE — Update #2

Period: February 1, 2026 – February 28, 2026

Status: On track


Summary

During February, we progressed from foundational setup into active operational validation. With Iterations 1–4 now complete, the analytics stack is running real-time workloads and supporting live performance testing across multiple regions.

The core data pipeline, schema design, and validation work are now finalized, positioning the project to transition from alpha infrastructure into production-grade APIs and SLA measurement in the next phase.


Completed Deliverables

Milestone Progress: Infrastructure, Real-Time Testing & Pipeline Implementation

Expanded the analytics platform from initial infrastructure into a fully operational real-time testing and processing environment, enabling continuous measurement of orchestrator performance and AI workload characteristics.

  • Iterations Completed: 4 of 7 total iterations

  • AI Job Tester:
    Running real-time AI video job tests across SEA, MDW, and FRA regions

  • Grafana Dashboard (v2):
    Grafana

  • Data Layer:

    • Data validation and quality processes completed
    • Finalized schema design and query patterns
  • Processing Pipeline:

    • Apache Flink data pipeline designed and implemented
  • APIs (Alpha Release):

Note: All data is currently from the Cloud SPE AI Job tester and does not include production job data from Daydream.

These deliverables collectively establish the first fully integrated analytics loop:

test → ingest → process → visualize → query via API


GitHub Links:


ETA for Next Update

March 31, 2026


Planned by Next Update

  • Production Workload Ingestion
    Integrate Daydream data to reflect real application demand patterns

  • API General Availability
    Release finalized versions of all analytics APIs

  • SLA Scoring
    Deploy the production SLA scoring algorithm and provide API access

  • Gateway Performance Testing

    • Orchestrator swap rate analysis
    • Selection algorithm validation
  • Documentation (Drafts)

    • API specifications
    • Analytics pipeline architecture
    • Data schema and design guides
  • Production Readiness

    • Security hardening
    • Infrastructure scaling and performance optimizations

Are there plans to expand analytics coverage to other network pipelines, or is Cloud SPE currently focused exclusively on go-livepeer and Daydream/Scope workloads?


We definitely want to expand to other workloads. The initial scope is focused on Daydream (due April 10, 2026). The Cloud SPE has done a lot of work with custom BYOC pipelines that we plan to bring to the network. Analytics will be key for BYOC workloads.


Cloud SPE — Update #3

Period: March 1, 2026 – March 31, 2026

Status: On track

───

Summary

March marked the most intensive phase of the project — transitioning from alpha infrastructure into a production-ready analytics platform. We completed 6 of 7 total iterations, finalizing the data pipeline, hardening infrastructure, expanding the API surface well beyond the original scope, and delivering the first draft of SLA scoring logic.

The single largest effort this period was data analysis and quality assurance across the analytics pipeline. This work, spanning over 5 weeks across M1 and M2, fell squarely into what we call “hidden costs” — unplanned but essential work that surfaces in any data-intensive project. Getting the data right before building on top of it was non-negotiable.

With M2 now complete, the platform is live, usable, and ready for community consumption. Our final milestone (M3) is targeted for April 10, 2026.

───

Completed Deliverables

Milestone Progress: Production Pipeline, Hardened Infrastructure & Expanded API

Evolved the analytics platform from an alpha environment into a production-grade system with hardened infrastructure, a finalized data pipeline, and an API surface that significantly exceeds the original scope.

• Iterations Completed: 6 of 7 total iterations
• Finalized Analytics Pipeline:
End-to-end data processing pipeline fully implemented and validated for production use.
• Daydream Data Loading via Kafka (MirrorMaker 2):
Ingested 35+ million historical records into the analytics platform using Apache Kafka with MirrorMaker 2, enabling comprehensive network-wide analysis.
• Secured & Hardened Infrastructure:
Multiple Cloud SPE Ubuntu servers running Kafka, ClickHouse, API Server, Prometheus, Grafana, and the Analytics Pipeline — all hardened for production workloads.
• Expanded API Server (46 Endpoints):
Delivered the 3 promised APIs — GPU Metrics, SLA Compliance, and Demand — plus an additional 43 endpoints exposing a wide range of network data slices. We initially put the APIs inside of the legacy leaderboard-serverless api, but decided that service was not the ideal place for the suite of new apis going forward. In fact we will probably end up deprecating it. The new API is live at: NAAP Analytics API (NAAP Analytics API)
• Improved AI Job Tester:
Significant enhancements to live video-to-video workload testing:
• High, Medium, and Low prompt complexity testing
• Orchestrator cap detection
• Improved error handling
• Individual orchestrator tests plus full-network tests covering orchestrator selection, swap behavior, and failure rates
• SLA Scoring Logic (First Draft):
Implemented the initial SLA scoring model. The logic is usable and live today — designed to be adjusted and refined as users consume and provide feedback on the data.
• Data Analysis & Quality Assurance:
Extensive validation and quality checks across the analytics pipeline. This effort spanned 5+ weeks across M1 and M2 and represented the largest single unplanned cost in the project — a necessary investment to ensure data integrity before building production APIs on top of it.
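For readers curious what a first-draft scoring model of this kind typically looks like, the sketch below combines reliability, latency, and availability into a single 0–1 score. This is purely illustrative: the weights, inputs, and formula here are assumptions, not the deployed logic, which is designed to be refined based on user feedback.

```python
def sla_score(success_rate, p95_latency_ms, uptime,
              latency_target_ms=1000, weights=(0.5, 0.3, 0.2)):
    """Combine reliability, latency, and availability into a 0..1 score.

    Latency is normalized against a target: at or under the target it
    scores 1.0, degrading linearly to 0.0 at twice the target.
    All parameter names and weights are hypothetical.
    """
    latency_score = min(1.0, max(0.0, 2.0 - p95_latency_ms / latency_target_ms))
    w_success, w_latency, w_uptime = weights
    return round(
        w_success * success_rate + w_latency * latency_score + w_uptime * uptime, 4
    )

# A hypothetical orchestrator: 98% success, 1.2 s p95 latency, 99.5% uptime.
score = sla_score(success_rate=0.98, p95_latency_ms=1200, uptime=0.995)
print(score)
```

Keeping the weights and latency target as explicit parameters makes it straightforward to adjust the model as consumers of the data provide feedback.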

───

Key Resource

Same repositories as prior updates:

• NaaP Analytics Pipeline & API: Cloud-SPE/livepeer-naap-analytics (GitHub)
• Project Plans: NaaP MVP: Make Network Data More Observable (GitHub)
• Grafana Dashboard: https://grafana.livepeer.cloud/
• API Endpoints: NAAP Analytics API

───

What’s Next — M3 (Final Milestone)

ETA: April 10, 2026

• Final forum post with full project summary
• All final documentation published
• Project wrap-up and handoff
• Analytics Dashboard & API Demo and “Peek” behind the scenes of the analytics pipeline.

───

Looking Ahead

This API project represents the beginning of network analytics infrastructure for Livepeer — not the end. Continuing and expanding this work beyond M3 will require community involvement or additional treasury funding. We will continue working with the Livepeer Foundation and the NaaP platform team to scope and adjust the API and Analytics Pipeline to suit the network’s evolving needs.

NAAP Proposal Close-Out — All Deliverables Complete

Treasury Proposal: On-chain
Pre-Proposal Thread: Metrics and SLA Foundations for NaaP
Project Board: GitHub
April 10, 2026

We’re pleased to report that all five deliverables from the approved NAAP treasury proposal have been completed and all code is publicly available. Below is a full accounting of what was delivered against each proposal commitment.


The 5 Deliverables — Status

1. Core SLA Metrics – :white_check_mark: Complete
2. Network Test & Verification Signals – :white_check_mark: Complete
3. Analytics & Aggregation Layer – :white_check_mark: Complete
4. Public Dashboard & APIs – :white_check_mark: Complete
5. Operations & Stewardship – :white_check_mark: Active (1-year commitment ongoing)

What Was Delivered

1. NAAP Analytics Pipeline

Repo: Cloud-SPE/livepeer-naap-analytics (GitHub)

The core of the NAAP infrastructure. This repository contains:

  • REST API server with full OpenAPI spec (see live docs below)
  • Analytics ETL pipeline with extensive data analysis and validation
  • Grafana dashboard integration for live and historical SLA metrics
  • Production and staging infrastructure — security-hardened, with alerting and monitoring
  • 35+ million records loaded from Daydream with ongoing ingestion
  • Full development environment and extensive documentation

2. go-livepeer Updates

Repo: Cloud-SPE/go-livepeer (GitHub)

  • Proposal #2 changes (long-pending upstream, still awaiting merge after 2 years)
  • BYOC updates for NaaP: event emission, network capabilities, and BYOC-based selection logic
  • Kafka additions: BYOC and AI Batch event types
  • Kafka SASL security mechanism updates

3. AI Job Tester Updates

Repo: Cloud-SPE/livepeer-ai-job-tester (GitHub)

  • Live video-to-video support
  • BYOC and AI Batch event support
  • Testing for AI video jobs targeting individual orchestrators and network-wide selection/failure scenarios

4. NaaP UI

Repo: livepeer/naap (GitHub)

  • Major contributions integrating Cloud SPE’s REST APIs
  • Significant architecture changes improving API consumption patterns and developer ergonomics
  • Improved UI performance through smarter caching

5. Leaderboard Serverless

Repo: Cloud-SPE/livepeer-leaderboard-serverless (GitHub)

General cleanup only — no NaaP-specific changes were merged here. Initial prototypes were built in this repo but reverted; the architectural requirements for NaaP were better served by the dedicated analytics pipeline repo.


Live Infrastructure

  • Production Grafana Dashboard: https://grafana.livepeer.cloud
  • Staging Grafana Dashboard: https://grafana.cloudspe.com
  • Production API Server (OpenAPI): NAAP Analytics API
  • Staging API Server (OpenAPI): NAAP Analytics API

What Was Explicitly Out of Scope

As stated in the original proposal, the following were never part of NAAP:

  • Enforcing SLAs
  • Modifying protocol incentives
  • New routing logic
  • Protocol changes or decentralization

Closing Notes

Thank you to everyone who engaged during the pre-proposal process — the feedback shared in the forum and during the Watercooler was directly incorporated into the final proposal scope. The reset toward a smaller, clearer MVP was the right call and allowed us to deliver focused, production-grade infrastructure.

The infrastructure is live, the code is public, and the 1-year operations and stewardship commitment is active. We’ll continue monitoring, maintaining, and supporting the community as the NaaP ecosystem evolves.

— Cloud SPE, April 2026
