Metrics and SLA Foundations for NaaP

Thank you to everyone who reviewed the earlier pre-proposal and shared detailed feedback in the forum and during the Watercooler. The concerns raised around scope, cost, architectural risk, and MVP clarity were well-founded and directly informed this revision.

This updated pre-proposal reflects a deliberate reset toward a smaller, clearer Network-as-a-Product MVP. The scope has been significantly narrowed, the budget reduced, and the architecture simplified to prioritize time-to-value, reuse of existing Livepeer infrastructure, and immediate usefulness to gateways, orchestrators, and ecosystem teams.

Below is the revised pre-proposal. We welcome the community’s review and feedback on the updated scope, design, and framing. We will be present on this coming Monday’s Water Cooler for discussion.


Cloud SPE Pre-Proposal: Network-as-a-Product (NaaP) MVP – SLA Metrics, Analytics, and Public Infrastructure


Abstract

This pre-proposal seeks treasury funding for the Livepeer Cloud Special Purpose Entity (SPE) to design, build, and operate a focused Network-as-a-Product (NaaP) MVP for SLA metrics, analytics, and public visibility.

The objective of this work is to make the Livepeer network measurable, comparable, and trustworthy at a network level by delivering a small but complete set of standardized performance, reliability, and demand metrics. These metrics will be publicly observable and designed to support gateway providers, orchestrators, and ecosystem builders evaluating Livepeer as production infrastructure.

This MVP intentionally prioritizes time-to-value, architectural simplicity, and reuse of existing Livepeer infrastructure, while establishing a durable foundation for future SLA-aware routing, scaling, and productization efforts led by Livepeer Inc, the Livepeer Foundation, and the community.


Rationale

As Livepeer advances toward the Network-as-a-Product vision, predictable service characteristics and transparent performance signals become essential. While the network supports real workloads today, participants lack a shared, network-wide view of performance, reliability, and demand that can be used to assess suitability for production use.

Community discussions around earlier drafts of this initiative strongly aligned on the problem, while raising important concerns around scope, cost, architectural risk, and MVP clarity. This pre-proposal reflects that feedback by narrowing focus to a practical MVP that:

  • Demonstrates clear value with minimal complexity
  • Leverages existing data sources and pipelines wherever possible
  • Avoids protocol changes, enforcement mechanisms, or premature decentralization
  • Produces immediately usable outputs for real network participants

Key challenges addressed by this proposal include:

  • Fragmented metrics: Existing performance and reliability data is dispersed across systems and difficult for non-core teams to consume.
  • Limited network-level visibility: Gateway providers and orchestrators cannot easily compare performance across regions, workloads, or peers.
  • Adoption friction: Without transparent, shared metrics, external developers and partners struggle to evaluate Livepeer for serious workloads.
  • Missing foundation for NaaP evolution: Future SLA-aware routing, scaling, and automation require a trusted measurement layer first.

The Cloud SPE is well positioned to deliver this work as neutral, public infrastructure, building on its prior experience operating gateways, test tooling, dashboards, and analytics for the Livepeer network.

Importantly, this proposal does not attempt to enforce SLAs, modify protocol incentives, or introduce new routing logic. Its purpose is to establish shared measurement and learning infrastructure as a prerequisite for those future decisions.


Deliverables

The NaaP MVP will deliver a constrained, end-to-end metrics system focused on observability and learning, informed by the NaaP product MVP and the Foundation roadmap.

1. Core SLA Metrics (MVP Scope)

  • A standardized set of network, performance, and reliability metrics sufficient to evaluate orchestrator and GPU behavior across workflows.
  • Metrics sourced primarily from the job-tester gateway and orchestrator-emitted telemetry, with targeted additions only when other gateways opt in.

2. Network Test & Verification Signals

  • Operation of one or more reference load-test gateways to generate consistent, reproducible performance signals for live AI video pipelines.
  • Public test scenarios (aka test datasets) designed to reflect real workloads while remaining transparent and community-verifiable. These will be captured in GitHub.
  • Test results contributed into the same analytics layer as organic network traffic to enable comparison (when other Gateways participate).

3. Analytics & Aggregation Layer

  • Lightweight ETL and aggregation pipelines to transform raw metrics into network-level views.
  • Computation of a small number of derived indicators, as outlined in the Metrics Catalog.
  • Data structured for efficient querying without requiring dashboards to load raw job data.
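To make the aggregation step concrete, here is a minimal sketch of the kind of roll-up this layer would perform, assuming a hypothetical raw-job record shape (`orchestrator`, `latency_ms`, `success`). The field names and indicators are illustrative only; the actual schema and indicator set will follow the Metrics Catalog.

```python
from collections import defaultdict
from statistics import quantiles

def aggregate_jobs(jobs):
    """Roll raw per-job telemetry up into per-orchestrator indicators."""
    by_orch = defaultdict(list)
    for job in jobs:
        by_orch[job["orchestrator"]].append(job)

    summary = {}
    for orch, records in by_orch.items():
        # Latency is only meaningful for jobs that completed successfully.
        latencies = sorted(j["latency_ms"] for j in records if j["success"])
        if len(latencies) >= 2:
            p95 = quantiles(latencies, n=20)[-1]  # 95th-percentile estimate
        elif latencies:
            p95 = latencies[0]
        else:
            p95 = None
        summary[orch] = {
            "jobs": len(records),
            "success_rate": sum(j["success"] for j in records) / len(records),
            "p95_latency_ms": p95,
        }
    return summary
```

Dashboards would then query these pre-aggregated summaries rather than raw job data, which is what keeps the query path lightweight.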

4. Public Dashboard & APIs

  • A standalone public dashboard presenting live and historical metrics.
  • Public, read-only APIs for aggregate SLA scores and hardware information.
  • Clear paths for gateways and ecosystem teams to consume the data directly or mirror it into their own analytics systems.
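As an illustration of how a gateway or ecosystem team might consume such read-only APIs, the sketch below fetches aggregate scores and filters them locally. The endpoint path `/v1/sla/scores` and the JSON field names (`sla_score`, `orchestrator`) are hypothetical placeholders, not the MVP's actual API surface.

```python
import json
import urllib.request

# Hypothetical endpoint path and field names; the real API surface will be
# defined and documented by the MVP.
def fetch_sla_scores(base_url):
    """Fetch aggregate SLA scores from a public, read-only endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/sla/scores") as resp:
        return json.load(resp)

def top_orchestrators(scores, min_score=0.9):
    """Keep only orchestrators at or above a score threshold, best first."""
    return sorted(
        (s for s in scores if s["sla_score"] >= min_score),
        key=lambda s: s["sla_score"],
        reverse=True,
    )
```

The same pattern applies when mirroring the data into a team's own analytics system: poll the aggregate endpoints on a schedule and persist the responses locally.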

5. Operations & Stewardship

  • Ongoing operation of testing, analytics, and dashboard infrastructure.
  • Maintenance, monitoring, and community support for the MVP for 1 year.

Any work not outlined here is not part of the deliverables and is outside the scope of this proposal.


Key Milestones

Milestone 1 – Metrics Collection & Aggregation

  • Define and implement the minimal metrics set
  • Aggregate existing telemetry into a unified analytics layer
  • A basic dashboard showing sample data flowing end to end

Milestone 2 – Test Signals & Derived Analytics

  • Deploy reference load-test gateways
  • Launch a public dashboard with core views
  • APIs for ecosystem consumption

Milestone 3 – Stabilization & Review

  • Harden infrastructure for reliability and cost efficiency
  • Document metrics, assumptions, and known gaps
  • Review outcomes with the community to determine next steps

Timeline

Delivery is anticipated to take approximately six months (work has been underway since November 2025). The timeline depends on the team’s development velocity and is subject to change. Preliminary design and validation work has begun to reduce delivery risk.

  • November 2025 – Work began on the original proposal and discovery process
  • February 2026 – Milestone 1: Metrics Collection & Aggregation
  • March 2026 – Milestone 2: Test Signals & Derived Analytics
  • April 2026 – Milestone 3: Stabilization & Review

Budget

Total Requested Budget: $90,000

This budget supports:

  • Engineering work to aggregate, validate, and expose SLA-relevant metrics
  • Development of Load Testing Gateway (AI Job Tester + Gateway enhancements) and Network Data Scraper
  • Development of minimal analytics and public-facing dashboards
  • Development of DevOps infrastructure and automation
  • Operation of testing, analytics, and storage infrastructure for approximately one year
  • Ongoing maintenance, documentation, and community support

The budget is intentionally sized for a thin but complete MVP, designed to validate assumptions, inform future investment, and avoid long-term commitments before value is demonstrated.


Closing Note

This pre-proposal reflects extensive community and Livepeer Inc feedback and represents a deliberate step toward a simpler, clearer, and more actionable NaaP MVP.

By focusing on shared measurement rather than enforcement or protocol change, this work aims to give the Livepeer ecosystem a common understanding of network behavior today — and a solid foundation for deciding what to build next.


Hello @speedybird.

We think that this proposal considerably improves the value offered to the protocol via:

a) reduced architectural complexity (and future debugging time)
b) implementation of data aggregation instead of building metric solutions ex nihilo
c) introduction of a research phase (milestone 1) where the team assesses the needs of the ecosystem

To clarify feedback from the last Watercooler: protocol stakeholders (particularly orchestrators) are structurally aligned with Livepeer’s long-term success and are often incentivized to contribute under economics that prioritize network growth over short-term premium pricing. Maintaining this alignment is important, as consistently higher cost structures for core infrastructure work can limit the protocol’s ability to onboard and incentivize the next wave of startups needed to drive sustainable fee growth. The revised pricing structure strongly articulates this position and represents a near game-theoretic optimum for aligning incentives across protocol stakeholders.

Some questions that might help you during phase 1 of the deliverable:

  • What do you need to know as a new builder who has just entered the community and wants to engineer some workloads? Would it be helpful to have a dashboard with the GPUs and system spec(s) that each orchestrator supports?

  • How quickly can the proposed solution integrate metrics for new workloads? Will it need feature/workload-specific engineering, or is it integratable in a plug-and-play manner?

  • How should we aggregate data from maintainers of workloads? Should there be a standardized process where any new builder can set up an aggregation endpoint that the Cloud SPE can consume?

Honestly, great work with implementing all the feedback from the community. We would be very glad to support your effort.


Kudos to @MikeZupper and @speedybird for the rigorous effort put into shaping this proposal.

From my side, I can confirm that the proposal is strongly aligned with the Make Network Data More Observable roadmap item. It covers all of the must-haves and should-haves, and addresses most of the nice-to-haves outlined in that brief.

It is also worth emphasizing that the original roadmap item was intentionally high-level, meant to spark exploration and direction rather than define an exhaustive specification. Significant work has since been done by both the Cloud SPE team and the community to refine this into a concrete proposal, grounded in explicit needs identified for the network as a product initiative.

Overall, I believe this is strong work. The team is well-positioned for success, and any minor adjustments can be handled pragmatically during execution.


Great to see this next iteration. Getting the right scope here has meant dealing with a lot of complexity. Huge shout out to @MikeZupper and @speedybird for integrating feedback quickly and effectively.

Moving forward, I would encourage regular collaboration with Inc, the Foundation and a few other core contributors who see this piece of work as critical to the Livepeer network being a go-to, transparent network for video / AI compute. I know that @Mehrdad will be a great thought partner here for how we can keep the work transparent and accountable to the milestones.

Awesome job. Excited to see the proposal onchain. :fire:


Being the key stakeholder for this proposal (milestone 1) from Inc, I really appreciate the Cloud SPE team taking the initiative to respond to community review comments with both speed and open-minded flexibility.

The new proposal is in line with what the Inc (Daydream) project is looking for in terms of fundamental network observability. It is a very important step for both Inc and the community to have a measurable network foundation, so that gateway providers can bring SLAs to their users. All of this starts with a transparent, systematic, and extensible way to collect and consume the key AI job metrics, which is exactly what this second draft of the proposal tries to accomplish.

To ensure the accountability and credibility of all parties involved, the Foundation has helped us initiate a weekly developer chat to inform the community about what is happening, and we also hold a bi-weekly sprint review to ensure that progress is clearly demoable. The milestone key deliverables are also outlined very clearly by the Cloud SPE in the proposal.

On behalf of Inc, I endorse this proposal and am committed to working with the Cloud SPE to deliver it.


Thank you everyone for your feedback and support! After the discussion in the water cooler and the support expressed in the forum, we intend to put this up for a vote at the end of today or early tomorrow. We look forward to the next phase of this project and continued feedback from the community.


As an update, this pre-proposal has been promoted to a full Treasury proposal. Please participate before January 25th by casting your vote.

We are looking forward to the next steps for Livepeer and making a meaningful contribution to the roadmap!


Cloud SPE — Update #1

Period: January 1, 2026 – January 31, 2026

Status: On track

Summary:

Following the successful passage of our Treasury Proposal, we have established our core data infrastructure and completed a comprehensive inventory and quality validation of existing network data. These foundational steps ensure the accuracy of our analytics pipeline as we approach the completion of Milestone 1.

Completed Deliverables:

Milestone 1: Initial Infrastructure & Data Validation (partial delivery; due February 6)

Built the initial infrastructure to support data ingestion, processing, and deployment to the analytics stack, resulting in a basic functional end-to-end flow into an initial Grafana dashboard.

ETA for Next Update: March 1, 2026

Planned by Next Update:

  • End-to-End Dashboard: Deployment of a public Grafana dashboard displaying several key live analytics metrics.
  • Data Integrity: Completion of all data quality validation and automated capture processes.
  • ETL Pipeline: Significant progress on the Extract, Transform, Load (ETL) processing pipeline to handle complex data transformations.
  • Infrastructure Stress Test: Validation of the current infrastructure’s stability and scalability to handle increased data loads.

Cloud SPE — Update #2

Period: February 1, 2026 – February 28, 2026

Status: On track


Summary

During February, we progressed from foundational setup into active operational validation. With Iterations 1–4 now complete, the analytics stack is running real-time workloads and supporting live performance testing across multiple regions.

The core data pipeline, schema design, and validation work are now finalized, positioning the project to transition from alpha infrastructure into production-grade APIs and SLA measurement in the next phase.


Completed Deliverables

Milestone Progress: Infrastructure, Real-Time Testing & Pipeline Implementation

Expanded the analytics platform from initial infrastructure into a fully operational real-time testing and processing environment, enabling continuous measurement of orchestrator performance and AI workload characteristics.

  • Iterations Completed: 4 of 7 total iterations

  • AI Job Tester:
    Running real-time AI video job tests across SEA, MDW, and FRA regions

  • Grafana Dashboard (v2)

  • Data Layer:

    • Data validation and quality processes completed
    • Finalized schema design and query patterns
  • Processing Pipeline:

    • Apache Flink data pipeline designed and implemented
  • APIs (Alpha Release):

Note: All data is currently from the Cloud SPE AI Job tester and does not include production job data from Daydream.

These deliverables collectively establish the first fully integrated analytics loop:

test → ingest → process → visualize → query via API


GitHub Links:


ETA for Next Update

March 31, 2026


Planned by Next Update

  • Production Workload Ingestion
    Integrate Daydream data to reflect real application demand patterns

  • API General Availability
    Release finalized versions of all analytics APIs

  • SLA Scoring
    Deploy the production SLA scoring algorithm and provide API access

  • Gateway Performance Testing

    • Orchestrator swap rate analysis
    • Selection algorithm validation
  • Documentation (Drafts)

    • API specifications
    • Analytics pipeline architecture
    • Data schema and design guides
  • Production Readiness

    • Security hardening
    • Infrastructure scaling and performance optimizations

Are there plans to expand analytics coverage to other network pipelines, or is Cloud SPE currently focused exclusively on go-livepeer and Daydream/Scope workloads?


We definitely want to expand to other workloads. The initial scope is focused on Daydream (due April 10, 2026). The Cloud SPE has done a lot of work with custom BYOC pipelines that we plan to bring to the network, and analytics will be key for BYOC workloads.
