Abstraction of the Livepeer Protocol to support more capabilities

This is the beginning of a thread to collect thoughts and feedback on a protocol improvement proposal. This informal discussion can be used to formalize it into an LIP if it is an idea that someone wants to champion.

Abstracting the Livepeer Protocol To Support More Capabilities

Currently, the Livepeer protocol defines a workflow for nodes to register on chain so that they can be discovered to perform video transcoding. There is also an on-chain verification mechanism, which can be enabled to ensure that they performed this transcoding correctly. Because only one on-chain verification function is baked into the protocol, the only work that can be secured via on-chain stake is a single capability: Livepeer Media Server specific transcoding.

The proposed idea here is to abstract the protocol a bit to allow nodes to register with a <capability, verificationFunction()> pair, such that the network can open up a bit to allow more video-specific capabilities secured by whatever verificationFunctions researchers and the market propose.
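To make the shape of such a pair concrete, here is a minimal Python sketch of what a node's registration might carry. All names (`CapabilityOffer`, `NodeRegistration`, `verifier_address`) and the placeholder addresses are hypothetical; nothing here reflects the current protocol or contracts, and modeling the verification function as a contract address is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CapabilityOffer:
    """One <capability, verificationFunction> pair advertised by a node (hypothetical)."""
    capability: str        # capability key, e.g. "transcoding"
    verifier_address: str  # assumed: address of a contract that implements the verification function


@dataclass
class NodeRegistration:
    """What a node might register on chain under this proposal (illustrative only)."""
    node_address: str
    offers: List[CapabilityOffer]

    def supports(self, capability: str) -> bool:
        # True if any advertised pair names this capability.
        return any(offer.capability == capability for offer in self.offers)


# Placeholder values, not real addresses.
node = NodeRegistration(
    node_address="0xNODE",
    offers=[
        CapabilityOffer("transcoding", "0xVERIFIER_A"),
        CapabilityOffer("object-detection", "0xVERIFIER_B"),
    ],
)
print(node.supports("object-detection"))
```

The point of the sketch is only that a node could advertise several pairs at once, each secured by a different verification function.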

What are some example capabilities that could be registered on the network, enabling Livepeer’s extensibility into the world’s open video infrastructure?

  1. Value-add video-processing services, such as object detection and scene classification.
  2. Multiple types of transcoding and video filters.
  3. P2P CDN Peer Orchestration
  4. Validation and verification services
  5. Storage gateways
  6. Etc…the list goes on


The two main benefits I see of abstracting the protocol at this stage are:

  1. More extensible platform with more capabilities beyond just transcoding.
  2. The verification problem for each of these capabilities gets moved into the open market rather than being protocol determined. This is helpful for non-deterministic types of tasks. Those who develop verification functions that give users economic security will benefit from more work, relative to those who offer capabilities with weak or nonexistent verification functions.

Open questions

There is likely much research to be done in order to open up the protocol a bit in this way. Here are a couple of open questions.

  1. Would slashing percentages also have to be offered by the nodes so that users can understand the economic security offered in the case of the provider failing verification? Currently this is set in the protocol for the single verification function, but if there are multiple capabilities and verification functions, they may have to be set by the market.

  2. Would there need to be a registry of available capabilities that nodes can register with? How do these get added to the protocol? Via governance I presume. But another option is that the capabilities could just live as conventions, and all clients could adhere to these conventions. An on chain registry is probably a better idea. Perhaps it could be open and write-only though, such that anyone could add a capability by just writing a new key into the registry.
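One way to picture the open registry option is the in-memory Python sketch below. It reads "write-only" as "anyone may add a new key, but existing keys cannot be overwritten or removed", which is an interpretation rather than something settled in this thread. The class name, the `spec_uri` field, and the example URI are all hypothetical.

```python
from typing import Dict


class CapabilityRegistry:
    """Open, add-only registry of capability keys (an in-memory stand-in for a contract)."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def add(self, key: str, spec_uri: str) -> None:
        # Anyone may add a new key; overwriting an existing key is rejected.
        if key in self._entries:
            raise KeyError(f"capability {key!r} is already registered")
        self._entries[key] = spec_uri

    def is_registered(self, key: str) -> bool:
        return key in self._entries

    def spec_for(self, key: str) -> str:
        # Where client implementors could find the published spec (hypothetical field).
        return self._entries[key]


registry = CapabilityRegistry()
registry.add("transcoding", "https://example.org/specs/transcoding")
print(registry.is_registered("transcoding"))
```

With a convention-only approach, the same mapping would live in client code instead of on chain, and nothing would prevent two clients from disagreeing about what a key means.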

Overall, I’m excited about taking the first steps to extend the Livepeer protocol in ways that will allow it to truly be the basis for the world’s open video infrastructure. Looking forward to feedback and discussion here, so that we can formalize this into an LIP.


I think this looks excellent. A neat way to open up the platform for more permissionless addition of services.

I’ve often enjoyed thinking about a Transcoder performing speech-to-text processing, enabling generation of live subtitles for a stream (and perhaps even translation into multi-language for a Consumer).

@dob would you (or someone) be able to describe how something like that might plug in to the framework described? Just to bring a real-life imaginable example to life e.g. how might a verificationFunction() work in such an example? Thanks.

Yes, captions and subtitles would be very cool examples of additional capabilities that a node could support. So how could this theoretically work, and what could a verification function look like?

How the capability might be added

  1. A node operator or developer could add a capability such as “WebVTT Captions” to the capabilities registry. They could publish a spec offline of what it means to adhere to this capability so that client implementors would know how to make use of it.
  2. They would implement this feature into the go-livepeer node, or a fork, or their own client.
  3. When they register as an Orchestrator, they could specify which capabilities they support. This way, Livepeer Broadcaster (B) nodes could discover them on chain when searching for nodes that support the “WebVTT Captions” capability.
  4. They would transact offline to perform the work in exchange for micropayments in the same way that nodes currently interact for transcoding. They would likely be transcoding + adding captions at the same time, providing both capabilities.
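The discovery step in the list above could look roughly like the following Python sketch, where a Broadcaster filters registered Orchestrators by the capabilities it needs. The `Orchestrator` record, the `price_per_unit` field, and the placeholder addresses are assumptions made for illustration; they are not taken from go-livepeer.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Orchestrator:
    """Hypothetical view of an on-chain Orchestrator registration."""
    address: str
    capabilities: FrozenSet[str]
    price_per_unit: int  # assumed pricing field, in arbitrary units


def discover(orchestrators: Iterable[Orchestrator], required: Iterable[str]) -> List[Orchestrator]:
    """Return Orchestrators that support every required capability, cheapest first."""
    needed = frozenset(required)
    matches = [o for o in orchestrators if needed <= o.capabilities]
    return sorted(matches, key=lambda o: o.price_per_unit)


registered = [
    Orchestrator("0xA", frozenset({"transcoding"}), 10),
    Orchestrator("0xB", frozenset({"transcoding", "WebVTT Captions"}), 12),
    Orchestrator("0xC", frozenset({"transcoding", "WebVTT Captions"}), 15),
]

# A Broadcaster that wants transcoding and captions in one pass.
candidates = discover(registered, ["transcoding", "WebVTT Captions"])
print([o.address for o in candidates])
```

Requiring both capabilities from one node matches the case in step 4, where the same Orchestrator transcodes and adds captions at the same time.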

In the most useful sense, this capability would be in go-livepeer by default, and all nodes running the default software could provide the capability in addition to transcoding.

What the Verification Function might look like

There are lots of options here. Let me give two that are on complete opposite ends of the spectrum.

Deterministic Truebit-style verification
In the case that there happens to be a deterministic algorithm that will always produce the exact same captions for a given segment of video, it is possible to use a smart contract judge to enforce that the work was done correctly using a Truebit-like protocol. In this case the node operator, by registering for this specific capability, may just be attesting that they are running the default implementation provided in go-livepeer - no more, no less. They say what code they will run, and it can be proven that they ran that code correctly.

Non-deterministic, adjudicated verification - Aragon Courts?
In the case where there is room for multiple implementations, or the implementations depend on the training data used on a specific model to generate the captions, there’s no way of telling via a machine whether the captions are “correct” or not. In this case, maybe a more social contract between the node operator and user would be useful. For example, a verification function could be set up that looks something like:

Disclaimer: I am not familiar with the inner workings of Aragon Courts so this is just a speculative idea, but hopefully it gets the spirit across.

  • A smart contract which is set up to automatically invoke an Aragon Court case based on the inputs given when a user submits a challenge.
  • The smart contract specifies the “social contract” that will be given to the jurors. Something like “The video contains very clearly audible content in language X, Y, or Z. The node operator did not maliciously output content with the intent to misrepresent the dialogue in the video. Missing or empty captions are ok and not a violation of the terms. Etc.”
  • If a user feels that they received a signed segment of video back from the node operator with malicious captions that were harmful to their experience, they could submit the challenge under these terms, and provide the segment of video as input to the court.
  • If the Aragon Court eventually determines the social contract was broken, then the node operator gets slashed. (Reminder: an open point above is whether node operators can indicate their slashing amount in the case of a failed verification function.)
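The challenge flow in the bullets above could be modeled roughly as the Python sketch below. It does not call any real Aragon Court interface: the `Ruling` values, the `Challenge` record, and the operator-advertised `slash_bps` (slash fraction in basis points) are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from enum import Enum


class Ruling(Enum):
    PENDING = "pending"    # jurors have not decided yet
    UPHELD = "upheld"      # jurors found the social contract was broken
    REJECTED = "rejected"  # jurors found no violation


@dataclass
class Challenge:
    """A user's challenge against a signed segment (illustrative fields only)."""
    operator: str
    segment_hash: str
    terms: str
    ruling: Ruling = Ruling.PENDING


def stake_after_ruling(stake: int, slash_bps: int, ruling: Ruling) -> int:
    """Return the operator's remaining stake once a ruling is known.

    slash_bps is the slash fraction in basis points (500 = 5%), assumed
    here to be advertised by the operator rather than fixed by the protocol.
    """
    if ruling is Ruling.UPHELD:
        return stake - (stake * slash_bps) // 10_000
    return stake  # pending or rejected challenges leave the stake untouched


challenge = Challenge("0xOPERATOR", "0xSEGMENT", "captions must not misrepresent the dialogue")
challenge.ruling = Ruling.UPHELD
print(stake_after_ruling(1_000, 500, challenge.ruling))
```

Keeping the slash fraction as an input makes it easy to compare the economic security that different operators would be offering.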

This may sound slow, and like overkill, but it can actually provide economic security under a certain set of agreed-upon parameters that allow people to trust and adhere to the protocol. Most importantly, as nodes advertise their <capability, verificationFunction> pair, where a verificationFunction is just something like a smart contract address, the market can develop and test the appetite for different verificationFunctions that provide different tradeoffs and levels of security.

Very cool.

As regards adapting the go-livepeer software, would they also need to adapt the Broadcaster software to allow for the discovery of the new features offered by the Orchestrator?

Also, do you ever see a world where Orchestrators might not actually be offering any Transcoding services, but perhaps subtitle Transcribing (input media, output text), i18n Translating (input text, output text[]), Flagging e.g. inappropriate content (input media, output boolean), Digital Remastering cleaning it up (input media, output media) etc.? Or do you think it needs to be a sufficiently “heavy” (hence “expensive”) workload to make it sensible for a B and O to contract?

As regards adapting the go-livepeer software, would they also need to adapt the Broadcaster software to allow for the discovery of the new features offered by the Orchestrator?

Yes, for a new capability to exist, client software would have to invoke it and make use of the results. This could be in go-livepeer itself, or it could be in another client built against the Livepeer protocol. Though I don’t underestimate the amount of work to recreate things that are already built for you in go-livepeer like all the blockchain interactions, PM implementation, networking, etc.

Also, do you ever see a world where Orchestrators might not actually be offering any Transcoding services…

Yes. I think it depends on the task. If it is something that is an independently sought after capability, then there’s no reason that someone couldn’t provide just that capability (especially if it required different hardware than transcoding). However there are a lot of synergies in terms of being able to perform multiple capabilities at the same time because you only have to send the video to one place and you only have to decode it once.

Some thoughts regarding verification. If I understand correctly, on a conceptual level the Capability is a pure function (it neither affects external state nor keeps internal state) that is repeatedly applied to the latest subsequence of the data stream. In the case of video, the subsequence is a chunk of a few seconds. If the Capability is, say, a face detector, it will produce face rectangles for each frame of the video. Thus, if the unit of work we need to verify consists of a fairly large number of such subsequences (a few-second chunk of video is hundreds of frames, while the subsequence required for face detection is 1 frame), the verification could be done statistically: sample a small number of subsequences, apply the same Capability function to them, and compare with the Transcoder’s results. While this does not provide a 100% guarantee that a specific unit of work was performed correctly, it does not require implementing any capability-specific verification logic, because Capability=verificationFunction, and it adds only a fraction of the original task’s complexity as overhead.
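As a rough illustration of the sampling idea, the Python sketch below spot-checks a few frames and also computes how likely a spot check is to catch a cheater. The function names and the numbers (240 frames, 24 bad, 20 sampled) are made up for the example, and the exact-equality comparison assumes the capability is deterministic.

```python
import math
import random
from typing import Callable, Sequence


def detection_probability(n: int, bad: int, k: int) -> float:
    """P(at least one bad item is hit) when sampling k of n items without replacement."""
    if not 0 <= bad <= n or not 0 <= k <= n:
        raise ValueError("need 0 <= bad <= n and 0 <= k <= n")
    # Complement of drawing all k samples from the n - bad good items.
    return 1.0 - math.comb(n - bad, k) / math.comb(n, k)


def spot_check(frames: Sequence, claimed: Sequence, capability: Callable,
               k: int, rng: random.Random) -> bool:
    """Re-run the capability on k randomly chosen frames and compare with claimed results."""
    indices = rng.sample(range(len(frames)), k)
    # Exact equality: only appropriate if the capability is deterministic.
    return all(capability(frames[i]) == claimed[i] for i in indices)


# Toy capability standing in for something like a face detector.
def toy_capability(frame: int) -> int:
    return frame * 2


frames = list(range(240))
honest = [toy_capability(f) for f in frames]
print(spot_check(frames, honest, toy_capability, k=20, rng=random.Random(0)))
print(detection_probability(n=240, bad=24, k=20))
```

The second number shows the tradeoff: re-running a small fraction of the frames already catches a node that faked 10% of them most of the time, and the verifier can raise `k` if it wants stronger assurance.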

Something to consider when the capability function == verification function is whether the capability function is deterministic. Transcoding on a GPU is an example of a non-deterministic capability function which is why much of the recent research into the verification function has focused on video metrics where capability function != verification function.

AFAIK deep learning model (which presumably would be the basis for capabilities like face detectors, object detectors, etc.) inference on GPUs can also be non-deterministic although there may be ways to force determinism - see this presentation.

That being said, unlike with transcoding, where outputs would always need to be identical given the same inputs in order for transcoding to also serve as the verification function, it might be ok for model inference results to not be identical for the same inputs if the behavior of the models is statistically very similar. In this case, you could still use the same model for the verification function, but use the output of the model to flag a result signed by an orchestrator for the human operator of the broadcaster to review. While this workflow would not be completely machine automated, it does acknowledge that machines cannot determine the objective truth 100% of the time. The ultimate priority with a lot of these video capabilities is to generate some output that is subjectively appealing to the human eye, so having humans as a possible arbiter for dispute resolution may make some sense as well (as mentioned in the adjudicated verification example in the OP).

@yondon thanks for sharing this presentation, really interesting read. I think it is focused mainly on reproducibility of the model’s internal state during the training process. While inference on a GPU is not a strictly deterministic process either, in my experience it’s not hard to tell whether predictions from multiple model runs are the “same” by just comparing them up to a small tolerance. Such tolerances could be set manually, or estimated automatically for user-defined capabilities. What probably won’t be feasible is comparing some sort of hashes computed directly on raw predictions: they won’t match for predictions like ‘Panda: 0.93245’ and ‘Panda: 0.93222’.
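A small Python sketch of the difference, using the two Panda scores above. The helper names and the tolerance of 1e-3 are arbitrary choices for illustration; a real capability would need its tolerance set or estimated as described.

```python
import hashlib
import math
from typing import Dict


def digest(prediction: Dict[str, float]) -> str:
    """SHA-256 over the raw prediction values: any tiny numeric difference changes it."""
    payload = repr(sorted(prediction.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def same_within_tolerance(a: Dict[str, float], b: Dict[str, float],
                          abs_tol: float = 1e-3) -> bool:
    """Treat two predictions as the same if labels match and scores differ by at most abs_tol."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[label], b[label], rel_tol=0.0, abs_tol=abs_tol) for label in a)


run_1 = {"Panda": 0.93245}
run_2 = {"Panda": 0.93222}

print(digest(run_1) == digest(run_2))       # hashes of raw scores disagree
print(same_within_tolerance(run_1, run_2))  # scores agree up to the tolerance
```

Rounding the scores before hashing would not fully solve this either, since two values that are very close can still fall on opposite sides of a rounding boundary.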