Some thoughts on Orchestrators collating metrics

Metrics all Orchestrators wish they had access to are demand by region, plus the current average and maximum streams and the number of broadcasters per region. These appear innocuous enough, but starting a community-driven effort to collate them bears scrutiny, discussion and majority agreement within the Orchestrator community.

Having discussed issues around this with other Orchestrators (O’s), it appears that, besides the two metrics above, we all seek different metrics from the Livepeer network. Thankfully the Prometheus remote write feature (also implemented by the Grafana Agent) enables zero-trust and selective metric sharing, and is already part of the stack used by all O’s today.
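For illustration, here is a minimal remote_write sketch that shares only an allowlisted set of metrics with a community aggregator. The aggregator URL is a placeholder and the livepeer_* metric names are assumptions; check your node’s /metrics output for the real ones:

```yaml
# prometheus.yml (orchestrator side) - opt-in, selective sharing
remote_write:
  - url: "https://metrics.example.org/api/v1/write"  # placeholder aggregator
    write_relabel_configs:
      # Forward ONLY the agreed-upon minimal set; everything else stays local.
      - source_labels: [__name__]
        regex: "livepeer_(current_sessions|max_sessions|broadcaster_count)"  # hypothetical names
        action: keep
```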

  1. A minimal set that can answer the two most sought-after metrics (above) should be all that is required of any Orchestrator to opt in. Any other metrics are optional. No metric should require root access or new untrusted binaries to collect, and no O should be expected to open a port to share metrics.

  2. This effort should be “By Orchestrators, For Orchestrators”.
    O’s should decide what information casual visitors to the site are allowed to view; it should be anonymized, with geo-location accuracy reduced by 50-100 miles.
    Only O’s should be able to log in (Metamask?) and see the rest of the metrics, at full accuracy.
    This effort should not collect this data to share or sell to anybody, ever.
    It should collect the fewest metrics by default, letting O’s share more if they so desire.

  3. If this is to serve as a community-driven effort, we ought to collect metrics from sources other than the livepeer binary, like the kernel / NIC driver etc., and ensure that metric types common to all of them do indeed converge as expected (a sketch follows this list).
    For example, if livepeer reports it has transcoded 1 hour of video (10 streams at 1080p + 10 streams at 720p), the traffic on the Orchestrator port as reported by the kernel / NIC driver should converge with the figure extrapolated from livepeer.
    The data transferred, as reported by all sources, should agree within a 1-2% error margin.
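As a sketch of that convergence check, a Prometheus recording-rule file could compute the ratio of the two sources continuously. node_network_transmit_bytes_total is the real node_exporter metric; the livepeer-side byte counter name is an assumption:

```yaml
# convergence-rules.yml
groups:
  - name: orchestrator_convergence
    rules:
      # Bytes/s leaving the NIC that serves the Orchestrator port (node_exporter).
      - record: orch:nic_tx_bytes:rate5m
        expr: rate(node_network_transmit_bytes_total{device="eth0"}[5m])
      # Bytes/s the livepeer node claims to have sent (hypothetical metric name).
      - record: orch:livepeer_tx_bytes:rate5m
        expr: rate(livepeer_transcoded_bytes_sent_total[5m])
      # Should hover between ~0.98 and ~1.02 if the sources converge within 1-2%.
      - record: orch:tx_convergence_ratio
        expr: orch:livepeer_tx_bytes:rate5m / orch:nic_tx_bytes:rate5m
```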

“Introducing Prometheus Agent Mode, an Efficient and Cloud-Native Way for Metric Forwarding” on the Prometheus blog is an interesting read that touches on some of the challenges and pitfalls of a global effort like this.


I am going to preface my comments by saying that the mechanics of how this would work are well beyond my area of expertise.

While I understand this is being proposed as a community-driven effort, I’m concerned about the value of the data if this is done on an opt-in basis. If adoption isn’t 100%, which I don’t think it will be, it seems to me that the data could be skewed in a way that produces metrics that are not only incomplete but possibly misleading, defeating the goal.

My preference is to keep the location accuracy limited to region. Geo-location with metrics is probably more than I’d like to share.

I can see why certain metrics would be “nice” to have visible network-wide, but again I don’t know how valuable the information will be if we are using things like number of sessions or minutes transcoded, which IMO aren’t very useful without knowing the resolutions (or number of pixels) of each session, e.g. one 1080/720/480/360p ladder is more work than four 480p sessions.
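To put rough numbers on that (assuming the standard 1920×1080, 1280×720, 854×480 and 640×360 frame sizes):

```
1920×1080 + 1280×720 + 854×480 + 640×360 ≈ 3.64 Mpx per frame  (one full ladder)
4 × (854×480)                            ≈ 1.64 Mpx per frame  (four 480p sessions)
```

So one full ladder is over twice the per-frame work of four 480p sessions, yet both would count as the same number of “minutes transcoded”.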

I didn’t hear the entire call where this idea was proposed, but if I understood correctly the main intent was to give operators a sense of overall Livepeer traffic, so that an orchestrator that isn’t receiving streams can check whether the network itself is down. I think this may be of interest to new orchestrators, but for that purpose alone I don’t see this being worth the effort.

Who will be tasked with administering the effort? Would it be the group that opts in, and how will the data be kept secure from being shared outside of the agreed-upon use?

I very well could be missing the point so please comment if I’m not understanding what we will gain from this.

As an orch, would you be more willing to share metrics if the aggregate data were only available to the orchs that also choose to share their metrics?

Some fair Qs here; here are my thoughts.
Who should collect these metrics then, Livepeer.org? What would be their justification for it?
Of the 100 O’s that can receive work, minus the few that do not transcode video, there are fewer than 40 O’s on Discord when it’s busy. It’s safe to assume that 100% opt-in is a pipe dream, and that’s OK.

Let’s take an example of a currently busy region like Europe, with two large O’s (stake and capacity) and five smaller O’s. Let’s assume the two large O’s do not opt in, but four of the five smaller O’s do. Since we know the two large O’s have not opted in and are likely receiving most of the work, we can extrapolate some information (e.g. ETH payouts to the large O’s) from the metrics received from the four smaller O’s. It may not be 100% accurate, as we are missing data from two O’s, but that does not make it misleading.
If the metrics are “incomplete”, the site can note this to the O or viewer while presenting them.
This is not a paid consulting service for prospective new O’s to choose locations, though it is a good starting point. We are merely presenting the available data, with its caveats, which is better than no data, which is what we have today.

It is a misconception that geo-location of IPs is some sort of secret. None of it is secret, by design.

IP addresses are managed by five Regional Internet Registries (RIRs), one for each region. IP addresses are owned in blocks by network operators (ISPs/datacenters), who are assigned these chunks of IPs by the RIRs under an Autonomous System (AS). You can query the RIR’s database to find the operator, and use a third-party service like MaxMind to geo-locate the IP to within roughly 100 km. This has existed since the RIRs came into existence, long before Livepeer did. It is how law enforcement knows which ISP to send a notice to when a judge approves an investigation.
Add to this the fact that an O’s hostname / IP is registered on-chain.
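As an illustration of how routine this is, here is a minimal sketch of geo-locating an IP with MaxMind’s free GeoLite2 City database via the github.com/oschwald/geoip2-golang library; the database path is a placeholder and the IP is a documentation address:

```go
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/oschwald/geoip2-golang"
)

func main() {
	// Open a local copy of MaxMind's GeoLite2 City database (placeholder path).
	db, err := geoip2.Open("GeoLite2-City.mmdb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	record, err := db.City(net.ParseIP("203.0.113.7")) // documentation IP
	if err != nil {
		log.Fatal(err)
	}
	// City-level lookups are typically accurate to the order of 100 km.
	fmt.Println(record.Country.Names["en"], record.City.Names["en"],
		record.Location.Latitude, record.Location.Longitude)
}
```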

If you ask different orchestrators which collective metrics they wish to see, outside of some common metrics, everyone wants something different, at least based on the half-dozen O’s I spoke with.
While this data might seem useful to new O’s, I’d argue it’s more important to existing, established O’s with large stake wishing to increase their presence in places with emerging traffic, or to add capacity in locations they already operate from. If an O feels it’s not worth the effort, they simply don’t opt in; nothing changes for that O, they merely don’t benefit from viewing the collated metrics. This is a long-term effort and should not be viewed through the myopic lens of just Livepeer.com’s traffic, which is all there is currently.

That is what discussions like these are for: to gather opinions. Who do you feel should be tasked with this effort? Who will fund the development and infrastructure costs? How do we ensure transparency and accountability for the collected metrics?

To sum up: currently the only information O’s have is an approximation of the number of GPUs available and the % of their current usage. That is abysmal network information for an open public project, and it could be improved. As the generators of the metrics, the O’s are best positioned to collect and share them as they see fit.

Yes, I am proposing that should be the default policy.

While I understand that, in the short term, we would be able to leverage Prometheus to aggregate this data, I agree that opt-in for the metrics wouldn’t be 100%, and as such the data would be next to useless.

However, as Livepeer is the one pushing client changes and developing the application we use for orchestrating, I do not think it is an unreasonable assumption that they could integrate anonymous data reporting into the application. The only open question is how this data gets consolidated. The approach that makes sense, but is taboo in crypto, is to have Livepeer consolidate and centralize the data and create a dashboard or provide an API to the data it collects. Again, as Livepeer is the developer/company behind this project, I don’t really see the harm in them having network data like this.

I also feel that base data (# of streams being handled, resolution of streams, country of origin (or egress)) and other data without identifiers (no identifying information about the B/O/T) should be required.

This information has the most impact for Orchestrators with the least impact on privacy: if it’s anonymous and contains no IP addresses, names etc., it raises no privacy concerns.

While we like to say that Livepeer is a company of the people, it still is a company and runs the base of this project, and as such can impose this data collection. The data is already available inside the livepeer client; Prometheus just picks it up for Grafana reporting.

Country of origin would likely be enough data to make assumptions without the need for anything more granular.

I think this portion may be the wrong mindset. It’s basically saying “if I am already set up and running, then what do I care if this exists?”. While this would primarily be useful for people who are getting set up or are trying to figure out stability issues, it could have other, as yet unidentified benefits beyond helping new users troubleshoot and gain insight. More good data is almost always a value add. Like Strykar said, for medium/larger O’s this could be helpful when looking at expansion. Knowing that the majority of traffic is currently coming out of Germany, which we found out during the last water cooler, prompts the urgency to figure out a European O/T setup, for example.

First of all, thank you all for engaging with these ideas - great stuff. I am mostly staying on the sidelines here, as I believe this should be an orchestrator-driven effort, but I want to flag and clarify a few points:

where should this data live

I mentioned this to strykar in DM, but I would eventually be interested in rolling this into the explorer (accessible only to Os)

However, as Livepeer is the one pushing client changes and developing the application we use for orchestrating, I do not think it is an unreasonable assumption that they could integrate anonymous data reporting into the application.

it still is a company and runs the base of this project, and as such can impose this data collection.

It’s certainly true that Livepeer Inc is the primary (and often sole) entity leading development on the Livepeer Network at this point in time, but it’s worth noting that all of the software you’re discussing is open source. While I understand that Livepeer Inc can seem monolithic, there is no reason the orchestrator community couldn’t implement this exact change without relying on Livepeer Inc.

If the O community is (1) willing to put in the work to set up these dashboards and (2) concerned about opt-in and would rather have the node software collect and anonymize data, why not submit a PR to update the node software in go-livepeer?

To elaborate on the relationship between the community and go-livepeer / the explorer / etc.: if the Livepeer Network grows as rapidly as we hope, we are ALWAYS going to have this exact bandwidth problem, where there are more requests from the B/O/T communities than LP Inc has the resources to address. In my view this is a great example of a case where LP Inc can support the community but not own the implementation.

This is 100% incorrect :slight_smile:
Can you identify the network issue in the packet capture below?


Here is the packet capture for download - https://drive.google.com/file/d/111G3dTXDZfLABkeqLqMTpGtzwYTgI24n/view?usp=sharing
If not, at least share the factors that lead you to believe the incomplete metrics would be useless.

Inference is used every day in network analysis and forensics, for everything from capacity planning to identification.
Packet analysis is a science, not guesstimation.
Inferring data can be both easy and hard depending on the metric, but since the limitations are known, we build a picture around those caveats.

Having livepeer.org collect metrics is a slippery slope; they also run livepeer.com. Who will watch the watchmen?

Agreed, the base metrics exported should be useful enough for all O’s.

I feel O’s (due to current economics) have a blinkered view of Livepeer; we should look at it as a protocol, not a spinoff of Livepeer.com.

My original post seems to be caught by the spam bot. Woops.

Anyways…

I think there are two discussions here:

  1. How can Orchestrators better make sense of their own data? (new metric sources, new aggregates, alternative queries of existing livepeer metrics, etc)
  2. What would a community data analytics environment look like?

While there is absolutely value in collecting these additional metrics, the intent should not be to verify what livepeer reports. We are 100% capable of validating what the code reports in a much simpler and less error-prone way: it’s open source, so let’s take advantage of that.

I agree with Strykar on this one: we can still extract a lot of value through inference. We have the tools available to identify the location of nearly every single Orchestrator along with their stake. Using that, we can make some solid guesses about sessions.

IMO anonymizing data should be on us, and we should keep the livepeer client as simple as possible. Our community efforts should be decoupled from livepeer. It’s worth noting that the livepeer client will not be the only client in the future; as the community grows, other client implementations will pop up.

Does this mean that Livepeer would host the infra for ingestion? :eyes:

Just re-emphasizing that I’m not a huge fan of this option, because I feel the livepeer client should be as simple as possible and stick to its core purpose (B/O/T operations… and I guess R [redemption] too).

I get the logic of what you’re saying. I would argue that it should also facilitate effective orchestrator workflows and generally help orchestrators achieve better quality of life, though

Does this mean that Livepeer would host the infra for ingestion?

i’m not sure what this would look like… frankly i’d prefer for it to use decentralized hosting / storage

I agree, what are our options to implement this as a decentralized service?

Yep, I hope some livepeer devs chime in here, I don’t suppose they’re gonna be chuffed to hear this feature request.

I don’t think submitting these stats to a central location is the right approach. It’d make much more sense to me if the HTTP RPC port that is already open on orchestrators optionally also made available a summary of the jobs handled over the past few hours. That would allow anyone to simply grab the list of orchestrators, contact them all periodically, and build their own view of the jobs the network has handled. If we make this endpoint compatible with, say, Prometheus’ metrics format, this could be almost trivially easy to set up, for anyone.
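As a sketch of what that endpoint could look like, here is a minimal Go example using the standard Prometheus client library; the metric name, label and port are assumptions for illustration, not anything go-livepeer exposes today:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical summary metric: seconds of video transcoded, per rendition.
var transcodedSeconds = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "orch_transcoded_seconds_total", // assumed name, not go-livepeer's
		Help: "Seconds of video transcoded by this orchestrator, by rendition.",
	},
	[]string{"rendition"},
)

func main() {
	prometheus.MustRegister(transcodedSeconds)
	transcodedSeconds.WithLabelValues("1080p").Add(3600) // stand-in for real accounting

	// Anyone can scrape this endpoint with a stock Prometheus instance.
	http.Handle("/public-metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9091", nil)) // assumed port
}
```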

Then, Livepeer Inc could make available a public copy of these stats for convenience, but anyone that doesn’t want to be dependent on this or doesn’t trust them to be accurate enough can collect the same stats themselves.

Yes, a single node could theoretically skew the statistics here by supplying false information, but verification is possible to some degree by comparing the reported stats with the (winning) tickets on-chain. Over a long enough period of time, those two should converge.

It’s important to stress that while orchestrators are already public anyway, broadcasters and their streams are not. As such, I propose that the statistics on transcode jobs be anonymized to some degree: instead of reporting source IP addresses, source locations could be geo-located and binned to, say, 5-degree buckets in both longitude and latitude (see the sketch below). That should provide enough precision to help inform new locations for orchestrators/transcoders without revealing so much information that potential users of the network are scared away by their usage becoming public.
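The binning itself is trivial. A minimal sketch, snapping a geo-located source to the south-west corner of its 5-degree bucket before it is ever reported:

```go
package main

import (
	"fmt"
	"math"
)

// bin snaps a coordinate to the lower edge of its bucket of the given size.
func bin(coord, size float64) float64 {
	return math.Floor(coord/size) * size
}

func main() {
	lat, lon := 52.3676, 4.9041 // e.g. Amsterdam
	// Report only the bucket corner, never the precise location.
	fmt.Println(bin(lat, 5), bin(lon, 5)) // -> 50 0
}
```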

Ideally, this could be used to show a world map of where transcode jobs are coming from, and highlight “gaps” in network coverage. :star_struck:
