Test Stream Post-Mortem 5/19/22

alex_livepeer · May 19, 2022, 11:03pm

Summary

In April there were a number of instances where test streams from multiple test broadcasters stopped being streamed. Eventually this affected nearly all LivePeer orchestrators across all availability regions. Fortunately, orchestrators alerted us that this was occurring and progressively increasing in severity, and they had logs and metrics that gave context on what was happening from their perspective.

It began when a small number of orchestrators noticed that their stream count had dropped severely (trending to zero). This context eventually led us to query the broadcasters' logs to understand why a “too many connections” error was showing up in large numbers. This error had not appeared in testing prior to the deployment of a new version of Go-LivePeer.

The error source was eventually identified by one of our core developers, Rafal, as a change linked to a specific commit. The commit was rolled back (removed), and once the branch with this commit removed had been deployed, the errors dissipated.

Next Steps

The disruption resulted from the limits of how we test Go-LivePeer releases, combined with a blind spot in our monitoring tools. In short, the errors were not visible in unit tests and only appeared after a node had been running for a while. The root of the issue is that test streams are generally treated differently from other “production-level” systems implemented by LivePeer. An example of a “production-level” system is the set of critical smart contracts necessary to migrate from L1 to L2.

In order to prevent similar incidents in the future:

- Implement an “incident escalation pipeline” to make communication from orchestrators more direct (in progress)
- Add specific alerting for large dropoffs in test streams
- Define an improved testing process for releases that may affect test-stream infrastructure
- Ensure we’re monitoring test-stream infra with proper granularity
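
As a rough illustration of the dropoff alerting item above, here is a minimal sketch in Go. It is not taken from Go-LivePeer or from our monitoring stack: the function name `dropoffDetected`, the sample values, and the 50% threshold are all hypothetical, and a real alert would more likely live in the metrics system than in application code.

```go
package main

import "fmt"

// dropoffDetected reports whether the latest stream count has fallen by at
// least maxDropFraction relative to the average of the preceding samples.
// Hypothetical helper for illustration only; not part of Go-LivePeer.
func dropoffDetected(counts []int, maxDropFraction float64) bool {
	if len(counts) < 2 {
		return false // not enough history to compare against
	}
	history := counts[:len(counts)-1]
	sum := 0
	for _, c := range history {
		sum += c
	}
	baseline := float64(sum) / float64(len(history))
	if baseline == 0 {
		return false // nothing was streaming, so there is nothing to drop from
	}
	latest := float64(counts[len(counts)-1])
	return (baseline-latest)/baseline >= maxDropFraction
}

func main() {
	// Made-up per-interval test-stream counts for one orchestrator.
	samples := []int{40, 42, 38, 5}
	if dropoffDetected(samples, 0.5) { // 0.5 is an assumed threshold
		fmt.Println("ALERT: test-stream count dropped sharply")
	}
}
```

Comparing against an average of recent samples rather than only the previous one keeps a single noisy reading from hiding a drop toward zero. The window length and threshold would need tuning per region.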

If you’re continuing to observe errors regarding test streams, or dropoffs associated with a specific region, please report them in this thread.

Thanks again to @papa_bear for initially reporting this error in this GitHub issue.

