Test Stream Post-Mortem 5/19/22

Summary

In April, there were a number of instances where test streams from multiple test broadcasters stopped being streamed. Eventually this affected nearly all LivePeer orchestrators across all availability regions. Fortunately, orchestrators alerted us that this was occurring and progressively increasing in severity, and their logs and metrics provided context on what was happening from their perspective.

It began when a small number of orchestrators noticed that their stream count had dropped severely (trending to zero). This context eventually led us to query logs from broadcasters to understand why a “too many connections” error was showing up in large numbers. This error hadn’t occurred in testing prior to the deployment of a new version of Go-LivePeer.

The error source was eventually identified by one of our core developers, Rafal, as a change linked to a specific commit. The commit was rolled back (removed), and once the branch with this commit removed had been deployed, the errors dissipated.

Next Steps

The disruption was a result of the limits of testing Go-LivePeer releases in conjunction with a blind spot in our monitoring tools. In short, the errors were not visible in unit tests, and only appeared after a node had been running for a while. The root of the issue is that test-streams are generally treated differently than other “production-level” systems implemented by LivePeer. An example of a “production-level” system is the set of critical smart contracts necessary to migrate from L1 to L2.

In order to prevent similar incidents in the future:

  • Implement an “incident escalation pipeline” to make communication from orchestrators more direct
    • <in progress>
  • Add specific alerting for large dropoffs in test-streams
  • Define an improved testing process for releases that may affect test-stream infrastructure
  • Ensure we’re monitoring test-stream infra with proper granularity
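
As a rough illustration of what alerting on large dropoffs could look like, here is a minimal sketch in Python. It is not the monitoring that is actually deployed: the class name `DropoffDetector`, the rolling-window size, and the 50% drop threshold are all assumptions chosen for the example. It compares each new stream-count sample for an orchestrator or region against the average of recent samples and flags a sharp fall.

```python
from collections import deque
from statistics import mean


class DropoffDetector:
    """Hypothetical helper: flags a large drop in test-stream count
    relative to a rolling baseline of recent samples."""

    def __init__(self, window: int = 12, drop_ratio: float = 0.5,
                 min_baseline: float = 1.0):
        # window, drop_ratio and min_baseline are assumed values, not tuned ones.
        self.samples = deque(maxlen=window)
        self.drop_ratio = drop_ratio
        self.min_baseline = min_baseline

    def observe(self, stream_count: int) -> bool:
        """Record one sample; return True if it is a large drop vs. the baseline."""
        alert = False
        # Only evaluate once the window is full, so startup does not alert.
        if len(self.samples) == self.samples.maxlen:
            baseline = mean(self.samples)
            threshold = baseline * (1 - self.drop_ratio)
            if baseline >= self.min_baseline and stream_count < threshold:
                alert = True
        # Keep alerting samples out of the baseline so that a sustained
        # outage (count trending to zero) continues to alert.
        if not alert:
            self.samples.append(stream_count)
        return alert


detector = DropoffDetector(window=3, drop_ratio=0.5)
results = [detector.observe(count) for count in [10, 10, 10, 9, 2, 0]]
print(results)
```

Excluding alerting samples from the baseline is a deliberate choice in this sketch: otherwise a stream count that stays near zero would quickly become the new "normal" and the alert would clear while the outage was still ongoing.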

If you’re continuing to observe errors regarding test-streams or dropoffs associated with a specific region, please report them in this thread.

Thanks again to @papa_bear for initially reporting this error in this GitHub issue.
