Fast Verification Rollout Post-Mortem 10/21/21

The Livepeer Inc. team has been working on enabling fast verification for transcoding on the network. A few weeks ago, go-livepeer 0.5.21 was released, which included orchestrator/transcoder support for computing MPEG-7 video signatures, the perceptual-hash algorithm used in the first version of fast verification. Since then, the Livepeer Inc. team has been developing the broadcaster implementation of the fast verification algorithm as described in the fast verification design.

On 10/21/21, the Livepeer Inc. team began running tests on the network using the latest broadcaster implementation of the fast verification algorithm. During these tests, a number of orchestrator operators reported that their orchestrators crashed, and after reviewing some of the provided error logs it became apparent that the orchestrators were encountering errors while computing video signatures on the GPU. The video signature capability had already been tested prior to release, but that testing had a few gaps:

  • The tests were only run on GPUs that the Livepeer Inc. team has access to
  • The tests were only run on GPUs in a Linux environment

While we are still investigating whether the GPU model affects the video signature capability, we do know that many of the errors orchestrators encountered were caused by a Windows-specific issue; a fix has already been implemented and is currently being tested.

Next Steps

The disruption to orchestrator operation caused by these fast verification tests was a problem, and the team will use the lessons from this experience to make the required fixes and roll out fast verification in a less disruptive way. The next steps are:

  • Complete the fix for the Windows-specific issue with the video signature capability for orchestrators/transcoders
  • Complete the investigation of any non-Windows-specific issues with the video signature capability reported by orchestrator operators
  • Run additional tests for the video signature capability in both Linux and Windows environments
  • Run another fast verification test on the network once the above steps are complete

If you observed any crashes and/or error logs containing "cudasign", it would be very helpful if you could share more information, such as additional error logs, either in this thread or on GitHub.
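For operators who want to check their logs, a quick way to pull out any matching lines might look like the sketch below. The log file name is an assumption (go-livepeer's log location depends on how you run it), and the sample line written by printf is synthetic, not real go-livepeer output; substitute your node's actual log file.

```shell
# Hedged sketch: collect cudasign-related lines from an orchestrator log.
# LOG is an assumed location -- point it at your node's actual log file.
LOG=${LOG:-orchestrator.log}
# Write a synthetic sample line (NOT real go-livepeer output) so the
# command below has input to scan; skip this step with a real log.
printf 'unrelated line\ncudasign: sample error line\n' > "$LOG"
# Grab matching lines case-insensitively, with three lines of context.
grep -i -C 3 "cudasign" "$LOG"
```

Redirecting the grep output to a file (`> cudasign-errors.txt`) gives a small attachment to share here or on GitHub instead of a full log.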

I was running into out-of-memory errors on my transcoders. 0.5.21 on Ubuntu. I am running some P1000s with 4GB memory. How much memory should we leave available for the verification processes?

Thanks for the information!

Based on previous tests we expected any GPU memory usage increase to be fairly negligible, so the dev team is now investigating these out-of-memory errors. Interestingly, in at least one case an out-of-memory error was reported while the memory usage reported by a Prometheus/Grafana setup was relatively low, which suggests some other problem may be occurring.

Will follow up when we know more. In the meantime, if you happen to have Prometheus/Grafana monitoring set up for GPU memory and you remember what the reported GPU memory usage was when the out-of-memory errors were triggered, that would be helpful information.
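For anyone without a Prometheus setup, a rough way to spot-check GPU memory is nvidia-smi, which ships with the NVIDIA driver. This is just a sketch: run it repeatedly (e.g. under `watch -n 5`, or in a loop appending to a file) while transcoding, so there is a reading from around the time an out-of-memory error appears.

```shell
# Hedged sketch: one-shot GPU memory reading via nvidia-smi.
# Repeat it (watch/cron/loop) while transcoding to capture usage
# near the moment an out-of-memory error is triggered.
if command -v nvidia-smi >/dev/null 2>&1; then
  # CSV row per GPU: timestamp, model name, used and total memory in MiB.
  nvidia-smi --query-gpu=timestamp,name,memory.used,memory.total --format=csv
else
  # No NVIDIA driver on this machine; nothing to sample.
  echo "nvidia-smi not found"
fi
```

The `memory.used` value at the time of the error is the number that would help the investigation above.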

Unfortunately I don’t have Prometheus set up on my remote transcoder node. Happy to set it up if it would help testing. I can provide all logs from the orchestrator that day, or a subset if that would be helpful. I will start saving logs on the transcoder as well.

I also have a 1650 SUPER, a 1660 (GDDR5), and a 3080 Ti.