The Livepeer Inc. team has been working on enabling fast verification for transcoding on the network. go-livepeer 0.5.21, released a few weeks ago, included orchestrator/transcoder support for computing MPEG-7 video signatures, the perceptual hash algorithm used in the first version of fast verification. Since then, the Livepeer Inc. team has been developing the broadcaster implementation of the fast verification algorithm as described in the fast verification design.
On 10/21/21, the Livepeer Inc. team began running tests on the network using the latest broadcaster implementation of the fast verification algorithm. During these tests, a number of orchestrator operators reported that their orchestrators crashed. After reviewing some of the provided error logs, it became apparent that the orchestrators were encountering errors while computing video signatures on the GPU. The video signature capability had been tested previously, but there were a few gaps:
- The tests were only run on GPUs that the Livepeer Inc. team has access to
- The tests were only run on GPUs in a Linux environment
We are still investigating the potential impact of the GPU model on the video signature capability. We do know that many of the errors that orchestrators encountered were caused by a Windows-specific issue. A fix for that issue has been implemented and is currently being tested.
Next Steps
The disruption to orchestrator operation caused by these fast verification tests was a problem. The team will use the lessons from this experience to make the required fixes and roll out fast verification in a less disruptive way. The next steps are:
- Complete the fix for the Windows-specific issue with the video signature capability for orchestrators/transcoders
- Complete the investigation of any non-Windows-specific issues with the video signature capability reported by orchestrator operators
- Run additional tests for the video signature capability in both Linux and Windows environments
- Run another fast verification test on the network once the above steps are complete
If you observed any crashes and/or error logs containing cudasign, it would be very helpful if you could share more information, such as additional error logs, either in this thread or on GitHub.
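For operators who want to pull the relevant lines out of a long log before sharing them, here is a minimal sketch in Python. It assumes you have saved your node's output to a text file; the function name `find_cudasign_errors`, the `context` parameter, and the file path passed on the command line are all hypothetical and not part of go-livepeer.

```python
import sys
from collections import deque
from typing import Iterable, List


def find_cudasign_errors(
    lines: Iterable[str], keyword: str = "cudasign", context: int = 2
) -> List[str]:
    """Return every line containing `keyword` (case-insensitive),
    preceded by up to `context` non-matching lines that came just before it."""
    recent = deque(maxlen=context)  # rolling window of preceding lines
    hits: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if keyword in line.lower():
            hits.extend(recent)  # include the lines leading up to the match
            hits.append(line)
            recent.clear()       # avoid repeating context for the next match
        else:
            recent.append(line)
    return hits


def main() -> None:
    # Hypothetical usage: python scan_log.py <path-to-saved-log>
    if len(sys.argv) < 2:
        print("usage: python scan_log.py <path-to-saved-log>")
        return
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for line in find_cudasign_errors(f):
            print(line)


if __name__ == "__main__":
    main()
```

The few lines of context before each match are kept because the line that mentions cudasign may not carry the full picture on its own. When sharing the output, it also helps to note your GPU model and operating system, since those are the two gaps identified above.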