Protocol Bug Fix & Upgrade Summary #1

Almost one month into Livepeer’s Snowmelt Alpha release, the first bug in the live protocol smart contracts was discovered. Refer to technical details here. This was a non-critical bug that was not exploited. It was discovered internally and patched within a 24-hour period. Issues like this are expected during the alpha and will likely occur again, and they are exactly why the protocol is pausable and upgradable at this early phase. This post aims to provide transparency around the process used to discover, assess, and fix the issue, so that everyone can participate in and improve upon this process going forward.

The Issue
In Livepeer, transcoders must bond token towards themselves to become registered and active. However, after they had bonded, the protocol still allowed them to delegate their stake towards another transcoder, leaving nothing at stake on their own node. This created several unexpected accounting side effects and, because of a bug, one opportunity that a malicious actor could exploit.

Please review the full technical details to understand the accounting side effects. In short, three related issues surfaced because transcoders were able to delegate away from themselves:

  • If the original transcoder unbonded, their stake would not be subtracted from the node they had delegated towards. (This could be exploited to earn more inflationary LPT than deserved.)
  • The transcoder who re-delegated could claim their token rewards as a delegator, but could not access the rewards earned by their own transcoding node.
  • Those unclaimable transcoding node rewards could never be withdrawn, but would still count as stake for that node forever.
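To make the first issue concrete, here is a minimal Python sketch of the pre-fix stake accounting. It is a toy model under stated assumptions, not the actual Livepeer contracts: the class ToyBondingManager, its method names, and the assumption that unbond took a "resign only" branch for registered transcoders are all hypothetical, chosen only to reproduce the behavior described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Delegator:
    bonded_amount: int = 0           # stake this address has bonded
    delegate: Optional[str] = None   # address this stake is delegated to
    delegated_amount: int = 0        # total stake currently delegated *to* this address


class ToyBondingManager:
    """Hypothetical, simplified model of the pre-fix behavior (not the real contract)."""

    def __init__(self) -> None:
        self.delegators: Dict[str, Delegator] = {}
        self.registered: Set[str] = set()  # addresses registered as transcoders

    def _get(self, addr: str) -> Delegator:
        return self.delegators.setdefault(addr, Delegator())

    def register_transcoder(self, addr: str) -> None:
        d = self._get(addr)
        if d.delegate != addr or d.bonded_amount == 0:
            raise ValueError("a transcoder must bond to itself before registering")
        self.registered.add(addr)

    def bond(self, addr: str, amount: int, to: str) -> None:
        # Pre-fix: nothing stops a registered transcoder from delegating elsewhere.
        d = self._get(addr)
        if d.delegate is not None:
            # Move the existing stake away from the old delegate.
            self._get(d.delegate).delegated_amount -= d.bonded_amount
        d.bonded_amount += amount
        d.delegate = to
        self._get(to).delegated_amount += d.bonded_amount

    def unbond(self, addr: str) -> None:
        d = self._get(addr)
        if addr in self.registered:
            # Assumed pre-fix branch: a registered transcoder is resigned, but its
            # stake is never removed from whoever it currently delegates to.
            self.registered.discard(addr)
        else:
            self._get(d.delegate).delegated_amount -= d.bonded_amount
        d.bonded_amount = 0
        d.delegate = None


m = ToyBondingManager()
m.bond("A", 100, "A")
m.register_transcoder("A")
m.bond("B", 100, "B")
m.register_transcoder("B")
m.bond("A", 0, "B")   # transcoder A re-delegates its 100 to B
m.unbond("A")         # A withdraws its 100...
print(m.delegators["B"].delegated_amount)  # prints 200: A's stake still counts for B
```

In this model B keeps earning on 200 units of stake even though only its own 100 remain bonded behind it, which is the kind of phantom stake that could be used to earn more inflationary LPT than deserved.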

The Effect
One node in the network ended up in this situation unintentionally. Fortunately, they had not called the unbond method, which would have unknowingly triggered the stake accounting issue. We discovered the error internally after they reported inconsistent behavior. The effects were:

  • We had to deploy the protocol update before they unbonded, to keep the stake accounting from becoming inconsistent.
  • The node was unable to access some of the reward token it would have earned as a transcoder while it was not delegated towards itself. (See the full technical recap of the issue for the details.)

The Process
Upon discovering this issue, several core team members followed this process:

  • Identify top priorities - protect protocol user value and trust in the protocol.
  • Analyze the issue to understand it.
  • Assess the various options for addressing the issue, including deciding whether to pause the protocol.
  • Pause the protocol.
  • Communicate the pause publicly, along with the steps for resuming, assuming a 2-day pause window with the possibility of resolving it sooner.
  • Review and test the proposed fix - in this case a change of a couple of lines of code - on the Rinkeby test network.
  • Deploy the fix.
  • Verify.
  • Unpause the protocol.
  • Communicate the resumption of the network - this occurred about 14 hours after the initial pause. No protocol rounds were missed.

The Fix
To address these issues, a small protocol update was proposed:

  • Transcoders cannot delegate their stake towards anyone else without first resigning their node.
  • When a delegator unbonds, their stake is always subtracted from the transcoding node they delegate to, even when the delegator is also a transcoder.
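The two rules above can be sketched with a small, hypothetical toy model in Python. ToyBondingManagerFixed and its methods (bond, unbond, register_transcoder, resign_transcoder) are illustrative names, not the real contract interface: bond rejects a registered transcoder delegating elsewhere until it resigns, and unbond always removes the stake from the current delegate.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Delegator:
    bonded_amount: int = 0           # stake this address has bonded
    delegate: Optional[str] = None   # address this stake is delegated to
    delegated_amount: int = 0        # total stake currently delegated *to* this address


class ToyBondingManagerFixed:
    """Hypothetical, simplified model of the two post-fix rules (not the real contract)."""

    def __init__(self) -> None:
        self.delegators: Dict[str, Delegator] = {}
        self.registered: Set[str] = set()  # addresses registered as transcoders

    def _get(self, addr: str) -> Delegator:
        return self.delegators.setdefault(addr, Delegator())

    def register_transcoder(self, addr: str) -> None:
        d = self._get(addr)
        if d.delegate != addr or d.bonded_amount == 0:
            raise ValueError("a transcoder must bond to itself before registering")
        self.registered.add(addr)

    def resign_transcoder(self, addr: str) -> None:
        self.registered.discard(addr)

    def bond(self, addr: str, amount: int, to: str) -> None:
        # Rule 1: a registered transcoder cannot delegate away from itself.
        if addr in self.registered and to != addr:
            raise PermissionError("resign the transcoder before delegating elsewhere")
        d = self._get(addr)
        if d.delegate is not None:
            self._get(d.delegate).delegated_amount -= d.bonded_amount
        d.bonded_amount += amount
        d.delegate = to
        self._get(to).delegated_amount += d.bonded_amount

    def unbond(self, addr: str) -> None:
        d = self._get(addr)
        # Rule 2: always remove the stake from the current delegate,
        # even when the caller is (or was) a transcoder.
        if d.delegate is not None:
            self._get(d.delegate).delegated_amount -= d.bonded_amount
        self.registered.discard(addr)
        d.bonded_amount = 0
        d.delegate = None


m = ToyBondingManagerFixed()
m.bond("A", 100, "A")
m.register_transcoder("A")
m.bond("B", 100, "B")
m.register_transcoder("B")
try:
    m.bond("A", 0, "B")
except PermissionError:
    pass  # rejected: A is still a registered transcoder
m.resign_transcoder("A")
m.bond("A", 0, "B")   # allowed once A has resigned
m.unbond("A")
print(m.delegators["B"].delegated_amount)  # prints 100
```

In this sketch, checking the rule in bond (rather than only patching unbond) prevents the state that caused all three side effects, not just the exploitable one, while the unbond change keeps the accounting correct regardless of the caller's role.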

The Impact

  • No one was able to exploit this issue for personal, malicious gain.
  • A single transcoding node did miss out on several rounds' worth of rewards that it otherwise would have earned had it not delegated away from itself, or had the accounting logic handled this case correctly.
  • Going forward, since delegating away from oneself as a transcoder should now be impossible, this issue should not affect anyone.

Bug Bounty
The affected node will be receiving a bug bounty for a “Medium” issue, as they helped surface the issue and complied with a responsible disclosure protocol. Thank you!

Lessons Learned

  • The Livepeer core team executed the defined process for responding to the discovery of protocol issues (over a holiday weekend :) and reached an effective outcome before the issue could be exploited.
  • However, it still took over 24 hours to get a relatively straightforward change live.
  • The proxy-upgrade smart contract mechanism worked on mainnet and served its purpose of enabling protocol upgrades in the wild.
  • This sort of issue will likely happen again. The network is young, and the protocol is complex. These processes and mechanisms are in place so that we can be prepared and mitigate issues quickly. Let’s aim to be faster and more efficient next time.

Thanks to those who worked hard over the past few days to mitigate the impact of this issue. And thanks to the Berlin Community node for helping to find and report this issue in a responsible way.