Protocol Accounting Bug + Fix Retrospective



About 4 months into Livepeer’s Snowmelt Alpha release, the second bug in the live protocol smart contracts was discovered. Refer to technical details here. This was a non-critical bug: it could not be actively exploited at the dramatic expense of other users. It was discovered externally and patched over a 72 hour period. Issues like this are expected during the alpha and will likely occur again; they are the exact reason that the protocol is upgradable at this early phase. This post aims to provide transparency around the process used to discover, assess, and fix the issue, so that everyone can participate in and improve upon this process going forward.

The Issue
Inflationary token in Livepeer is generated each round (each day) and is intended to be split among transcoders and delegators in proportion to how much LPT they have staked, after subtracting a rewardCut which is withheld by the transcoder. Each user eventually accounts for this split by “claiming” their rewards for each round, which updates the internal accounting to move the inflationary token from a big pool into the user’s specific accounting data structure. This claiming process is necessary because of Ethereum gas limits: the computation has to be split across many transactions instead of being done all at once.
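As a rough illustration of the intended split for a single round, here is a minimal Python sketch. It is not the contract code: the function name `split_round_rewards`, its parameters, and the use of exact fractions instead of on-chain integer math are all assumptions made for readability.

```python
from fractions import Fraction


def split_round_rewards(round_rewards, reward_cut, stakes, transcoder):
    """Return the intended payout per address for one round (hypothetical helper).

    round_rewards: inflationary token minted for this transcoder's pool
    reward_cut:    fraction of the rewards withheld by the transcoder (0..1)
    stakes:        address -> staked LPT, including the transcoder's own stake
    transcoder:    address of the transcoder that receives the reward cut
    """
    total_stake = sum(stakes.values())
    cut = round_rewards * reward_cut      # withheld by the transcoder
    shared = round_rewards - cut          # split pro rata by stake
    payouts = {
        addr: shared * Fraction(stake, total_stake)
        for addr, stake in stakes.items()
    }
    payouts[transcoder] += cut
    return payouts
```

For example, with 1000 tokens, a 10% reward cut, and stakes of 100 (transcoder "T"), 300 ("A"), and 600 ("B"), this yields 190 for T, 270 for A, and 540 for B. Nothing in the calculation depends on when each party claims.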

The intention is that the order in which delegators and transcoders “claim” has no impact on the final amount of token received, as it is supposed to be entirely in proportion to stake. However, due to a bug in how the transcoder’s portion was calculated, the order in which calls were made had a slight impact on the amounts that delegators and transcoders received prior to the patch. As a result, some users saw their “pending stake” show one number and then, on returning later, saw it increase or decrease.
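To make the idea of order dependence concrete, here is a deliberately simplified toy model in Python. It is not the actual contract logic and does not reproduce the real bug: the `SharedPool` class, the `run_order` helper, and the specific mistake (computing the transcoder's cut from whatever is left in a single shared pool at claim time) are hypothetical, chosen only to show how one shared pool can make payouts depend on claim order.

```python
from fractions import Fraction


class SharedPool:
    """Toy single-pool accounting with an order-dependent cut (hypothetical)."""

    def __init__(self, rewards, reward_cut, stakes, transcoder):
        self.remaining = Fraction(rewards)
        self.reward_cut = reward_cut
        self.stakes = dict(stakes)
        self.claimable_stake = sum(stakes.values())
        self.transcoder = transcoder

    def claim(self, addr):
        stake = self.stakes.pop(addr)
        # Assumed flaw: the cut is derived from the *remaining* pool,
        # so it shrinks as other parties claim first.
        cut = self.remaining * self.reward_cut
        share = (self.remaining - cut) * Fraction(stake, self.claimable_stake)
        payout = share + (cut if addr == self.transcoder else 0)
        self.remaining -= payout
        self.claimable_stake -= stake
        return payout


def run_order(order, rewards, reward_cut, stakes, transcoder):
    """Process claims in the given order and return address -> payout."""
    pool = SharedPool(rewards, reward_cut, stakes, transcoder)
    return {addr: pool.claim(addr) for addr in order}


stakes = {"T": 100, "A": 300, "B": 600}
early = run_order(["T", "A", "B"], 1000, Fraction(1, 10), stakes, "T")
late = run_order(["A", "B", "T"], 1000, Fraction(1, 10), stakes, "T")
print(float(early["T"]), float(late["T"]))  # transcoder's payout differs by order
```

In this toy model the transcoder receives less when it claims after its delegators, which mirrors the kind of symptom described above without claiming to match the real numbers.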

Please review the full technical details to understand the accounting side effects.

The Effect
This issue should be classified as medium severity: it affected everyone slightly, but could not be actively exploited in the short term at the dramatic expense of another user. Here are some takeaways on the effects:

  • no one’s actual staked token was at risk - only future potential inflationary earnings.
  • some nodes ended up with slightly more inflationary LPT than in proportion to stake, and some ended up with slightly less. Every round was unique and the order of claiming mattered, so some users likely saw increases in one round and decreases in the next.
  • the biggest observable impact would have been on transcoding nodes with many delegators, where the transcoding node was late to claim for a round relative to all of its delegators. This is an unlikely scenario during the early stages of the network.
  • there are likely a couple thousand permanently staked inflationary LPT spread across all transcoders that don’t belong to anyone and aren’t withdrawable - think of this as the equivalent of a lost private key after staking.

The Process
Upon discovering this issue, some core team members entered into the following process:

  • Identify top priorities - protect protocol user value and trust in the protocol.
  • Analyze the issue to understand it.
  • Assess the various options for addressing the issue, including deciding whether to pause the protocol.
  • Conclude that pausing the protocol would not be helpful, as any lost rounds would be more harmful to participants than the slight variations in inflationary issuance.
  • Communicate the issue publicly, and steps for resolution - assuming a 48-72 hour fix window.
  • Communicate a minimal mitigation strategy that active transcoders could use to minimize the effects of the issue while it was still active.
  • Review and test the proposed fix - in this case a moderate update to the protocol smart contracts, tested on the Rinkeby test network and with significant unit and integration tests.
  • Deploy the fix.
  • Verify the fix.
  • Communicate the update and the fixed accounting - this occurred just under 72 hours after the initial report.

The Fix
To address these issues, a small update to the protocol was proposed:

  • Split the accounting for transcoder and delegator reward cuts and fee shares from one pool into multiple pools. This way only the transcoder can affect the transcoder share, by withdrawing the entire thing, and delegator claims on the pool cannot affect the calculations in an erroneous way.
  • Introduce additional integration tests that exercise the ordering of claims and ensure that certain invariants are met in the distribution of inflationary token.
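The two points above can be sketched together in Python. This is an illustration under stated assumptions, not the deployed contract code: the `SplitPools` class, the helper names, and the choice to compute delegator payouts from totals fixed at the start of the round are all hypothetical.

```python
from fractions import Fraction
from itertools import permutations


class SplitPools:
    """Toy accounting with separate transcoder and delegator pools (hypothetical)."""

    def __init__(self, rewards, reward_cut, stakes, transcoder):
        self.transcoder = transcoder
        self.transcoder_pool = rewards * reward_cut   # only the transcoder withdraws this
        self.delegator_pool = rewards - self.transcoder_pool
        self.total_stake = sum(stakes.values())       # fixed for the round
        self.stakes = dict(stakes)

    def claim(self, addr):
        stake = self.stakes.pop(addr)
        # Pro rata share of the delegator pool, based on round-level totals.
        payout = self.delegator_pool * Fraction(stake, self.total_stake)
        if addr == self.transcoder:
            payout += self.transcoder_pool
            self.transcoder_pool = 0
        return payout


def payouts_for_order(order, rewards, reward_cut, stakes, transcoder):
    """Process claims in the given order and return address -> payout."""
    pools = SplitPools(rewards, reward_cut, stakes, transcoder)
    return {addr: pools.claim(addr) for addr in order}


def check_invariants(rewards, reward_cut, stakes, transcoder):
    """Try every claim ordering and assert the distribution invariants."""
    results = [
        payouts_for_order(order, rewards, reward_cut, stakes, transcoder)
        for order in permutations(stakes)
    ]
    # Invariant 1: payouts do not depend on claim order.
    assert all(r == results[0] for r in results)
    # Invariant 2: all of the round's rewards are distributed, none stranded.
    assert sum(results[0].values()) == rewards
    return results[0]


check_invariants(1000, Fraction(1, 10), {"T": 100, "A": 300, "B": 600}, "T")
```

Exhaustive permutation testing only scales to a handful of participants, so a real test suite would more likely sample orderings. The invariants themselves are the useful part: order independence, and the full distribution of each round's rewards.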

Bug Bounty
Two independent delegators reported suspicious accounting around pending stake, which led the team to identify this issue. They will each receive a bug bounty for a “Medium” issue, as they helped surface the issue and complied with a responsible disclosure protocol. Thank you!

Lessons Learned

  • The Livepeer core team executed on the defined process in responding to the discovery of protocol issues, and got to an effective fix before the issue could be exploited.
  • The proxy-upgrade smart contract mechanism worked on mainnet, and served its purpose for allowing protocol upgrades in the wild.
  • This sort of issue will likely happen again. The network is young, and the protocol is complex. These processes and mechanisms are in place so that we can be prepared and mitigate issues quickly. Let’s aim to be faster and more efficient next time.
  • It is difficult to quantify the exact impact on each individual user without implementing a full, correct simulation of the protocol (which is what the actual protocol implementation was intended to be in the first place). While two users reported variations in their pending stake, not a single user reported that their inflationary rewards differed from what they expected. This suggests that users don’t have great visibility into what they should expect to be generating versus what they’re actually generating, and could benefit from better tools.
  • Tests and assert statements in the protocol itself could do a better job of ensuring certain invariants hold, such as the total inflationary token being distributed amongst transcoders + delegators after all claims are made.

Thanks to those who worked hard over the past few weeks to mitigate the impact of this issue. And thanks to those who helped find and report the issue in a responsible way.