Discussion: Orchestrator Incident Escalation Pipeline

Hi everyone,

My name is Alex, but some of you may know me as “syrinx node”.

As a TPM at LivePeer and an orchestrator myself, I see one recurring pain point in our community: efficiently communicating and raising issues with LivePeer Inc. and its engineers. Critical issues could be breaking changes to Go-LivePeer or an ingest node going down for a few hours. Less critical issues could be UI improvements to the LivePeer explorer, persistent (non-breaking) bugs in Go-LivePeer, or feature requests that aren’t yet defined well enough to be submitted as a GitHub issue.

As the LivePeer org has grown, keeping communication with engineering efficient and balanced, with ample context, has become paramount. Unfortunately, it has become more and more difficult to handle this process solely with GitHub issues and scattered Discord threads. That said, we love Discord and would like to propose a draft solution to make this process as easy as possible for orchestrators.

Below is v0.1 of this process. If you have suggestions or disagree with the approach, please respond below; ideally, in a week or so we’ll have something we can deploy.

O Incident Escalation Pipeline

The first iteration will take the form of a ticketing plugin for Discord: tickets will be created directly from Discord and assigned a ticket number.

Creating a ticket will require:

  • category describing the nature of the issue
    • incident / outage
    • service abnormality / broken feature
    • feature request
    • long term eng request (in response to broken feature or service abnormality)
  • description including context of issue (will require a minimum length)
  • indication of severity
    • this may be removed, but if used properly would help triage faster
    • Low - service is functioning well, but would greatly benefit by raising this issue
      • should require more context than others
      • each should require a minimum length explanation / description of what is requested
    • Medium - service is functioning but sub-optimally
    • High - service is down / immediately negatively affecting LivePeer network
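
To make the required fields above a bit more concrete, here is a rough sketch of how a ticket could be represented and checked before it is accepted. Nothing here is tied to a specific Discord plugin (none has been selected), and the names `Ticket`, `validate`, and both minimum lengths are placeholders I made up for illustration; the proposal only says that a minimum length will exist and that Low severity should carry more context.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Category(Enum):
    INCIDENT = "incident / outage"
    ABNORMALITY = "service abnormality / broken feature"
    FEATURE_REQUEST = "feature request"
    LONG_TERM_ENG = "long term eng request"


class Severity(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


# Assumed values: the proposal does not specify actual minimum lengths.
MIN_DESCRIPTION_CHARS = 80
MIN_DESCRIPTION_CHARS_LOW = 200  # Low severity should require more context


@dataclass
class Ticket:
    number: int
    category: Category
    severity: Severity
    description: str


def validate(ticket: Ticket) -> List[str]:
    """Return a list of problems; an empty list means the ticket is acceptable."""
    problems = []
    if ticket.severity is Severity.LOW:
        required = MIN_DESCRIPTION_CHARS_LOW
    else:
        required = MIN_DESCRIPTION_CHARS
    if len(ticket.description.strip()) < required:
        problems.append(f"description needs at least {required} characters")
    return problems
```

With these placeholder thresholds, a High severity ticket with an 80-character description would pass, while a Low severity ticket with the same description would be sent back for more context.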

Once a ticket is submitted, its ticket number should be used when other Discord users refer to a similar issue. Ideally, this will help us understand how widespread the issue is, instead of multiple users submitting identical tickets.
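
As a rough illustration of how referencing ticket numbers could show how many people are affected, the sketch below counts distinct users mentioning each ticket. It assumes a made-up convention where a ticket is referenced as `#<number>` in a message, and that messages are available as `(author, text)` pairs; neither is decided yet.

```python
import re
from collections import defaultdict

# Assumed convention: a ticket is referenced as "#<number>", e.g. "#12".
TICKET_REF = re.compile(r"#(\d+)\b")


def reporters_per_ticket(messages):
    """Map each referenced ticket number to the count of distinct authors."""
    reporters = defaultdict(set)
    for author, text in messages:
        for number in TICKET_REF.findall(text):
            reporters[int(number)].add(author)
    return {number: len(authors) for number, authors in reporters.items()}
```

Counting distinct authors rather than raw mentions keeps one very vocal user from inflating how widespread an issue looks.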

For now, tickets will be triaged by me or other members of the product team and then escalated to engineering as necessary. For instance, urgent queries regarding transcoder infrastructure (ingest nodes, test streams, etc.) will be routed with higher priority than feature requests or UI bugs.
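
One way to picture that routing is a simple priority queue: higher severity first, then category, then submission order. This is only a sketch under my own assumptions; the rank tables and the `TriageQueue` name are hypothetical, and the real ordering will be a judgment call by whoever is triaging.

```python
import heapq
from itertools import count

# Assumed rankings (lower is handled first); not part of the proposal itself.
SEVERITY_RANK = {"High": 0, "Medium": 1, "Low": 2}
CATEGORY_RANK = {
    "incident / outage": 0,
    "service abnormality / broken feature": 1,
    "long term eng request": 2,
    "feature request": 3,
}


class TriageQueue:
    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: earlier submissions come first

    def submit(self, ticket_number, category, severity):
        key = (SEVERITY_RANK[severity], CATEGORY_RANK[category], next(self._order))
        heapq.heappush(self._heap, (key, ticket_number))

    def next_ticket(self):
        """Return the most urgent ticket number, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]
```

In this sketch an ingest node outage filed as High would be pulled ahead of a Low severity feature request even if the feature request was submitted first.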

When a ticket is completed, a response will be posted with the ticket number. Ideally, this will make responses easier to find and help keep a better record of recurring errors or queries.

Again, this is a draft proposal. We’re curious to hear your feedback and suggestions to improve this before its initial implementation next week.

Eager to hear your feedback!

I think a ticket system is a great idea and that keeping it in Discord is the right approach as anything off-platform seems to get less engagement (at least for now).

Looking forward to seeing this implemented!

Great idea! I agree that keeping it within Discord is a good idea from an orchestrator’s perspective, but I wonder if it might be time-consuming to keep tickets up to date or remove duplicates. Maybe we can enlist some community members to help manage the first line of tickets?

I think another thing that would go hand in hand with this is a very basic status webpage, where the larger known issues and their effects are listed in one place. For example, ingest nodes that are known to be down for a while (like fra and lax now) or errors like the ZeroSegments error. There is an existing website which isn’t being updated, but it might be possible to repurpose it for this.

I agree with all of the above and will volunteer to help manage “the first line of tickets” if that’s the direction we decide to go.

Thanks stronk,

I can definitely get an update as to whether the statuspage.io site is being properly updated / whether or not we’re actually reporting incidents there.

I’d definitely be open to identifying a few community members (as others here have mentioned as well) to help cover the primary time-zones where most orchestrators are, to help dedupe / triage in discord.

This looks great, thanks for putting this together, Alex. You mentioned the Discord plugin; is there one in particular that you’ve been considering? It’s a great idea to link the original communication with the issue.

Also, I know the team relies pretty heavily on GH issues. Are there any plugins similar to ZenHub which deeply integrate with GH issues? Or would this be a separate tool whose tickets get triaged into Eng? If the latter, I’d recommend checking out Linear as a triaging tool.

Love the idea :slight_smile:
Can’t wait to see the implementation.

Thanks Chase!

The impetus is to ease some triage pressure on the eng team and also help ensure orchestrators are getting support in a more complete loop.

Ideally, the incident pipeline / ticketing layer here will act as middleware between the community and GitHub issues. We’ll obviously still keep issues open, but we’ll generally try to escalate faster via the incident pipeline, ideally with more consistent scoping / structure. In theory, only serious incidents will become GitHub issues quickly, and other “nice to haves” will become GitHub issues with better context.

We haven’t selected a plugin yet. It’s great to see someone else suggest Linear, as it’s such a well-built tool, but for our first iteration I’m not sure we’re going to add another service.

Any updates on this? Lack of communication about ingest nodes being taken out of rotation is still a major pain point. I can’t remember if we have ever had all ingest nodes in production for a sustained period since the Confluence upgrade, but there has never been any communication on when or why ingest nodes are taken down.

Not leveraging the orchestrator community to identify issues is also a missed opportunity, as they are usually the first to notice an issue thanks to their elaborate monitoring setups.
