Hey all, here’s a guide on how to set up Prometheus/Grafana monitoring for your GPU transcoder.
The dashboards and the Nvidia exporter are heavily inspired by (and partly copied from) the “official” Livepeer Monitoring - so many thanks to the original creator! You might want to try their Docker solution first. I was looking for a more modular approach since I was already running Grafana/Prometheus for other stuff. Also, thank you to @yondon for answering my questions on Discord.
I hope this guide works for you - please comment if you have any questions/problems. Let’s get started:
Grafana already comes with a systemd service that you can start with sudo systemctl start grafana-server
Now you should be able to reach “your-IP”:3000 in your browser (you might need to open the port in your server’s firewall settings). The default username and password are both “admin”; change them after your first login.
Once logged in, first add Prometheus under “configuration → data sources → add data source”. The URL is http://localhost:9090
Next, go to “Dashboards → Manage → Import” and paste the JSON of the dashboards that you want: Dashboards
Hi, I tried to set it all up. The Nvidia exporter works, but the Livepeer metrics don’t. If I check localhost:7935/metrics I can’t see anything there, but if I use 127.0.0.1:7935 I can see the metrics. How do I change the settings so that Prometheus reads the metrics from 127.0.0.1:7935 instead of localhost, which resolves to jason-pc:7935 (my PC’s name) when I open it in the browser?
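A quick way to narrow this down is to check what localhost actually resolves to on that machine. Below is a minimal Python sketch (the helper name resolve_addresses is made up for illustration). If localhost does not come back as 127.0.0.1, then using 127.0.0.1:7935 instead of localhost:7935 as the target in prometheus.yml is probably the simplest workaround, though I haven’t verified that on this exact setup.

```python
import socket


def resolve_addresses(host, port):
    """Return the unique IP addresses the OS resolves `host` to, in order.

    Returns an empty list if the name cannot be resolved at all.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    seen = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]  # first element is the address for IPv4 and IPv6
        if ip not in seen:
            seen.append(ip)
    return seen


# Example usage (7935 is the metrics port mentioned in this thread):
#   print(resolve_addresses("localhost", 7935))
#   print(resolve_addresses("127.0.0.1", 7935))
```

If the first call prints something other than 127.0.0.1 (an IPv6 address or a LAN address, for example), that would explain why the two URLs behave differently.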
Since Grafana showed that I’ve received a winning ticket but my Orchestrator didn’t automatically redeem it, I’ve decided to set up an alert in Grafana that notifies me about a winning ticket. Here’s how you do it:
Set up a notification channel by going to “Alerting” → “Notification channels” on the left menu bar. Click on “New channel”
Create/edit the “Winning Tickets” time series panel. The query should be this: sum(livepeer_winning_tickets_recv OR on() vector(0)). The sum and the OR on() vector(0) part make Prometheus return 0 instead of no data when there is no winning ticket yet.
Switch from the query to the “Alert” tab. Set the condition to WHEN diff() OF query(A, 15m, now) IS ABOVE 0, edit “Send to” to add the notification channel that you created in the first step, adjust the message to your liking, and that’s it.
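To make the condition a bit more concrete, here is a small Python sketch of what the alert rule amounts to. It assumes that diff() compares the newest value in the 15 minute window with the oldest one, which is how I understand the Grafana reducer; the function name diff_alert_fires is made up for illustration.

```python
def diff_alert_fires(samples, threshold=0.0):
    """Mimic 'WHEN diff() OF query(A, 15m, now) IS ABOVE threshold'.

    `samples` is a list of (timestamp, value) pairs from the window,
    oldest first. Assumption: diff() = newest value - oldest value.
    """
    values = [v for _, v in samples if v is not None]
    if len(values) < 2:
        return False  # nothing to compare yet
    return (values[-1] - values[0]) > threshold


# Counter stays at 3 for the whole window: no alert.
print(diff_alert_fires([(0, 3), (300, 3), (900, 3)]))
# Counter goes from 3 to 4 inside the window: alert fires.
print(diff_alert_fires([(0, 3), (300, 3), (900, 4)]))
```

This also shows why the vector(0) fallback matters: without it the window would contain no values at all before the first winning ticket, and there would be nothing to diff against.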
Amazing, thanks for this!
Would you consider adding avg GPU power draw to the Nvidia dashboard?
I have an Nvidia-made P400 and should soon be receiving a PNY P400 V2 revision (which has lower power consumption), and it would be nice to plot the difference in Grafana. It would also help to have another plot (in the user’s own currency) showing how much electricity the transcoding costs.
Something like nvidia-smi --query-gpu=power.draw --format=csv?
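For the cost idea, a rough Python sketch could look like the following. It is not part of the exporter; the function names and the example price are made up for illustration. It assumes nvidia-smi is called with --format=csv,noheader,nounits so that each GPU gives one plain number in watts, and it skips GPUs that report something non-numeric.

```python
import subprocess


def parse_power_draw(output):
    """Parse one power reading (in watts) per line; skip non-numeric lines."""
    watts = []
    for line in output.strip().splitlines():
        try:
            watts.append(float(line.strip()))
        except ValueError:
            continue  # e.g. a GPU that does not report power draw
    return watts


def read_power_draw():
    """Query all GPUs via nvidia-smi (requires the Nvidia driver tools)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_power_draw(result.stdout)


def daily_cost(avg_watts, price_per_kwh):
    """Electricity cost of running at `avg_watts` for 24 hours."""
    return avg_watts / 1000.0 * 24 * price_per_kwh
```

For example, daily_cost(30, 0.25) gives the cost per day of a card averaging 30 W at a hypothetical price of 0.25 per kWh in whatever currency you use.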
Just wanted to also add that you can use the free hosted instance of Grafana’s own cloud instead of installing it locally. I’ve set Prometheus to listen on my LAN IP, set up port forwarding on the router and Basic Auth for Prometheus, and it works well.
To add the power draw, you’d have to adjust the nvidia_exporter.go script and add it to the metrics function. Shouldn’t be too hard I think, have you tried it already?
I did try, but do not know Go and gave up after a few failed attempts and asked here. Appears there’s more to it than simply adding power.draw to the list in --query-gpu=name,index,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used. Think I need to add it to the metric list too, unsure…
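I haven’t checked how nvidia_exporter.go is structured, so take this only as the general idea, written in Python rather than Go: the new field has to be added to the query list and the code that reads each output row has to know about the extra column, otherwise the columns and field names no longer line up. The names FIELDS, build_query_args and parse_rows are made up for illustration, and the sample row in the comment is invented.

```python
import csv
import io

# Field list from the post above, with power.draw appended at the end.
FIELDS = [
    "name", "index", "temperature.gpu", "utilization.gpu",
    "utilization.memory", "memory.total", "memory.free", "memory.used",
    "power.draw",
]


def build_query_args(fields=FIELDS):
    """Build the nvidia-smi arguments for the given field list."""
    return ["--query-gpu=" + ",".join(fields), "--format=csv,noheader,nounits"]


def parse_rows(output, fields=FIELDS):
    """Map each CSV row to a dict keyed by field name.

    Rows whose column count does not match the field list are skipped,
    which is the kind of mismatch you get when a field is added to the
    query but not to the parsing side.
    """
    rows = []
    reader = csv.reader(io.StringIO(output), skipinitialspace=True)
    for raw in reader:
        if len(raw) != len(fields):
            continue
        rows.append(dict(zip(fields, raw)))
    return rows


# Invented sample row, one GPU:
#   parse_rows("Quadro P400, 0, 45, 12, 3, 2000, 1500, 500, 21.50\n")
```

In the real exporter the equivalent step would presumably also include registering a new Prometheus gauge for the power value, which is likely the “metric list” part you ran into.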
Yes, the default setup requires that Prometheus runs on the Orchestrator box. I’m also running Grafana on the Orchestrator - those two processes don’t require that many resources and it’s the simplest setup.
It is possible to set up remote monitoring, that’s how I do it. You just need to ensure port 7935 (or whatever port you set as the cliAddr) is open and set the listening address to zeros so it accepts requests from all IPs: -cliAddr 0.0.0.0:7935.
In the prometheus.yml config, instead of ‘localhost:7935’, set the targets to :7935.
I don’t think it’s safe to just have the cli port open to the internet so I use a firewall to only accept traffic from my monitoring node.
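To check from the monitoring node whether the port is actually reachable through the firewall, a small TCP connect test is usually enough. This is a minimal Python sketch; the helper name port_is_open is made up and the address in the usage comment is a documentation placeholder, not a real node.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example usage from the monitoring node (placeholder address):
#   print(port_is_open("203.0.113.10", 7935))
```

Running the same check from a machine that is not whitelisted should return False if the firewall rule works as intended.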
I’m trying to make sure that this is working correctly. Does it need to be done using a certain one of the dashboards or does it work with all of them? I tried to evaluate the query and I see this error.
I do not see the variable “livepeer_winning_tickets_recv” anywhere on the metrics browser.
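One thing worth checking is whether the metric is exposed by the node at all. Going by the note above about Prometheus returning no data when there is no winning ticket, it may simply not appear until the first winning ticket has been received, but that is a guess on my part. Here is a small Python sketch that lists the metric names found at the metrics endpoint; the function names are made up and the URL assumes the default localhost setup from this thread.

```python
import urllib.request


def metric_names(exposition_text):
    """Extract metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name is everything before the first '{' or whitespace.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def fetch_metrics(url="http://127.0.0.1:7935/metrics", timeout=5):
    """Download the raw metrics page from the node."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example usage on the Orchestrator box:
#   names = metric_names(fetch_metrics())
#   print("livepeer_winning_tickets_recv" in names)
```

If that prints False, the alert query only ever sees the vector(0) fallback, and the metric browser won’t list the name either.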