While dual ethash mining and transcoding is possible on an Nvidia GPU, previous testing demonstrated that it results in both a reduction in hashrate and degraded transcoding performance. However, those tests used a very simple setup: the mining process and the transcoding process simply ran on the same machine using the same GPU. Each host process creates its own CUDA context, and activity associated with different CUDA contexts is serialized on the GPU. As a result, mining and transcoding were not actually executed concurrently.
Given that transcoding primarily occurs on the GPU's NVENC/NVDEC chips rather than on its CUDA cores, I was curious whether there was a way to avoid serializing mining and transcoding. It turns out that serialization across two processes can be avoided by using CUDA Multi-Process Service (MPS). An MPS server collects activity from multiple processes and passes it to the GPU using a single CUDA context, allowing the activity from those processes to be executed concurrently (if it can run concurrently in the first place).
So, I decided to run some benchmarks for ethash mining and transcoding with MPS enabled using an Nvidia GeForce RTX 3080 10G [1]. See below for the results.
[1] MPS does not support usage of NVENC/NVDEC on pre-Volta architecture GPUs. The RTX 3080 uses the Ampere architecture, which fulfills this requirement.
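For reference, enabling MPS on Linux before launching the miner and the transcoder looks roughly like the sketch below. The device index and the exclusive compute mode setting are assumptions for a single-GPU machine, not necessarily the exact procedure used for these benchmarks.

```bash
# Minimal MPS setup sketch (single GPU assumed; device index 0 is an assumption).
export CUDA_VISIBLE_DEVICES=0

# Optional: restrict the GPU to a single CUDA context owner so client
# processes are forced to go through the MPS server.
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon; client processes (ethminer, ffmpeg/lpms-bench)
# launched afterwards funnel their CUDA work through a shared context.
nvidia-cuda-mps-control -d

# ... run the mining + transcoding benchmarks ...

# Shut down the MPS control daemon when finished.
echo quit | nvidia-cuda-mps-control
```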
Benchmarks
Hardware
- GPU: 1x GeForce RTX 3080 10G
- CPU: AMD Ryzen Threadripper 1950X 16-Core Processor 2.17 GHz
- RAM: 65 GB
- Nvidia Driver: 455.23.04
- CUDA: 10.2
Testing Tools
- ethminer
- ffmpeg script
- lpms-bench script
  - Note: Would like to standardize this soon!
- 2-minute video clip segmented into roughly 2-second segments
Both transcoding scripts accept an m3u8 playlist as input, which is used to fetch the segments to transcode.
First, I used ffmpeg to quickly test a few setups. Then, I used lpms-bench to get a better sense of the expected transcoding performance of a go-livepeer transcoder node, since it uses the same transcoding code under the hood.
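The actual ffmpeg script isn't reproduced here, but the core of an NVDEC/NVENC transcode of an HLS playlist looks roughly like the sketch below. The playlist URL, bitrate, and output path are placeholders.

```bash
# Sketch of a GPU transcode of an HLS playlist: decode on NVDEC, keep frames on
# the GPU, encode on NVENC. URL and rendition settings are placeholders.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
       -i https://example.com/stream/index.m3u8 \
       -c:v h264_nvenc -b:v 4M -c:a copy \
       out.mp4
```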
ethminer configuration
These are two configurations of ethminer that I tried out to control the amount of work sent to the GPU.
Config # | --cuda-streams | --cuda-block-size | --cuda-grid-size |
---|---|---|---|
1 | 2 (default) | 128 (default) | 8192 (default) |
2 | 1 | 64 | 4096 |
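As a concrete sketch, config #2 corresponds to an invocation roughly like the one below. The pool URL and wallet are placeholders, and the exact flag names can vary between ethminer versions, so treat this as an illustration rather than the exact command used.

```bash
# Sketch of ethminer config #2 (placeholder pool/wallet; -U selects CUDA mining).
ethminer -U \
  -P stratum1+tcp://<WALLET>.<WORKER>@<POOL_HOST>:<POOL_PORT> \
  --cuda-streams 1 \
  --cuda-block-size 64 \
  --cuda-grid-size 4096
```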
Baseline
These are the baseline metrics for standalone ethminer mining, standalone ffmpeg transcoding and standalone lpms-bench transcoding.
- ethminer config #1 hashrate = 86.29Mh
- ethminer config #2 hashrate = 80.21Mh (-7.05% relative to ethminer config #1)
- ffmpeg 1 stream transcode time = 7.66s
- lpms-bench 1 stream transcode time = 8.088s
Hashrate diff is calculated relative to the hashrate of ethminer config #1.
Transcode time diff is calculated relative to the transcode time of ffmpeg 1 stream.
ethminer + ffmpeg
All tests were run with 1 stream.
ethminer config | mps? | hashrate (Mh) | hashrate diff (%) | transcode time (s) | transcode time diff (%) |
---|---|---|---|---|---|
1 | no | 73.28 | -15.07 | 34.827 | +354.89 |
2 | no | 69.72 | -19.2 | 17.892 | +133.69 |
1 | yes | 86.29 | 0 | DNF | N/A |
2 | yes | 78.32 | -9.23 | 10.671 | +39.38 |
Using MPS with ethminer config #1 (default) actually resulted in terrible transcoding performance. Transcoding did not even finish! The output of nvidia-smi dmon during this test showed that encoder and decoder utilization was consistently at 1%. I suspect this was because ethminer maxed out streaming multiprocessor and memory utilization, preventing much additional activity from happening concurrently.
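For anyone reproducing this, the relevant utilization counters can be watched with nvidia-smi's device monitor; this is a minimal sketch, not necessarily the exact invocation used:

```bash
# Watch per-engine utilization while the benchmarks run; -s u selects the
# utilization metrics, which include the sm, mem, enc and dec columns.
nvidia-smi dmon -i 0 -s u
```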
Meanwhile, using MPS with ethminer config #2 resulted in a dramatic improvement in transcoding performance and a nice improvement in hashrate as well. Transcoding in this test was over 3x faster than in the simple dual mining + transcoding test (no MPS, ethminer config #1), and the hashrate was 1.06x higher. I suspect that switching to ethminer config #2 slightly reduced streaming multiprocessor utilization, which allowed more activity to be executed concurrently.
ethminer + lpms-bench
All tests were run with ethminer config #2 and MPS enabled.
# streams | hashrate (Mh) | hashrate diff (%) | transcode time (s) |
---|---|---|---|
1 | 78.64 | -8.86 | 11.874 |
2 | 78.32 | -9.24 | 20.85 |
3 | 77.75 | -9.9 | 30.146 |
This setup transcodes 3 concurrent streams faster than the simple dual mining + transcoding setup transcodes 1 stream and also achieves a higher hashrate.
ethminer + lpms-bench (simulate live stream)
All tests were run with lpms-bench simulating a live stream by waiting the duration of each segment before submitting it to the GPU. I did not record the transcode times because I expect them to be the same as those recorded in the previous section, since the only difference in these tests is the delay added between segment submissions to the GPU.
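Conceptually, simulating a live stream just means pacing segment submission at real time. A rough bash analog of that pacing (not the actual lpms-bench code; segment names and rendition settings are placeholders) might look like this:

```bash
# Rough analog of the "simulate live stream" mode: wait out each segment's
# duration before handing it to the GPU for transcoding.
for seg in seg_*.ts; do
  dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$seg")
  sleep "$dur"   # pace submission at the segment's wall-clock duration
  ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i "$seg" \
         -c:v h264_nvenc -b:v 4M -c:a copy "out_$seg"
done
```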
# streams | hashrate (Mh) | hashrate diff (%) |
---|---|---|
1 | 79.22–80.34 | -8.19 to -6.89 |
2 | 79.05–80.16 | -8.39 to -7.1 |
3 | 79–79.91 | -8.44 to -7.39 |
Hashrate fluctuates a bit more when simulating a live stream, likely due to the variable rate at which segments are submitted to the GPU, which depends on the duration of each segment.
Simulating a live stream seems to improve hashrate. I think this makes sense because the delay between segments should result in more periods of time where the GPU is only mining.
Observations
While the benchmarks above only cover a small range of configurations, they demonstrate that you can substantially improve dual ethash mining and transcoding performance by using MPS and tweaking ethminer parameters, at least on a Volta-or-newer architecture GPU such as the GeForce RTX 3080 10G. I suspect there is a lot more room for optimization here. Better dual ethash mining and transcoding performance means that transcoders can potentially transcode more streams (and earn more) while continuing to mine!