Guide: Run Headless Linux Server with NVIDIA GPUs

Hey Transcoders!

If you are transcoding on Ubuntu Desktop, or on any Linux system that uses a display manager such as GDM, I highly recommend operating your node in a headless configuration. This guide shows how to set NVIDIA persistence mode, fan speed, and power settings using just an SSH terminal.

When running Livepeer on a default Ubuntu Desktop installation, we have noticed that NVIDIA GPUs occasionally “disconnect” when events like these occur (though not every time):

  • A monitor is connected or disconnected from the GPU.
  • A VNC desktop session is opened or closed.
  • Stream load is high on either GPU while GDM is running.
  • Certain GDM software upgrades are installed.

You can check whether GPU recycles have been affecting your node by searching the system logs (adjust the time window to suit):
    journalctl --since "2022-08-18 13:04:00" --until "2022-08-18 15:04:00" | grep 'NVIDIA(GPU-0)'
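
If you would rather get a quick count than read through raw entries, a small helper like the one below may be useful. It is only a sketch: the function name is made up for this example, and it assumes the disconnect entries look like the gdm-x-session lines quoted in this post.

```shell
# Count journal lines in which a display output on an NVIDIA GPU is reported
# as disconnected. Reads log text on stdin and prints the number of matches.
# (Hypothetical helper; the pattern is based on the log lines in this post.)
count_gpu_disconnects() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # function's exit status at zero in that case.
  grep -Ec 'NVIDIA\(GPU-[0-9]+\): .*: disconnected' || true
}
```

To use it, pipe the journal into the function, for example: journalctl --since "2022-08-18 13:04:00" --until "2022-08-18 15:04:00" | count_gpu_disconnects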

If you find entries like this then your NVIDIA GPUs are being recycled occasionally, and this is a major problem:

> Aug 14 20:04:19 whitebox /usr/libexec/gdm-x-session[10951]: (--) NVIDIA(GPU-0): DFP-0: disconnected
> Aug 14 20:04:19 whitebox /usr/libexec/gdm-x-session[10951]: (--) NVIDIA(GPU-0): DFP-0: Internal TMDS
> Aug 14 20:04:19 whitebox /usr/libexec/gdm-x-session[10951]: (--) NVIDIA(GPU-0): DFP-0: 330.0 MHz maximum pixel clock
> Aug 14 20:04:19 whitebox /usr/libexec/gdm-x-session[10951]: (--) NVIDIA(GPU-0):
> Aug 14 20:04:19 whitebox /usr/libexec/gdm-x-session[10951]: (--) NVIDIA(GPU-0): Data Export EP-HDMI-RX (DFP-1): connected
> Aug 14 20:04:19 whitebox /usr/libexec/gdm-x-session[10951]: (--) NVIDIA(GPU-0): Data Export EP-HDMI-RX (DFP-1): Internal TMDS
> Aug 14 20:04:19 whitebox /usr/libexec/gdm-x-session[10951]: (--) NVIDIA(GPU-0): Data Export EP-HDMI-RX (DFP-1): 600.0 MHz maximum pixel clock

Our theory is that when the NVIDIA kernel drivers are initialized at boot time by the X server that the desktop starts, they effectively run as a child of that process. If anything happens to the GDM process or the X server, your NVIDIA cards can be reset at any moment. To solve this, we make the following changes:

  • Disable GDM service
  • Disable VNC service (if installed) so it doesn’t recycle repeatedly looking for a running X server
  • Configure xorg.conf to allow for a headless server
  • Edit /root/.xinitrc to configure GPUs with nvidia-settings when X server starts
  • Start a dummy X server and let it exit.

The ideal configuration for transcoding is one where no processes other than Livepeer are using your GPUs. You can confirm this by running nvidia-smi.
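
If you want to script that check, something along these lines could work. Treat it as a rough sketch: the function name is made up, it assumes the process table that nvidia-smi prints has a standalone type column (G, C, or C+G) followed by the process name, and it assumes your transcoder's process name contains "livepeer".

```shell
# List GPU processes that are not Livepeer.
# Reads the normal `nvidia-smi` table on stdin and prints "TYPE NAME" for
# every process row whose name does not contain "livepeer".
# (Hypothetical helper; the table layout can differ between driver versions.)
list_non_livepeer_gpu_procs() {
  awk '$1 == "|" {
    for (i = 2; i < NF; i++) {
      # The type column is a standalone G, C, or C+G field.
      if ($i == "G" || $i == "C" || $i == "C+G") {
        if ($(i + 1) !~ /livepeer/) print $i, $(i + 1)
        break
      }
    }
  }'
}
```

Run it as nvidia-smi | list_non_livepeer_gpu_procs. No output means nothing other than Livepeer was found in the table.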


Let’s get started. Make sure that your transcoder is not in production before proceeding:

  1. Log into your node over SSH.

  2. Make a backup of your current xorg.conf file in case anything goes wrong. You can restore this and restart GDM to revert changes if needed (NOTE: Your system may not have one of these and you can skip this step if that’s the case):
    sudo cp /etc/X11/xorg.conf /etc/X11/xorg.conf.backup

  3. Confirm which processes are using GPUs:
    nvidia-smi

  4. Stop GDM and VNC server (if applicable):
    sudo systemctl stop gdm
    sudo systemctl stop vncserver-x11-serviced.service

  5. Ensure there are no processes on your GPU now:
    nvidia-smi

  6. Run nvidia-xconfig to generate an xorg.conf file based on currently installed hardware:
    sudo nvidia-xconfig --allow-empty-initial-configuration --enable-all-gpus --cool-bits=7

  7. Edit the X server init script:
    sudo nano /root/.xinitrc

  8. Add the commands you want to run to this file. They will run each time the X server is started. nvidia-settings will not work unless it can connect to a display (an X server), so we put the commands here and start X manually to execute them. I recommend starting with some query commands and confirming the output. This will show you the GPU and fan indexes and whether the commands are working:
    nvidia-settings -q fans
    nvidia-settings -q GpuPowerMizerMode

  9. Start the X server:
    sudo startx -- :1

  10. The X server will start, run the commands, print their output, and then shut down.

  11. Based on your findings, replace the query commands in /root/.xinitrc with a configuration similar to this:

nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=100"
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:1]/GPUTargetFanSpeed=100"
nvidia-settings -a "[gpu:1]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=100"
nvidia-settings -a "[gpu:1]/GPUFanControlState=1" -a "[fan:1]/GPUTargetFanSpeed=100"
nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
nvidia-settings -a "[gpu:1]/GpuPowerMizerMode=1"
nvidia-smi -pm 1

  • GPUFanControlState enables the fan speed to be configured manually
  • GPUTargetFanSpeed sets the fan speed for the given GPU and Fan index (some GPUs have multiple fans)
  • GpuPowerMizerMode sets the PowerMizer mode to Max Performance (value 1). We believe this improves transcode times. This isn’t overclocking; it simply turns off the power management feature.
  • nvidia-smi -pm 1 - Set the NVIDIA drivers to Persistence Mode (most important)
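
If you have several GPUs, typing these lines by hand gets repetitive. The sketch below prints the nvidia-settings lines for one GPU and a list of fan indexes so you can paste them into /root/.xinitrc. The function name and argument order are made up for this example, and which fan indexes belong to which GPU is something you should confirm from the query output on your own system.

```shell
# Print the nvidia-settings lines for one GPU and its fans.
# Usage: emit_gpu_settings GPU_INDEX FAN_SPEED FAN_INDEX...
# (Hypothetical helper; it only prints commands, it does not run them.)
emit_gpu_settings() {
  gpu_index=$1
  fan_speed=$2
  shift 2
  # Enable manual fan control on this GPU.
  printf 'nvidia-settings -a "[gpu:%s]/GPUFanControlState=1"\n' "$gpu_index"
  # Set the target speed for each fan index passed on the command line.
  for fan_index in "$@"; do
    printf 'nvidia-settings -a "[fan:%s]/GPUTargetFanSpeed=%s"\n' "$fan_index" "$fan_speed"
  done
  # Select the Max Performance PowerMizer mode for this GPU.
  printf 'nvidia-settings -a "[gpu:%s]/GpuPowerMizerMode=1"\n' "$gpu_index"
}
```

For example, emit_gpu_settings 0 100 0 1 prints four lines for GPU 0 with fans 0 and 1 at 100%. Remember to keep nvidia-smi -pm 1 in the file as well.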

  12. To test your configuration, run the X server again:
    sudo startx -- :1

  13. You should see output like this, confirming that each setting was applied:

Attribute 'GPUFanControlState' (whitebox:1[gpu:0]) assigned value 1.
Attribute 'GPUTargetFanSpeed' (whitebox:1[fan:0]) assigned value 100.
Attribute 'GPUFanControlState' (whitebox:1[gpu:0]) assigned value 1.
Attribute 'GPUTargetFanSpeed' (whitebox:1[fan:1]) assigned value 100.
Attribute 'GPUPowerMizerMode' (whitebox:1[gpu:0]) assigned value 1.
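
On a system with many GPUs it is easy to miss a failed line in that output. The helper below is a minimal sketch that counts the "assigned value" lines and flags anything containing "ERROR". The function name is made up, and the assumption that failures show up as lines containing "ERROR" is mine, so check it against what your driver version actually prints.

```shell
# Summarise captured startx/nvidia-settings output.
# Reads the output on stdin, prints "assigned=N errors=M", and returns
# success only if at least one value was assigned and no errors were seen.
# (Hypothetical helper; the "ERROR" marker is an assumption.)
check_settings_output() {
  captured=$(cat)
  assigned=$(printf '%s\n' "$captured" | grep -c 'assigned value' || true)
  errors=$(printf '%s\n' "$captured" | grep -c 'ERROR' || true)
  echo "assigned=$assigned errors=$errors"
  [ "$errors" -eq 0 ] && [ "$assigned" -gt 0 ]
}
```

One way to use it is sudo startx -- :1 2>&1 | check_settings_output, which also captures anything the X server writes to stderr.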

  14. Confirm that Persistence Mode is turned on:
    nvidia-smi

  15. If you’re happy with the configuration, you can fully disable GDM and the VNC server at boot:
    sudo systemctl disable gdm
    sudo systemctl disable vncserver-x11-serviced.service

  16. Now start the livepeer service and enjoy consistent transcoding performance! You will need to run sudo startx -- :1 after each system reboot to re-apply the settings, although a simple service could do that automatically (as long as another X server isn’t already running). You can always start GDM and the VNC server manually if you want access to the desktop, but I would recommend against doing that while transcoding.
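
As a starting point for automating the reboot step, a boot-time service could call a small wrapper like the one below. This is an untested sketch rather than something we run in production: the function name is made up, it assumes the X server process is named Xorg (typical on Ubuntu), and it assumes it runs as root with HOME set to /root so that /root/.xinitrc is picked up.

```shell
# Start the temporary X server on display :1 only if no X server is running.
# (Hypothetical wrapper; assumes the server process is named "Xorg" and that
# this runs as root with HOME=/root.)
apply_gpu_settings() {
  if pgrep -x Xorg >/dev/null 2>&1; then
    echo "An X server is already running; not starting another." >&2
    return 1
  fi
  # .xinitrc runs the nvidia-settings commands, then the X server exits.
  startx -- :1
}
```

The guard matters because starting a second X server while GDM is active would bring back the situation this guide is trying to avoid.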

Final Notes:
Since applying this solution we have not seen any driver recycles. This problem/solution took a long time for us to identify/solve and I hope this will save others the trouble of learning this on their own.

I believe Persistence Mode is likely the key to preventing kernel driver recycles, as noted in the official driver documentation below, and it can be set without shutting down the desktop. Nonetheless, the ability to configure all nvidia-settings options via SSH is an added benefit and provides a full solution for running a headless NVIDIA driver server on Ubuntu Desktop.

As noted in the NVIDIA documentation, the behavior of NVIDIA kernel drivers differs between Windows and Linux. Therefore NVIDIA Persistence Mode is not a requirement on Windows hosts.
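
Since Persistence Mode seems to matter most, it may be worth checking it from a script after each reboot. The sketch below is one way to do that. The function name is made up, and it assumes the CSV output of nvidia-smi reports each GPU as "index, Enabled" or "index, Disabled".

```shell
# Print the index of every GPU whose persistence mode is not "Enabled".
# Expects lines such as "0, Enabled" or "1, Disabled" on stdin.
# (Hypothetical helper; the input format is an assumption about nvidia-smi.)
gpus_without_persistence() {
  awk -F', *' '$2 != "Enabled" { print $1 }'
}
```

Feed it with nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | gpus_without_persistence. Empty output means every GPU reported Persistence Mode as enabled.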

I hope this guide is helpful to improving quality and uptime on the Livepeer network!

Additional Reading
