I am having a weird issue with podman and the nvidia-container-toolkit. We use podman quadlets to deploy our services onto Jetson Orin NX devices. Unfortunately, because podman doesn't provide a PPA, we are stuck on podman 4.6.2 until NVIDIA releases 24.04 for the Jetson Orin boards.
Here is a sample quadlet showing how we deploy triton-server:
[Unit]
Description=Triton Server
Wants=network-online.target
After=network.target triton-load.service
[Container]
Image=docker.company.com/company-ecr/nvidia/tritonserver:25.05-py3-igpu
Notify=false
Pull=never
ContainerName=triton-server
HostName=triton-server
User=0:0
Volume=/opt/volumes/model_repository:/model_repository
Volume=/opt/volumes/tritoncache:/mnt/triton/cache
AddDevice=nvidia.com/gpu=all
PublishPort=8085:8000
PublishPort=8086:8001
Exec=tritonserver --model-repository=/model_repository --model-control-mode=explicit --load-model=*
LogDriver=json-file
PodmanArgs=--log-opt max-size=10mb
[Service]
Restart=always
RestartSec=5
TimeoutStartSec=4500
TimeoutStopSec=70
ExecStartPre=nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml --mode=csv
[Install]
WantedBy=multi-user.target
This is placed in /etc/containers/systemd and run as root through systemd.
This has been working great for us so far, but recent toolkit versions have gradually degraded our uptime. We have now hit a show-stopping issue with generating the CDI spec: on toolkit versions newer than 1.14.5-1 we see the errors below, and the Triton container spins and thrashes.
May 22 18:37:28 tegra-ubuntu triton-server[620324]: time="2026-05-22T18:37:28Z" level=warning msg="Could not locate /usr/lib/aarch64-linux-gnu/tegra: /usr/lib/aarch64-linux-gnu/tegra: not found"
...
bad magic number '[47 42 10 32]' in record at byte 0x0"
...
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all
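One detail that may help narrow this down: the four numbers in the "bad magic number" message look like decimal byte values read from the start of some file. The wording resembles what an ELF parser prints when a file does not begin with the ELF magic (127 69 76 70, i.e. 0x7f 'E' 'L' 'F'), though that is an assumption on our part. A minimal sketch for decoding them follows; `decode_bytes` is a made-up helper name, not part of the toolkit:

```shell
#!/bin/sh
# decode_bytes is a hypothetical helper for illustration only.
# It turns decimal byte values (as printed in the log) into raw characters.
decode_bytes() {
  for b in "$@"; do
    # Convert decimal to 3-digit octal, then have printf emit that byte.
    printf "\\$(printf '%03o' "$b")"
  done
}

# Bytes copied from the error message; od -c makes the newline and space visible.
decode_bytes 47 42 10 32 | od -c
```

The bytes decode to `/*` followed by a newline and a space, which is how a text file opening with a C-style comment would start. That suggests a text file is being read where a binary was expected, but we have not confirmed which file it is.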
The fix we have been applying on affected devices is:
sudo apt-get purge -y nvidia-container-toolkit nvidia-container-toolkit-base
sudo apt-get install -y nvidia-container-toolkit=1.14.5-1 nvidia-container-toolkit-base=1.14.5-1
sudo nvidia-ctk cdi generate --mode=csv --output=/etc/cdi/nvidia.yaml
And this has been working. Note that this happens on seemingly random devices; sometimes it doesn't happen at all on devices with the exact same specs and package versions across the board. This all feels like a symptom of a larger incompatibility. We're not sure what changed recently to cause this, but any advice would be appreciated! Thanks!
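As a stopgap we are considering guarding the regeneration step. The quadlet's ExecStartPre rewrites /etc/cdi/nvidia.yaml on every start, so with Restart=always a single bad generation replaces a spec that was working. The sketch below only installs a freshly generated spec if it defines a device named `all`. The helper names `spec_has_device` and `install_spec_if_valid` are hypothetical, and the check assumes the spec lists devices as `- name: all`, which we have not verified across toolkit versions:

```shell
#!/bin/sh
# spec_has_device and install_spec_if_valid are hypothetical helper names.

# Succeeds if the YAML file at $1 lists a CDI device called $2.
spec_has_device() {
  grep -Eq "^[[:space:]]*-?[[:space:]]*name:[[:space:]]*\"?$2\"?[[:space:]]*$" "$1"
}

# Moves candidate spec $1 over target $2 only if it defines the "all" device;
# otherwise the candidate is discarded and the existing target is left alone.
install_spec_if_valid() {
  if [ -s "$1" ] && spec_has_device "$1" all; then
    mv -f "$1" "$2"
  else
    echo "candidate CDI spec $1 has no 'all' device; keeping $2" >&2
    rm -f "$1"
    return 1
  fi
}

# Example wiring (needs root and the toolkit installed), left commented out:
# tmpdir=$(mktemp -d)
# nvidia-ctk cdi generate --mode=csv --output="$tmpdir/nvidia.yaml" \
#   && install_spec_if_valid "$tmpdir/nvidia.yaml" /etc/cdi/nvidia.yaml
```

This does not address whatever makes generation fail on newer toolkit versions; it would only keep a previously good spec from being overwritten.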