[Bug]: Podman fails with "unresolvable CDI devices" on Jetson Orin in nvidia-container-toolkit > 1.14.5-1 #1853

@skiman6010

I am having a weird issue with Podman and the nvidia-container-toolkit. We use Podman quadlets to deploy our services onto Jetson Orin NXs. Unfortunately, because Podman doesn't provide a PPA, we are stuck on 4.6.2 until NVIDIA releases 24.04 for the Jetson Orin boards.

A sample of how we deploy triton-server:

[Unit]
Description=Triton Server
Wants=network-online.target
After=network.target triton-load.service

[Container]
Image=docker.company.com/company-ecr/nvidia/tritonserver:25.05-py3-igpu
Notify=false
Pull=never
ContainerName=triton-server
HostName=triton-server
User=0:0
Volume=/opt/volumes/model_repository:/model_repository
Volume=/opt/volumes/tritoncache:/mnt/triton/cache
AddDevice=nvidia.com/gpu=all
PublishPort=8085:8000
PublishPort=8086:8001

Exec=tritonserver --model-repository=/model_repository --model-control-mode=explicit --load-model=*

LogDriver=json-file
PodmanArgs=--log-opt max-size=10mb

[Service]
Restart=always
RestartSec=5
TimeoutStartSec=4500
TimeoutStopSec=70
ExecStartPre=nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml --mode=csv

[Install]
WantedBy=multi-user.target

This is placed in /etc/containers/systemd and run as root through systemd.
This has been working great for us so far, but recent toolkit versions have gradually degraded our uptime. We have recently hit a showstopping issue with generating the CDI spec: on toolkit versions > 1.14.5-1 we see the errors below, and the Triton container thrashes in a restart loop.

May 22 18:37:28 tegra-ubuntu triton-server[620324]: time="2026-05-22T18:37:28Z" level=warning msg="Could not locate /usr/lib/aarch64-linux-gnu/tegra: /usr/lib/aarch64-linux-gnu/tegra: not found"
...
bad magic number '[47 42 10 32]' in record at byte 0x0"
...
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all
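A note on the "bad magic number" line, in case it helps triage: the four values in brackets are decimal bytes, and decoding them gives `/`, `*`, a newline, and a space. That is not the ELF magic (0x7f, `E`, `L`, `F`), so one reading is that some text file, possibly one beginning with a C-style comment, was parsed where a binary was expected. The wording resembles the format error produced by Go's `debug/elf` package, but that origin is an assumption on my part and not something the log confirms. The `decode_magic` helper below is hypothetical, written only to make the bytes readable:

```shell
#!/bin/sh
# Hypothetical helper: turn decimal byte values (as printed in the log)
# back into the raw characters they represent.
decode_magic() {
  for byte in "$@"; do
    # The inner printf renders the value as 3-digit octal; the outer printf
    # then interprets the resulting \NNN escape and emits that byte.
    printf "\\$(printf '%03o' "$byte")"
  done
}

# Bytes reported in the log line.
decode_magic 47 42 10 32 | od -c

# For comparison, the first four bytes of a real ELF file.
decode_magic 127 69 76 70 | od -c
```

Which file produced those bytes is not visible in the truncated log, so this only narrows down what kind of file was read, not which one.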

The workaround we have been applying on affected devices is:

sudo apt-get purge -y nvidia-container-toolkit nvidia-container-toolkit-base
sudo apt-get install -y nvidia-container-toolkit=1.14.5-1 nvidia-container-toolkit-base=1.14.5-1
sudo nvidia-ctk cdi generate --mode=csv --output=/etc/cdi/nvidia.yaml
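A possible sanity check after regenerating the spec, sketched under some assumptions: that `nvidia-ctk cdi list` prints each resolvable device name (such as `nvidia.com/gpu=all`) somewhere in its output, and that the spec lives at the path used above. `has_cdi_device` and `check_cdi` are hypothetical helper names, not part of the toolkit:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given device name appears anywhere
# in the listing supplied on stdin.
has_cdi_device() {
  grep -qF -- "$1"
}

# Hypothetical wrapper: confirm the spec file exists and is non-empty, then
# confirm the device requested by AddDevice= is actually listed.
check_cdi() {
  spec=/etc/cdi/nvidia.yaml
  if [ ! -s "$spec" ]; then
    echo "CDI spec missing or empty: $spec" >&2
    return 1
  fi
  # Assumption: 'nvidia-ctk cdi list' includes each device name in its output.
  nvidia-ctk cdi list 2>&1 | has_cdi_device "nvidia.com/gpu=all"
}

# Usage (run as root on the device):
#   check_cdi && echo "nvidia.com/gpu=all is resolvable"
```

Running something like this on a failing device and a healthy one with the same package versions might show whether the generated spec itself differs between them.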

And this has been working. I will note that this happens on seemingly random devices; sometimes it doesn't happen at all on devices with the exact same specs and package versions across the board. This all feels like a symptom of a larger incompatibility. We are not sure what changed recently to cause our issues, but some advice would be appreciated. Thanks!

Labels: bug, needs-triage
