Thermovision monitors physical GPU 0 instead of profiled CUDA-visible device #35

Description

@lucifer1004

Version: nsight-python 0.9.6
Scenario: multi-GPU system
Env: CUDA_VISIBLE_DEVICES=4
Observed: with the default thermal_mode="auto", @nsight.analyze.kernel waits on the temperature of physical GPU 0
Expected: Thermovision should monitor the profiled CUDA device, or honor CUDA_VISIBLE_DEVICES, or expose an explicit thermal device option
Impact: Profiling can hang or time out before the annotated kernel launches when GPU 0 is hot or busy, even though the profiled GPU is idle
Workaround: pass thermal_mode="off"

Evidence: direct ncu profiled the GEMM in seconds; nsight-python default path timed out after 300 s with “No kernels were profiled”; after thermal_mode="off", the same candidate completed in ~12.6 s and captured all metrics.
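
To illustrate the "honor CUDA_VISIBLE_DEVICES" option from the Expected line, here is a minimal sketch of how a logical CUDA index could be mapped back to the device named in the environment variable. The helper name `resolve_visible_device` is hypothetical and is not part of nsight-python. It only parses the variable. It does not account for CUDA_DEVICE_ORDER, so a numeric result may not match the NVML or nvidia-smi index unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.

```python
import os
from typing import Mapping, Optional, Union


def resolve_visible_device(
    logical_index: int = 0,
    env: Optional[Mapping[str, str]] = None,
) -> Union[int, str]:
    """Map a CUDA logical device index to its CUDA_VISIBLE_DEVICES entry.

    Hypothetical helper for illustration only. Returns an int for numeric
    entries, or the raw string for UUID-style entries ("GPU-...", "MIG-...").
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable unset: all GPUs are visible and indices are unchanged.
        return logical_index

    # Split the comma-separated list, ignoring surrounding whitespace.
    entries = [e.strip() for e in raw.split(",") if e.strip()]
    if not 0 <= logical_index < len(entries):
        raise IndexError(
            f"logical device {logical_index} is not visible "
            f"(CUDA_VISIBLE_DEVICES={raw!r})"
        )

    entry = entries[logical_index]
    return int(entry) if entry.isdigit() else entry
```

For the scenario in this report, `resolve_visible_device(0, {"CUDA_VISIBLE_DEVICES": "4"})` returns `4`, which is the device a thermal monitor would need to watch instead of GPU 0.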
