# `nvidia-smi` Input Plugin

This plugin queries the [`nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface) binary to pull GPU stats, including memory and GPU usage, temperature, and others.

### Configuration

```toml
# Pulls statistics from nvidia GPUs attached to the host
[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  # bin_path = "/usr/bin/nvidia-smi"

  ## Optional: timeout for GPU polling
  # timeout = "5s"
```

#### Windows

On Windows, `nvidia-smi` is generally located at `C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe`.
On Windows 10, it may also be found at `C:\Windows\System32\nvidia-smi.exe`.

You'll need to escape each `\` in `telegraf.conf`, like this: `C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe`

### Metrics
- measurement: `nvidia_smi`
  - tags
    - `name` (type of GPU, e.g. `GeForce GTX 1070 Ti`)
    - `compute_mode` (the compute mode of the GPU, e.g. `Default`)
    - `index` (the port index where the GPU is connected to the motherboard, e.g. `1`)
    - `pstate` (performance state of the GPU, e.g. `P0`)
    - `uuid` (a unique identifier for the GPU, e.g. `GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665`)
  - fields
    - `fan_speed` (integer, percentage)
    - `memory_free` (integer, MiB)
    - `memory_used` (integer, MiB)
    - `memory_total` (integer, MiB)
    - `power_draw` (float, W)
    - `temperature_gpu` (integer, degrees C)
    - `utilization_gpu` (integer, percentage)
    - `utilization_memory` (integer, percentage)
    - `pcie_link_gen_current` (integer)
    - `pcie_link_width_current` (integer)
    - `encoder_stats_session_count` (integer)
    - `encoder_stats_average_fps` (integer)
    - `encoder_stats_average_latency` (integer)
    - `clocks_current_graphics` (integer, MHz)
    - `clocks_current_sm` (integer, MHz)
    - `clocks_current_memory` (integer, MHz)
    - `clocks_current_video` (integer, MHz)

### Sample Query

The query below could be used to alert on the per-minute average temperature of your GPUs over the last five minutes:

```
SELECT mean("temperature_gpu") FROM "nvidia_smi" WHERE time > now() - 5m GROUP BY time(1m), "index", "name", "host"
```

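To make the grouping concrete, here is a minimal Python sketch of what `mean("temperature_gpu") ... GROUP BY time(1m), "index"` computes. The readings and the 60 degrees C threshold are hypothetical, the sketch groups by `index` only, and it is not how InfluxDB evaluates the query.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical readings: (seconds since window start, GPU index, temperature_gpu).
READINGS = [
    (0, "0", 53), (30, "0", 55), (60, "0", 61), (90, "0", 63),
    (0, "1", 50), (30, "1", 50), (60, "1", 52), (90, "1", 54),
]


def mean_per_minute(points):
    """Average temperature per (minute bucket, GPU index)."""
    buckets = defaultdict(list)
    for seconds, index, temp in points:
        # Integer division assigns each reading to its one-minute bucket.
        buckets[(seconds // 60, index)].append(temp)
    return {key: mean(values) for key, values in sorted(buckets.items())}


averages = mean_per_minute(READINGS)

THRESHOLD_C = 60  # hypothetical alert threshold
hot = [key for key, avg in averages.items() if avg > THRESHOLD_C]
print(hot)
```

With these readings, only GPU `0` in the second minute exceeds the threshold, so an alert rule built on the query would fire for that series alone.
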
### Troubleshooting

Check the full output by running the `nvidia-smi` binary manually.

Linux:
```
sudo -u telegraf -- /usr/bin/nvidia-smi -q -x
```

Windows:
```
"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" -q -x
```

Please include the output of this command when opening a GitHub issue.

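The same check can be scripted. The following is a rough Python sketch, not the plugin's own code: it looks up the binary (on `$PATH` unless a path is given), runs it with `-q -x` under a timeout, and confirms that the output is well-formed XML. The function names and messages are made up for illustration; the `5s` default mirrors the sample configuration.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET
from typing import Optional


def resolve_binary(bin_path: Optional[str] = None) -> Optional[str]:
    """Return an executable path, or None if nothing usable is found.

    A given bin_path is checked as-is; without one, "nvidia-smi" is
    searched on $PATH.
    """
    return shutil.which(bin_path or "nvidia-smi")


def check_nvidia_smi(bin_path: Optional[str] = None, timeout: float = 5.0) -> str:
    """Run `nvidia-smi -q -x`; return "ok" or a description of the first problem."""
    binary = resolve_binary(bin_path)
    if binary is None:
        return "nvidia-smi not found: set bin_path or fix $PATH"
    try:
        result = subprocess.run(
            [binary, "-q", "-x"], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"{binary} did not finish within {timeout}s"
    except OSError as exc:
        return f"could not execute {binary}: {exc}"
    if result.returncode != 0:
        return f"{binary} exited with status {result.returncode}: {result.stderr.strip()}"
    try:
        # Only checks that the output parses as XML, not its contents.
        ET.fromstring(result.stdout)
    except ET.ParseError as exc:
        return f"output is not well-formed XML: {exc}"
    return "ok"
```

Calling `check_nvidia_smi()` as the same user that runs Telegraf gives a one-line diagnosis; anything other than `ok` points at a path, permission, timeout, or driver problem worth mentioning in the issue.
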
### Example Output
```
nvidia_smi,compute_mode=Default,host=8218cf,index=0,name=GeForce\ GTX\ 1070,pstate=P2,uuid=GPU-823bc202-6279-6f2c-d729-868a30f14d96 fan_speed=100i,memory_free=7563i,memory_total=8112i,memory_used=549i,temperature_gpu=53i,utilization_gpu=100i,utilization_memory=90i 1523991122000000000
nvidia_smi,compute_mode=Default,host=8218cf,index=1,name=GeForce\ GTX\ 1080,pstate=P2,uuid=GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665 fan_speed=100i,memory_free=7557i,memory_total=8114i,memory_used=557i,temperature_gpu=50i,utilization_gpu=100i,utilization_memory=85i 1523991122000000000
nvidia_smi,compute_mode=Default,host=8218cf,index=2,name=GeForce\ GTX\ 1080,pstate=P2,uuid=GPU-d4cfc28d-0481-8d07-b81a-ddfc63d74adf fan_speed=100i,memory_free=7557i,memory_total=8114i,memory_used=557i,temperature_gpu=58i,utilization_gpu=100i,utilization_memory=86i 1523991122000000000
```

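To show how a line like these maps onto the tags and fields listed under Metrics, here is a simplified Python sketch that parses the first example line. It is an illustration only, not a full line-protocol parser: it handles backslash-escaped spaces in tag values and numeric fields, and nothing else. The `parse_line` helper is a made-up name.

```python
import re

# First line of the example output in this README.
LINE = (
    r"nvidia_smi,compute_mode=Default,host=8218cf,index=0,"
    r"name=GeForce\ GTX\ 1070,pstate=P2,"
    r"uuid=GPU-823bc202-6279-6f2c-d729-868a30f14d96 "
    r"fan_speed=100i,memory_free=7563i,memory_total=8112i,memory_used=549i,"
    r"temperature_gpu=53i,utilization_gpu=100i,utilization_memory=90i "
    r"1523991122000000000"
)


def parse_line(line: str):
    """Split one point into (measurement, tags, fields, timestamp).

    Simplified: no support for quoted string fields or escaped commas.
    """
    # Split on spaces that are not preceded by a backslash.
    head, field_part, timestamp = re.split(r"(?<!\\) ", line.strip())
    measurement, *tag_pairs = head.split(",")
    tags = {}
    for pair in tag_pairs:
        key, value = pair.split("=", 1)
        tags[key] = value.replace("\\ ", " ")  # unescape spaces
    fields = {}
    for pair in field_part.split(","):
        key, value = pair.split("=", 1)
        # A trailing "i" marks an integer field; anything else is read as float.
        fields[key] = int(value[:-1]) if value.endswith("i") else float(value)
    return measurement, tags, fields, int(timestamp)


measurement, tags, fields, ts = parse_line(LINE)
used_pct = 100 * fields["memory_used"] / fields["memory_total"]
print(tags["name"], f"{used_pct:.1f}% memory used")
```

For this sample, `memory_free` plus `memory_used` equals `memory_total`, which is a handy sanity check when comparing plugin output against `nvidia-smi` itself.
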
### Limitations
There appears to be an issue with reading the current memory clock (`clocks_current_memory`) when the memory is overclocked.
This may not affect every card, but it has been confirmed on an EVGA 2080 Ti.