Introducing CM PurplePill: Kubernetes GPU Monitoring Solution

Markku Räsänen

CM PurplePill is a lightweight Prometheus exporter that reveals per-pod GPU usage in Kubernetes—even without explicit GPU resource requests.

Overview

It’s time to give back to the open-source community again. This time, our infra specialist, Dan, has created a clever way to solve Kubernetes GPU monitoring. In this article, we discuss what the solution is, how it works, and why it’s important for the open-source community.

GitHub page: https://github.com/ConfidentialMind/cm-purplepill

Current Problems with Kubernetes GPU Monitoring

  • Standard Kubernetes GPU monitoring tools only track pods with explicit nvidia.com/gpu resource requests.
  • Many applications utilize GPUs without declaring these resource requests.
  • There are significant monitoring gaps where GPU usage is effectively “invisible.”
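To make the gap concrete, here is a minimal Python sketch of the check that declaration-based tools effectively depend on. The function name `declares_gpu` and the two sample manifests are hypothetical and only illustrate the idea; they are not part of CM PurplePill.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def declares_gpu(pod_manifest: dict) -> bool:
    """Return True if any container requests or limits nvidia.com/gpu."""
    spec = pod_manifest.get("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for container in containers:
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            if GPU_RESOURCE in (resources.get(section) or {}):
                return True
    return False


# Hypothetical pod that may still use a GPU at runtime but declares nothing.
undeclared = {"spec": {"containers": [{"name": "llm", "image": "example/llm:latest"}]}}

# Hypothetical pod with an explicit GPU limit.
declared = {
    "spec": {
        "containers": [
            {"name": "llm", "resources": {"limits": {"nvidia.com/gpu": 1}}}
        ]
    }
}

print(declares_gpu(undeclared))  # False
print(declares_gpu(declared))    # True
```

A pod like `undeclared` is the kind of workload that stays invisible to monitoring that keys on resource declarations.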

The Solution: CM PurplePill

CM PurplePill is a lightweight Prometheus exporter for NVIDIA GPU metrics in Kubernetes that tracks pod-level GPU usage without requiring explicit GPU resource declarations.

This matters because tools like NVIDIA DCGM can’t map GPU usage to pods without explicit nvidia.com/gpu declarations. As a result, you can’t reliably use features such as vLLM parallel GPU sharing, or assign less than 100% of a GPU to a pod, while still observing correct per-pod usage. With CM PurplePill, you are free to use any combination of parallelism parameters and still monitor actual per-pod GPU usage. CM PurplePill is also much lighter than NVIDIA DCGM and relies on fewer components.

Key Value

  • Complete visibility into all GPU workloads, including those without resource declarations.
  • Lightweight solution with a small operational footprint.
  • Full control over the monitoring stack.
  • Simple deployment as a DaemonSet on GPU-enabled nodes.

Note: The current release supports NVIDIA GPUs only. We plan to support AMD and other GPUs in upcoming releases, which should require only slight modifications.

Features

  • Exposes GPU metrics in Prometheus format.
  • Shows per-pod usage of GPUs.
  • Does not rely on GPU resource declarations in the Pod manifest (*).
  • Can show GPU usage by pods that claim less than a whole GPU (*).
  • Not tied to a particular GPU vendor by design, although the current release supports NVIDIA only (*).
  • Can run as a Kubernetes DaemonSet or as a host-level service.
  • With slight modification, can monitor GPU usage in any containerized environment (not limited to Docker-like runtimes).

* Unlike the NVIDIA DCGM Prometheus metrics exporter.

Core Metrics (Explained)

  • CM_PURPLEPILL_GPU_MEMORY_TOTAL_MIB: Total GPU memory, in MiB.
  • CM_PURPLEPILL_GPU_MEMORY_USED_TOTAL_MIB: Total GPU memory in use, in MiB.
  • CM_PURPLEPILL_GPU_MEMORY_FREE_MIB: Free GPU memory, in MiB.
  • CM_PURPLEPILL_GPU_UTILIZATION: GPU utilization, as a percentage.
  • CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB: GPU memory used by a specific pod, in MiB.
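As a rough illustration of what a scrape could return, the sketch below renders one sample in the Prometheus text exposition format. The metric name comes from the list above; the label names (`gpu`, `namespace`, `pod`), the sample values, and the helper `render_metric` are assumptions for illustration and may not match the labels CM PurplePill actually emits.

```python
def _escape(value: str) -> str:
    """Escape a label value as the Prometheus text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, value, labels: dict) -> str:
    """Render one sample line, e.g. name{label="x"} 42."""
    if not labels:
        return f"{name} {value}"
    label_text = ",".join(
        f'{key}="{_escape(str(val))}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


# Label names and values here are hypothetical.
line = render_metric(
    "CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB",
    2048,
    {"gpu": "0", "pod": "vllm-0", "namespace": "default"},
)
print(line)
# CM_PURPLEPILL_GPU_MEMORY_USED_POD_MIB{gpu="0",namespace="default",pod="vllm-0"} 2048
```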

Deployment Options

1) Kubernetes DaemonSet (Recommended)

All-in-one container deployment in Kubernetes, scheduled onto GPU nodes with a node selector:

  • Requires hostPID access to monitor processes.
  • Works with Prometheus Operator’s ScrapeConfig.
  • Uses standard NVIDIA software for Kubernetes hosts.
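The hostPID requirement suggests why per-pod attribution is possible at all: with access to host process IDs, a GPU process can be traced back to its pod. One common way to do that, sketched below, is to read the process's cgroup path and extract the pod UID from it. This is an assumption about the general technique, not a description of CM PurplePill's actual code; the function names and the sample cgroup path are hypothetical.

```python
import re
from typing import Optional

# Kubelet cgroup paths typically embed the pod UID after "pod", using "-"
# (cgroupfs driver) or "_" (systemd driver) as the separator.
_POD_UID = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")


def pod_uid_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract a pod UID from the contents of /proc/<pid>/cgroup, if present."""
    match = _POD_UID.search(cgroup_text)
    if match is None:
        return None
    return match.group(1).replace("_", "-")


def pod_uid_for_pid(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Read the cgroup file of a host PID; needs host PID namespace access."""
    try:
        with open(f"{proc_root}/{pid}/cgroup", encoding="utf-8") as handle:
            return pod_uid_from_cgroup(handle.read())
    except OSError:
        return None  # process exited or is not readable


# Hypothetical cgroup line as it might appear with the systemd cgroup driver.
sample = (
    "0::/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod0a1b2c3d_1111_2222_3333_444455556666.slice/"
    "cri-containerd-abc.scope"
)
print(pod_uid_from_cgroup(sample))  # 0a1b2c3d-1111-2222-3333-444455556666
```

A process that does not belong to any pod, such as a host-level service, yields `None` and can be reported separately.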

2) Direct Host Installation

Deployable as a systemd service or Docker container; installable via pip or from source.

Minimal dependencies

  • Python 3.7+
  • NVIDIA drivers with the nvidia-smi tool
  • No external Python packages (standard library only)
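Because the only inputs are the standard library and nvidia-smi, per-process GPU memory can be collected with a few lines of Python. The sketch below uses nvidia-smi's `--query-compute-apps` CSV output; whether CM PurplePill issues this exact query is an assumption, and the function names are hypothetical.

```python
import subprocess

# nvidia-smi prints one "pid, used_memory" row per compute process per GPU.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> dict:
    """Map PID to GPU memory in MiB, summed across GPUs."""
    usage = {}
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not all(part.isdigit() for part in parts):
            continue  # skip blank lines and rows with values such as "[N/A]"
        pid, mib = int(parts[0]), int(parts[1])
        usage[pid] = usage.get(pid, 0) + mib
    return usage


def query_compute_apps() -> dict:
    """Run nvidia-smi on the host and parse its output (needs NVIDIA drivers)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


# Parsing a hypothetical output sample, so no GPU is needed to try it.
sample_output = "1234, 2048\n5678, 512\n1234, 100\n\n999, [N/A]\n"
print(parse_compute_apps(sample_output))  # {1234: 2148, 5678: 512}
```

Keeping the parsing separate from the subprocess call makes the logic easy to test on machines without a GPU.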

Install docs & unit file: Installation (pip) · systemd service

CM PurplePill vs. NVIDIA DCGM

Digital sovereignty benefits explained:

Factor | CM PurplePill (Open Source) | NVIDIA DCGM (Proprietary)
Implementation Control | Open architecture with visibility into monitoring logic | Black-box implementation
Vendor Independence | Adaptable for non-NVIDIA GPUs by modifying the collection layer | NVIDIA-specific
Customizability | Easily modifiable for specific environments | Configuration limited to provided options

Conclusion

CM PurplePill offers complete visibility into all GPU workloads, including those without resource declarations — a long-standing challenge in Kubernetes. With its lightweight design and small operational footprint, it ensures efficient monitoring without GPU usage gaps. It provides full control over the monitoring stack and a simple DaemonSet deployment model for fast integration.

Start here: ConfidentialMind/cm-purplepill