HPA with Custom GPU Metrics

Overview

The Kubernetes Horizontal Pod Autoscaler (HPA) is a fundamental component of Kubernetes that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed resource utilization or other custom metrics.


HPA Installation Guide

The Horizontal Pod Autoscaler is a built-in feature of Kubernetes, so there is no separate “installation” required for the controller itself. However, it relies on the Metrics Server to function correctly.

Step 1: Install the Metrics Server (Prerequisite)

The Metrics Server is a crucial component that collects resource usage data (CPU, memory) from all nodes and pods, which the HPA then uses to make scaling decisions.

Note: You can install the Metrics Server and the whole MLA stack in KKP by enabling the User Cluster Monitoring checkbox in the cluster settings. More information can be found here.
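
If you are not using the KKP MLA stack, the Metrics Server is commonly installed from the official components manifest of the metrics-server project. The command below is a sketch; for production use, pin a specific release instead of latest:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml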

Once running, you can test it by checking if you can retrieve node and pod metrics:

kubectl top nodes
kubectl top pods

Step 2: Configure Resource Requests

The HPA scales based on a percentage of the defined resource requests. If your Deployment does not have CPU requests defined, the HPA will not be able to function based on CPU utilization.

Ensure your workload’s YAML file (Deployment, ReplicaSet, etc.) includes a resources: requests block:

# Snippet from your Deployment YAML
spec:
  template:
    spec:
      containers:
        - name: my-container
          image: registry.k8s.io/hpa-example # A simple example image
          resources:
            requests:
              cpu: "200m"  # 200 milliCPU (0.2 CPU core)
            limits:
              cpu: "500m"  # Optional, but recommended

Step 3: Deploy the Horizontal Pod Autoscaler (HPA)

You can deploy the HPA using either a simple command or a declarative YAML file.

Option A: Using the kubectl autoscale Command (Quick Method)

This is the fastest way to create an HPA resource:

kubectl autoscale deployment [DEPLOYMENT_NAME] \
  --cpu-percent=50 \
  --min=2 \
  --max=10

  • [DEPLOYMENT_NAME]: Replace this with the actual name of your Deployment.
  • --cpu-percent=50: The HPA will try to maintain an average CPU utilization of 50% across all pods.
  • --min=2: The minimum number of replicas.
  • --max=10: The maximum number of replicas.

Option B: Using a Declarative YAML Manifest

For complex configurations (such as scaling on memory or custom metrics), a YAML manifest is the better choice. We recommend using the autoscaling/v2 API version for the latest features.

hpa-config.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    # Target the resource that needs to be scaled
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-demo-deployment # <-- REPLACE with your Deployment name

  minReplicas: 2
  maxReplicas: 10

  metrics:
    # Metric 1: Scale based on CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50 # Target average 50% CPU utilization

    # Metric 2: Scale based on Memory utilization (optional)
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 300Mi # Target average of 300 MiB of memory usage

Apply the HPA:

kubectl apply -f hpa-config.yaml

Step 4: Verify the HPA Status

Check that the HPA has been created and is monitoring your application:

kubectl get hpa

# Example Output:
# NAME         REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# my-app-hpa   Deployment/hpa-demo-deployment   0%/50%    2         10        2          2m

The TARGETS column shows the current utilization versus the target. If it shows <unknown> or reports a missing resource metric, double-check that your Metrics Server is healthy and that your Deployment has resource requests defined.
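
A quick way to confirm the resource metrics pipeline is healthy (the label selector assumes the default Metrics Server manifest):

# Verify the resource metrics API is registered and available
kubectl get apiservice v1beta1.metrics.k8s.io

# Verify the Metrics Server pods are running
kubectl -n kube-system get pods -l k8s-app=metrics-server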

For details on scaling decisions, check the events:

kubectl describe hpa my-app-hpa

This command will show the Conditions and Events sections, which explain when the HPA scaled up or down and why.

Setting Up HPA with DCGM Metrics

Autoscaling GPU-accelerated workloads in Kubernetes involves dynamically adjusting the number of Pods based on real-time utilization of the GPU resources. This process is more complex than scaling based on standard CPU or memory, as it requires setting up a dedicated Custom Metrics Pipeline to feed GPU-specific telemetry to the Horizontal Pod Autoscaler (HPA).


Scaling AI/ML Workloads with GPU Metrics

To enable autoscaling for AI/ML workloads based on GPU performance, you must establish a reliable source for those specialized metrics. In this document, we will use a custom GPU metrics pipeline that leverages the NVIDIA GPU Device Plugin and DCGM (Data Center GPU Manager) to collect GPU-specific performance metrics.

| Component | Role in the Pipeline |
| --- | --- |
| NVIDIA GPU Operator | The GPU Operator is an umbrella package that automates the deployment of all necessary NVIDIA components for Kubernetes. This stack includes the NVIDIA DCGM Exporter (Data Center GPU Manager Exporter). |
| Prometheus Server | Prometheus monitors applications running in the user clusters as well as system components running in the user clusters. |
| Prometheus Adapter | The Prometheus Adapter is a crucial component in Kubernetes that allows the Horizontal Pod Autoscaler (HPA) to scale workloads using custom metrics collected by Prometheus. |

Install NVIDIA GPU Operator

KKP offers the possibility to install the NVIDIA GPU Operator in the user cluster, either via our application catalog (available for enterprise customers) or by installing it manually in the user cluster. To install the operator via our application catalog, follow the instructions here.

To install the operator manually, follow the instructions here.
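
For reference, a manual installation typically uses NVIDIA's official Helm chart. The commands below are a sketch; the release name and namespace are assumptions, and your environment may require additional chart values (see the NVIDIA GPU Operator documentation):

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator (deploys the device plugin, DCGM exporter, and related components)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace

Once the operator's pods are running, you can confirm that GPU telemetry is being exposed. The service name, namespace, and port below are assumptions based on GPU Operator defaults; adjust them to your cluster:

# Port-forward the DCGM exporter Service (run in a separate terminal), then
# look for the GPU utilization gauge
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL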

Install Prometheus

We need Prometheus for the Prometheus Adapter because the Adapter relies on Prometheus as its source of metrics data. The Adapter itself does not collect metrics; its sole purpose is to translate and expose the metrics that Prometheus has already collected.

The adapter should be installed where the Prometheus server is running, as the adapter will be configured to query that Prometheus server. This can be achieved by installing the adapter in the seed cluster, where the user cluster Prometheus server is running.

Another approach is to run a Prometheus server in the user cluster directly, either via a Kubermatic custom application definition or by installing it manually via Helm:

# Add the Prometheus Community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Update your local Helm chart repository cache
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --create-namespace \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set alertmanager.enabled=false # Optional: Disable Alertmanager if you don't need alerts immediately
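
Prometheus also needs to scrape the DCGM exporter deployed by the GPU Operator. Recent GPU Operator chart versions can create a ServiceMonitor for you; if yours does not, the following is a minimal sketch. The namespace, label selector, and port name are assumptions based on GPU Operator defaults, so verify them against the Service actually created in your cluster:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  # With serviceMonitorSelectorNilUsesHelmValues=false (set above), Prometheus
  # discovers this ServiceMonitor without any extra release labels.
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - gpu-operator               # Namespace of the DCGM exporter Service (assumption)
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter    # Label on the DCGM exporter Service (assumption)
  endpoints:
    - port: gpu-metrics            # Port name on the DCGM exporter Service (assumption)
      interval: 15s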

Install Prometheus Adapter

The Prometheus Adapter is a crucial component in Kubernetes that allows the Horizontal Pod Autoscaler (HPA) to scale workloads using custom metrics collected by Prometheus.

Users can install the Prometheus Adapter in the user cluster via Helm by executing these commands:

For Helm 2:

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install --name my-release prometheus-community/prometheus-adapter

For Helm 3 (the release name is a mandatory argument):

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install my-release prometheus-community/prometheus-adapter

For more information on how to install the Prometheus Adapter, please refer to the official documentation.
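
For the GPU use case, the adapter also needs to know where Prometheus is reachable and how to expose the DCGM metric under a friendly name. The values file below is a sketch: the Prometheus URL assumes the kube-prometheus-stack release from the previous section, the alias dcgm_gpu_utilization_percent is a name chosen here, and the exported_namespace/exported_pod labels assume the default ServiceMonitor behaviour (if you set honorLabels: true, use namespace/pod instead):

# prometheus-adapter-values.yaml (sketch)
prometheus:
  url: http://kube-prometheus-stack-prometheus.monitoring.svc
  port: 9090

rules:
  custom:
    # Expose the DCGM GPU utilization gauge as a per-pod custom metric
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "dcgm_gpu_utilization_percent"
      # Average the utilization of all GPUs attached to each pod
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

Install the adapter with this file (helm install my-release prometheus-community/prometheus-adapter -f prometheus-adapter-values.yaml) and then check that the custom metrics API is serving data:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"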

Configuring the HPA with DCGM Metrics

Below is an example of an HPA configuration that scales based on GPU utilization, together with a GPU-enabled Deployment for it to target. Creating a Kubernetes Deployment that utilizes an NVIDIA GPU requires two main things: ensuring your cluster has the NVIDIA Device Plugin running (a prerequisite covered by the GPU Operator) and specifying the GPU resource in the Pod’s manifest.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-worker-deployment
  labels:
    app: gpu-worker
spec:
  replicas: 1 # Start with 1 replica, HPA will scale this up
  selector:
    matchLabels:
      app: gpu-worker
  template:
    metadata:
      labels:
        app: gpu-worker
    spec:
      # Node Selector (Optional but recommended)
      # This ensures the Pod is only scheduled on nodes labeled to have GPUs.
      nodeSelector:
        accelerator: nvidia

      containers:
        - name: cuda-container
          image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04 # Use a robust NVIDIA image
          command: ["/bin/bash", "-c"]
          args: ["nvidia-smi; sleep infinity"] # Print GPU info once, then keep the container running

          # --- GPU Resource Configuration (CRITICAL) ---
          resources:
            limits:
              # This is the line that requests a GPU resource from the cluster.
              # The value must be a whole number of GPUs; fractional sharing is done
              # via MIG profiles or time-slicing, which expose their own resource names.
              nvidia.com/gpu: "1"

            # Requests should be identical to limits for non-sharable resources like GPUs
            requests:
              nvidia.com/gpu: "1"

Next we will configure the HPA to scale based on the GPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-util-autoscaler
  namespace: default # Ensure this matches your deployment's namespace
spec:
  # 1. Target the Deployment created previously
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-worker-deployment

  # 2. Define scaling limits
  minReplicas: 1
  maxReplicas: 5 # Define the maximum number of GPU workers

  # 3. Define the custom metric
  metrics:
    - type: Pods # Metric applies to the pods managed by the Deployment
      pods:
        metric:
          # This name MUST match the metric alias defined in your 
          # Prometheus Adapter configuration (ConfigMap)
          name: dcgm_gpu_utilization_percent
        target:
          type: AverageValue
          # Scale up if the average GPU utilization across all pods exceeds 60%
          averageValue: 60
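
After applying both manifests, verify that the HPA can actually read the GPU metric. The file names below are placeholders, and the metric path assumes the dcgm_gpu_utilization_percent alias configured in the Prometheus Adapter and the default namespace:

kubectl apply -f gpu-worker-deployment.yaml
kubectl apply -f gpu-util-autoscaler.yaml

# Query the custom metrics API directly for the per-pod GPU utilization
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/dcgm_gpu_utilization_percent"

# Watch the HPA pick up the metric and scale the Deployment
kubectl get hpa gpu-util-autoscaler --watch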