The Kubernetes Horizontal Pod Autoscaler (HPA) is a fundamental component of Kubernetes that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed resource utilization or other custom metrics.
The Horizontal Pod Autoscaler is a built-in feature of Kubernetes, so there is no separate “installation” required for the controller itself. However, it relies on the Metrics Server to function correctly.
The Metrics Server is a crucial component that collects resource usage data (CPU, memory) from all nodes and pods, which the HPA then uses to make scaling decisions.
Note: You can install the Metrics Server and the whole MLA (Monitoring, Logging & Alerting) stack in KKP by enabling the User Cluster Monitoring checkbox in the cluster settings. More information can be found here.
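If you are not using the KKP monitoring stack, the Metrics Server can also be installed manually. A minimal sketch using the official manifest (pin a specific release in production rather than `latest`):

```bash
# Install the latest Metrics Server release
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Wait for the deployment to become ready
kubectl -n kube-system rollout status deployment/metrics-server
```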
Once running, you can test it by checking if you can retrieve node and pod metrics:
```bash
kubectl top nodes
kubectl top pods
```
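If `kubectl top` returns errors, check whether the resource metrics API is registered and reachable:

```bash
# The APIService should report Available=True
kubectl get apiservice v1beta1.metrics.k8s.io

# Query the raw resource metrics API directly
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
```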
The HPA scales based on a percentage of the defined resource requests. If your Deployment does not have CPU requests defined, the HPA will not be able to function based on CPU utilization.
Ensure your workload’s YAML file (Deployment, ReplicaSet, etc.) includes a `resources.requests` block:
```yaml
# Snippet from your Deployment YAML
spec:
  template:
    spec:
      containers:
      - name: my-container
        image: registry.k8s.io/hpa-example # A simple example image (k8s.gcr.io is deprecated)
        resources:
          requests:
            cpu: "200m" # 200 milliCPU (0.2 CPU core)
          limits:
            cpu: "500m" # Optional, but recommended
```
You can deploy the HPA using either a simple command or a declarative YAML file.
The `kubectl autoscale` command is the quickest way to create an HPA resource:
```bash
kubectl autoscale deployment [DEPLOYMENT_NAME] \
  --cpu-percent=50 \
  --min=2 \
  --max=10
```
- `[DEPLOYMENT_NAME]`: Replace this with the actual name of your Deployment.
- `--cpu-percent=50`: The HPA will try to maintain an average CPU utilization of 50% across all pods.
- `--min=2`: The minimum number of replicas.
- `--max=10`: The maximum number of replicas.

For complex configurations (such as scaling on memory or custom metrics), a YAML manifest is better. We recommend using the `autoscaling/v2` API version for the latest features.
Save the following as `hpa-config.yaml`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    # Target the resource that needs to be scaled
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-demo-deployment # <-- REPLACE with your Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # Metric 1: Scale based on CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50 # Target average 50% CPU utilization
  # Metric 2: Scale based on absolute memory usage (optional)
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 300Mi # Target average of 300 MiB of memory usage per pod
```
Apply the HPA:
```bash
kubectl apply -f hpa-config.yaml
```
Check that the HPA has been created and is monitoring your application:
```bash
kubectl get hpa

# Example Output:
# NAME         REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# my-app-hpa   Deployment/hpa-demo-deployment   0%/50%    2         10        2          2m
```
The TARGETS column shows the current utilization versus the target. If it shows `<unknown>` or the events report a missing resource metric, double-check that your Metrics Server is healthy and your Deployment has resource requests defined.
For details on scaling decisions, check the events:
```bash
kubectl describe hpa my-app-hpa
```
This command will show the Conditions and Events sections, which explain when the HPA scaled up or down and why.
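To see the HPA react to load, you can generate artificial traffic. The sketch below assumes the Deployment is exposed through a Service named `hpa-demo-service` (hypothetical; substitute your own Service name):

```bash
# Run a temporary pod that requests the service in a tight loop
# (hpa-demo-service is a hypothetical Service name)
kubectl run load-generator --rm -it --image=busybox --restart=Never -- \
  /bin/sh -c "while sleep 0.01; do wget -q -O- http://hpa-demo-service; done"

# In a second terminal, watch the HPA scale the Deployment up and back down
kubectl get hpa my-app-hpa --watch
```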
Autoscaling GPU-accelerated workloads in Kubernetes involves dynamically adjusting the number of Pods based on real-time utilization of the GPU resources. This process is more complex than scaling based on standard CPU or memory, as it requires setting up a dedicated Custom Metrics Pipeline to feed GPU-specific telemetry to the Horizontal Pod Autoscaler (HPA).
To enable autoscaling for AI/ML workloads based on GPU performance, you must establish a reliable source for those specialized metrics. In this document, we will use a custom GPU metrics pipeline that leverages the NVIDIA GPU Device Plugin and DCGM (Data Center GPU Manager) to collect GPU-specific performance metrics.
| Component | Role in the Pipeline |
|---|---|
| NVIDIA GPU Operator | The GPU Operator is an umbrella package that automates the deployment of all necessary NVIDIA components for Kubernetes. This stack includes the NVIDIA DCGM Exporter (Data Center GPU Manager Exporter). |
| Prometheus Server | Prometheus scrapes and stores metrics from applications and system components running in the user cluster, including the GPU telemetry exposed by the DCGM Exporter. |
| Prometheus Adapter | The Prometheus Adapter translates the metrics collected by Prometheus into the Kubernetes custom metrics API, which allows the Horizontal Pod Autoscaler (HPA) to scale workloads on those metrics. |
KKP offers two ways to install the NVIDIA GPU Operator in the user cluster: via our application catalog (available to enterprise customers) or manually. To install the operator via our application catalog, follow the instructions here.
To install the operator manually, follow the instructions here.
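Once the operator is installed, you can verify that its components, including the DCGM exporter, are up (assuming the default `gpu-operator` namespace; adjust if you installed it elsewhere):

```bash
# All operator pods should reach Running or Completed
kubectl get pods -n gpu-operator

# The DCGM exporter DaemonSet provides the GPU metrics used later in this guide
kubectl get daemonset -n gpu-operator
```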
We need Prometheus because the Prometheus Adapter relies on it as its source of metrics data. The Adapter itself does not collect metrics; its sole purpose is to translate and expose the metrics that Prometheus has already collected.
The Adapter should be installed where the Prometheus server is running, because the Adapter is configured to query that server. One option is to install the Adapter in the seed cluster, where the user cluster Prometheus server runs. Alternatively, you can run a Prometheus server in the user cluster directly, either via a Kubermatic custom application definition or by installing it manually with Helm:
```bash
# Add the Prometheus Community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Update your local Helm chart repository cache
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --create-namespace \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set alertmanager.enabled=false # Optional: disable Alertmanager if you don't need alerts immediately
```
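After the stack is up, you can confirm that Prometheus is scraping the DCGM exporter by port-forwarding the Prometheus service (the name below follows from the release name used above) and querying a well-known DCGM metric such as `DCGM_FI_DEV_GPU_UTIL`:

```bash
# Forward the Prometheus API to localhost
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &

# A non-empty result means GPU telemetry is reaching Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'
```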
As noted above, the Prometheus Adapter translates the metrics Prometheus has collected into the Kubernetes custom metrics API that the HPA consumes. You can install the Prometheus Adapter in the user cluster via Helm by executing these commands:
For Helm 2 (deprecated):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install --name my-release prometheus-community/prometheus-adapter
```

For Helm 3 (the release name is a mandatory positional argument):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install my-release prometheus-community/prometheus-adapter
```
For more information on how to install the Prometheus Adapter, please refer to the official documentation.
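The HPA manifest later in this guide refers to a metric named `dcgm_gpu_utilization_percent`, which must be defined as a rule in the Prometheus Adapter configuration (rendered into the adapter's ConfigMap by the chart). The following is a minimal sketch of a Helm `values.yaml` for the `prometheus-adapter` chart that maps the DCGM exporter's `DCGM_FI_DEV_GPU_UTIL` metric to that name. The Prometheus URL and the `namespace`/`pod` label names are assumptions and may differ in your environment (some scrape setups expose them as `exported_namespace`/`exported_pod`):

```yaml
# values.yaml sketch for prometheus-community/prometheus-adapter.
# Adjust the Prometheus URL and label names to match your setup.
prometheus:
  url: http://kube-prometheus-stack-prometheus.monitoring.svc
  port: 9090
rules:
  custom:
  # Expose DCGM GPU utilization as a per-pod custom metric
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "dcgm_gpu_utilization_percent"
    metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```

Once the adapter is running, you can check that the metric is served through the custom metrics API:

```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/dcgm_gpu_utilization_percent"
```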
Below is an example of a Deployment that consumes a GPU, followed by an HPA configuration that scales it based on GPU utilization. Creating a Kubernetes Deployment that utilizes an NVIDIA GPU requires two main things: ensuring your cluster has the NVIDIA Device Plugin running (a prerequisite) and specifying the GPU resource in the Pod’s manifest.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-worker-deployment
  labels:
    app: gpu-worker
spec:
  replicas: 1 # Start with 1 replica; the HPA will scale this up
  selector:
    matchLabels:
      app: gpu-worker
  template:
    metadata:
      labels:
        app: gpu-worker
    spec:
      # Node Selector (optional but recommended):
      # ensures the Pod is only scheduled on nodes labeled as having GPUs.
      nodeSelector:
        accelerator: nvidia
      containers:
      - name: cuda-container
        image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04 # A robust NVIDIA image
        command: ["/bin/bash", "-c"]
        args: ["nvidia-smi; sleep infinity"] # Example command to keep the container running
        # --- GPU Resource Configuration (CRITICAL) ---
        resources:
          limits:
            # This line requests a GPU resource from the cluster.
            # GPU counts must be whole integers; fractional sharing is done via
            # MIG profiles (e.g. nvidia.com/mig-1g.5gb) or time-slicing, not
            # fractional values here.
            nvidia.com/gpu: "1"
          # Requests must be identical to limits for extended resources like GPUs
          requests:
            nvidia.com/gpu: "1"
```
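Before applying this Deployment, it is worth confirming that at least one node advertises the `nvidia.com/gpu` resource and carries the `accelerator: nvidia` label used by the nodeSelector above; a quick sketch:

```bash
# List allocatable GPUs per node (dots in the resource name must be escaped)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Confirm which nodes carry the label used by the nodeSelector
kubectl get nodes -l accelerator=nvidia
```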
Next, we configure the HPA to scale based on GPU utilization:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-util-autoscaler
  namespace: default # Ensure this matches your deployment's namespace
spec:
  # 1. Target the Deployment created previously
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-worker-deployment
  # 2. Define scaling limits
  minReplicas: 1
  maxReplicas: 5 # Define the maximum number of GPU workers
  # 3. Define the custom metric
  metrics:
  - type: Pods # Metric applies to the pods managed by the Deployment
    pods:
      metric:
        # This name MUST match the metric alias defined in your
        # Prometheus Adapter configuration (ConfigMap)
        name: dcgm_gpu_utilization_percent
      target:
        type: AverageValue
        # Scale up if the average GPU utilization across all pods exceeds 60%
        averageValue: "60"
```
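After applying the HPA, verify that it can actually read the custom metric. If the pipeline is misconfigured, the TARGETS column shows `<unknown>` and the events report reasons such as `FailedGetPodsMetric`:

```bash
# TARGETS should show a value instead of <unknown> once metrics flow
kubectl get hpa gpu-util-autoscaler

# Conditions and Events explain any failures to fetch the custom metric
kubectl describe hpa gpu-util-autoscaler
```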