This is a guide on how to deploy and configure the Gateway API Inference Extension with kgateway v2.1.1 in a KKP User Cluster. The Inference Extension enables intelligent routing for AI inference workloads and provides model-aware load balancing and request routing.
Before getting started, here is a quick, high-level overview of the inference routing architecture. This is not KKP specific. The main components are the Gateway (implemented by kgateway/Envoy), the HTTPRoute attached to it, the InferencePool with its Endpoint Picker (EPP) that selects a model server pod for each request, and the model server Deployment itself (llama.cpp or vLLM in this guide). First, install the Gateway API CRDs in the User Cluster (UC):
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
Verify the CRDs were installed and are present in the UC:
kubectl get crd | grep gateway
You should see the following CRDs installed in the UC:
gatewayclasses.gateway.networking.k8s.io
gateways.gateway.networking.k8s.io
httproutes.gateway.networking.k8s.io
grpcroutes.gateway.networking.k8s.io
referencegrants.gateway.networking.k8s.io
kgateway also requires additional policy CRDs, which are provided by a separate Helm chart:
helm upgrade -i \
--create-namespace \
--namespace kgateway-system \
--version v2.1.1 \
kgateway-crds \
oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
Verify the CRDs are installed:
kubectl get crd | grep kgateway
Expected output:
backendconfigpolicies.gateway.kgateway.dev
backends.gateway.kgateway.dev
directresponses.gateway.kgateway.dev
gatewayextensions.gateway.kgateway.dev
gatewayparameters.gateway.kgateway.dev
httplistenerpolicies.gateway.kgateway.dev
trafficpolicies.gateway.kgateway.dev
We will install kgateway v2.1.1 (the latest release at the time of writing) with the inference extension enabled; the --set inferenceExtension.enabled=true flag is essential for the Inference Extension integration to work.
helm upgrade -i \
--create-namespace \
--namespace kgateway-system \
--version v2.1.1 \
--set inferenceExtension.enabled=true \
kgateway \
oci://cr.kgateway.dev/kgateway-dev/charts/kgateway
Check GatewayClass:
kubectl get gatewayclass
Expected:
NAME       CONTROLLER              ACCEPTED   AGE
kgateway   kgateway.dev/kgateway   True       30s
Install the Inference Extension CRDs:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.1.0/manifests.yaml
kubectl get crd | grep inference
Expected output:
inferenceobjectives.inference.networking.x-k8s.io
inferencepools.inference.networking.k8s.io
inferencepools.inference.networking.x-k8s.io
We will create a namespace for the AI workloads:
kubectl create namespace ai-inference
For TLS certificates, you can use cert-manager and manage it through KKP: create an ApplicationInstallation with cert-manager as the application, and the KKP application controller will install cert-manager into your User Cluster.
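A minimal sketch of such an ApplicationInstallation, assuming cert-manager is available in your KKP application catalog (the version string is a placeholder; use a version your catalog actually offers):
apiVersion: apps.kubermatic.k8c.io/v1
kind: ApplicationInstallation
metadata:
  name: cert-manager
  namespace: cert-manager        # create this namespace first, or place the resource in another existing namespace
spec:
  namespace:
    name: cert-manager           # target namespace for the cert-manager workloads
    create: true
  applicationRef:
    name: cert-manager           # must match the application name in your KKP catalog
    version: "<catalog version>" # placeholder; pick a version offered by your catalog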
Alternatively, you can generate a self-signed TLS certificate for testing the Gateway: specify a CN matching the hostname you will use and create a Kubernetes TLS secret named ai-gateway-tls containing the certificate. For production, cert-manager is recommended.
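For the self-signed option, a minimal sketch looks like this (ai.example.com is the example hostname used in the curl tests later in this guide; adjust the CN/SAN to the Host header you plan to send):
# Generate a self-signed certificate and key (the -addext flag requires OpenSSL 1.1.1+)
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=ai.example.com" \
  -addext "subjectAltName=DNS:ai.example.com"
# Store it in the secret that the Gateway listener will reference
kubectl create secret tls ai-gateway-tls \
  --cert=tls.crt --key=tls.key \
  -n ai-inference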
Deploy a model server with the labels that the InferencePool will later select. Choose the option that matches your hardware. For testing we choose llama.cpp, but you can deploy any model server you like based on your hardware; the remaining steps apply with minor tweaks if you choose another model server. Save the following manifest as llama-cpp-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp-inference
  namespace: ai-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-cpp-inference
  template:
    metadata:
      labels:
        app: llama-cpp-inference
    spec:
      containers:
      - name: llama-cpp
        image: ghcr.io/ggerganov/llama.cpp:server
        args:
        - --hf-repo
        - bartowski/Llama-3.2-1B-Instruct-GGUF
        - --hf-file
        - Llama-3.2-1B-Instruct-Q4_K_M.gguf
        - --host
        - "0.0.0.0"
        - --port
        - "8000"
        - -c
        - "2048"
        - --n-gpu-layers
        - "0"
        - -t
        - "20"
        ports:
        - containerPort: 8000
          name: http
        resources:
          requests:
            cpu: "15"
            memory: "4Gi"
          limits:
            cpu: "20"
            memory: "6Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          failureThreshold: 30
Create the deployment:
kubectl apply -f llama-cpp-deployment.yaml
Wait for the pods to become ready (the model is downloaded on first start, so this can take several minutes):
kubectl wait --for=condition=ready pod -l app=llama-cpp-inference -n ai-inference --timeout=600s
You can also choose a vLLM deployment, but the llama.cpp one should serve as a good starting point for testing the feature in our KKP cluster. Below is an example deployment with vLLM; save it as vllm-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-model
  namespace: ai-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-model
  template:
    metadata:
      labels:
        app: vllm-model
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-3.2-1B-Instruct
        - --dtype
        - auto
        - --api-key
        - token-abc123
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: <your value>
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
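Rather than putting the Hugging Face token inline, you may prefer to keep it in a Secret; a small sketch, using a hypothetical secret name hf-token:
kubectl create secret generic hf-token \
  -n ai-inference \
  --from-literal=HUGGING_FACE_HUB_TOKEN=<your value>
Then reference it from the container instead of a literal value:
env:
- name: HUGGING_FACE_HUB_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: HUGGING_FACE_HUB_TOKEN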
Create the deployment:
kubectl apply -f vllm-deployment.yaml
Wait for pods to be ready:
kubectl wait --for=condition=ready pod -l app=vllm-model -n ai-inference --timeout=300s
Create the Gateway resource with an HTTPS listener and TLS termination:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-gateway
  namespace: ai-inference
spec:
  gatewayClassName: kgateway
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: ai-gateway-tls
    allowedRoutes:
      namespaces:
        from: Same
Apply the Gateway resource:
kubectl apply -f gateway.yaml
Wait for the Gateway to be programmed and receive an external IP:
kubectl wait --for=condition=Programmed gateway/ai-gateway -n ai-inference --timeout=300s
Get the Gateway external IP:
kubectl get gateway ai-gateway -n ai-inference
Expected output:
NAME         CLASS      ADDRESS                 PROGRAMMED   AGE
ai-gateway   kgateway   <should not be blank>   True         2m
Why TLS is required:
TLS is not optional for the Inference Extension. The ext_proc filter in Envoy strips the Content-Length header when reading request bodies to extract model names; this is why the Gateway in this guide is configured with an HTTPS listener that terminates TLS.
Next, install the InferencePool and its Endpoint Picker (EPP) with the official Helm chart, pointing it at the labels of your model server Deployment. For llama.cpp:
helm install llama-cpp-picker \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
--version v1.1.0 \
--namespace ai-inference \
--set inferencePool.modelServers.matchLabels.app=llama-cpp-inference \
--set provider.name=kgateway
For vLLM:
helm install vllm-model-picker \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
--version v1.1.0 \
--namespace ai-inference \
--set inferencePool.modelServers.matchLabels.app=vllm-model \
--set provider.name=kgateway
Verify the Endpoint Picker deployment:
# For llama.cpp
kubectl get deployment,pod -n ai-inference -l app.kubernetes.io/name=llama-cpp-picker-epp
# For vLLM
kubectl get deployment,pod -n ai-inference -l app.kubernetes.io/name=vllm-model-picker-epp
Check the InferencePool resource:
kubectl get inferencepool -n ai-inference
Expected output (llama.cpp shown; for vLLM the name is vllm-model-picker):
NAME               AGE
llama-cpp-picker   30s
InferencePool details:
# For llama.cpp
kubectl describe inferencepool llama-cpp-picker -n ai-inference
# For vLLM
kubectl describe inferencepool vllm-model-picker -n ai-inference
If you encounter an error like “InferencePool not found” in the HTTPRoute status after installation, restart kgateway:
kubectl rollout restart deployment kgateway -n kgateway-system
kubectl rollout status deployment kgateway -n kgateway-system
The cause is usually a timing issue: kgateway started before the InferencePool CRD was registered.
Next, create an HTTPRoute that attaches to the Gateway and routes traffic to the InferencePool. Update the name field in backendRefs to match your Endpoint Picker installation.
For llama.cpp:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ai-inference-route
  namespace: ai-inference
spec:
  parentRefs:
  - name: ai-gateway
    sectionName: https
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: llama-cpp-picker
      port: 8000
For vLLM:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ai-inference-route
  namespace: ai-inference
spec:
  parentRefs:
  - name: ai-gateway
    sectionName: https
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-model-picker
      port: 8000
Apply the HTTPRoute:
kubectl apply -f httproute.yaml
Check the HTTPRoute status:
kubectl get httproute ai-inference-route -n ai-inference
The route should show ACCEPTED status with parent references properly resolved. If you see ResolvedRefs: False with “InferencePool not found” you should check the Troubleshooting section below.
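To inspect the detailed route conditions (for example, the ResolvedRefs reason), you can print them directly; jq is optional and only used for pretty-printing:
kubectl get httproute ai-inference-route -n ai-inference \
  -o jsonpath='{.status.parents[0].conditions}' | jq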
Test the /v1/models endpoint to verify basic connectivity:
GATEWAY_IP=$(kubectl get gateway ai-gateway -n ai-inference -o jsonpath='{.status.addresses[0].value}')
curl -k https://${GATEWAY_IP}/v1/models \
-H "Host: ai.example.com"
Expected response for llama.cpp:
{
  "object": "list",
  "data": [
    {
      "id": "/root/.cache/llama.cpp/bartowski_Llama-3.2-1B-Instruct-GGUF_Llama-3.2-1B-Instruct-Q4_K_M.gguf",
      "object": "model",
      "created": 1730000000,
      "owned_by": "llamacpp",
      "meta": {
        "n_ctx_train": 131072,
        "n_embd": 2048,
        "n_params": 1235814432,
        "size": 799862912
      }
    }
  ]
}
Test the inference routing with a chat completion request. Use the appropriate model ID based on your deployment:
For llama.cpp:
curl -k https://${GATEWAY_IP}/v1/chat/completions \
-H "Host: ai.example.com" \
-H "Content-Type: application/json" \
-d '{
"model": "/root/.cache/llama.cpp/bartowski_Llama-3.2-1B-Instruct-GGUF_Llama-3.2-1B-Instruct-Q4_K_M.gguf",
"messages": [
{
"role": "user",
"content": "Say random fact of the day"
}
],
"max_tokens": 100
}' | jq -r '.choices[0].message.content'
You should receive a response containing the model’s answer with a random fact of the day.
Send multiple requests to verify load balancing across pods.
For llama.cpp:
for i in {1..10}; do
echo "Request $i:"
curl -sk https://${GATEWAY_IP}/v1/chat/completions \
-H "Host: ai.example.com" \
-H "Content-Type: application/json" \
-d '{
"model": "/root/.cache/llama.cpp/bartowski_Llama-3.2-1B-Instruct-GGUF_Llama-3.2-1B-Instruct-Q4_K_M.gguf",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10
}' | jq -r '.choices[0].message.content'
sleep 1
done
All requests should complete successfully, demonstrating that traffic is being routed through the Endpoint Picker to healthy pods.
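If you want a rough confirmation that requests are spread across the replicas, you can count how often each pod logged a chat completion request; the grep pattern is an assumption about the llama.cpp server log format, so adjust it to what your pods actually log:
# Count matching log lines per llama.cpp pod (pattern is an assumption, adjust as needed)
for pod in $(kubectl get pods -n ai-inference -l app=llama-cpp-inference -o name); do
  echo "$pod: $(kubectl logs -n ai-inference "$pod" --tail=500 | grep -c 'chat/completions')"
done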
If the Gateway remains in Pending state:
Check kgateway logs:
kubectl logs -n kgateway-system deployment/kgateway --tail=100
Check that the CRDs are installed:
kubectl get crd | grep -E '(gateway|kgateway|inference)'
Verify that the kgateway pod is running and ready:
kubectl get pods -n kgateway-system
If the HTTPRoute shows NOT ACCEPTED, describe it to see why:
kubectl describe httproute ai-inference-route -n ai-inference
If requests fail with connection errors:
Verify server pods are running:
kubectl get pods -n ai-inference -l app=llama-cpp-inference
Check logs:
kubectl logs -n ai-inference -l app=llama-cpp-inference --tail=100
Verify InferencePool can discover pods:
# Check InferencePool status
kubectl get inferencepool llama-cpp-picker -n ai-inference -o yaml
# Verify pods match the InferencePool selector
kubectl get pods -n ai-inference -l app=llama-cpp-inference -o wide
Test direct pod access:
# For llama.cpp
kubectl port-forward -n ai-inference deployment/llama-cpp-inference 8000:8000
# In another terminal:
curl http://localhost:8000/v1/models
The Inference Extension ext_proc filter strips the Content-Length header. If your model server requires Content-Length, make sure you are sending requests over HTTPS and verify that the Gateway has the right TLS configuration.
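To double-check the listener TLS configuration and the referenced certificate secret:
# Show the TLS settings of the HTTPS listener
kubectl get gateway ai-gateway -n ai-inference \
  -o jsonpath='{.spec.listeners[?(@.name=="https")].tls}'
# The secret must exist in the same namespace and be of type kubernetes.io/tls
kubectl get secret ai-gateway-tls -n ai-inference -o jsonpath='{.type}'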
This guide was tested with:
Gateway API CRDs v1.2.1
kgateway v2.1.1 (with inferenceExtension.enabled=true) and the matching kgateway-crds chart
Gateway API Inference Extension v1.1.0 (CRDs and the InferencePool/Endpoint Picker Helm chart)
Tested inference servers:
llama.cpp (ghcr.io/ggerganov/llama.cpp:server) serving Llama-3.2-1B-Instruct (GGUF, Q4_K_M) on CPU
vLLM (vllm/vllm-openai) serving meta-llama/Llama-3.2-1B-Instruct on GPU