This page serves as a reference to the alerts that a standard Kubermatic Kubernetes Platform (KKP) monitoring setup can fire, alongside a short description and steps to debug.
Under Development
Group blackbox-exporter
HttpProbeFailed warning
probe_success != 1
Probing the blackbox-exporter target {{ $labels.instance }} failed.
HttpProbeSlow warning
sum by (instance) (probe_http_duration_seconds) > 3
{{ $labels.instance }} takes {{ $value }} seconds to respond.
Remediation steps:
- Check the target system’s resource usage for anomalies.
- Check if the target application has been recently rescheduled and is still settling.
HttpCertExpiresSoon warning
probe_ssl_earliest_cert_expiry - time() < 3*24*3600
The certificate for {{ $labels.instance }} expires in less than 3 days.
HttpCertExpiresVerySoon critical
probe_ssl_earliest_cert_expiry - time() < 24*3600
The certificate for {{ $labels.instance }} expires in less than 24 hours.
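The two certificate alerts above differ only in their threshold. As a rough illustration of the arithmetic (a Python sketch, not part of KKP; the function name `cert_expiry_severity` is made up for this example), the remaining lifetime can be classified like this:

```python
from typing import Optional

WARNING_THRESHOLD = 3 * 24 * 3600  # 3 days in seconds, as in HttpCertExpiresSoon
CRITICAL_THRESHOLD = 24 * 3600     # 24 hours in seconds, as in HttpCertExpiresVerySoon


def cert_expiry_severity(expiry_ts: float, now_ts: float) -> Optional[str]:
    """Return the most severe level whose threshold the remaining lifetime undercuts."""
    remaining = expiry_ts - now_ts  # mirrors probe_ssl_earliest_cert_expiry - time()
    if remaining < CRITICAL_THRESHOLD:
        return "critical"
    if remaining < WARNING_THRESHOLD:
        return "warning"
    return None
```

Note that in Prometheus both rules match once less than 24 hours remain; the sketch only reports the more severe one.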
Group cadvisor
CadvisorDown critical
absent(up{job="cadvisor"} == 1)
Cadvisor has disappeared from Prometheus target discovery.
Group cert-manager
CertManagerCertExpiresSoon warning
certmanager_certificate_expiration_timestamp_seconds - time() < 3*24*3600
The certificate {{ $labels.name }} expires in less than 3 days.
CertManagerCertExpiresVerySoon critical
certmanager_certificate_expiration_timestamp_seconds - time() < 24*3600
The certificate {{ $labels.name }} expires in less than 24 hours.
Group helm-exporter
HelmReleaseNotDeployed warning
helm_chart_info != 1
The Helm release {{ $labels.release }} ({{ $labels.chart }} chart in namespace {{ $labels.exported_namespace }}) in version {{ $labels.version }} has not been ready for more than 15 minutes.
Remediation steps:
- Run helm --tiller-namespace kubermatic-installer ls to check the installed Helm releases.
- If all releases have the status DEPLOYED, check the values.yaml flag helmExporter.tillerNamespace to make sure the helm-exporter is looking at the correct Tiller.
- If Helm cannot repair the chart automatically, delete/purge the chart (helm delete --purge [RELEASE]) and re-install it. Re-installing charts will not affect any existing data in existing PersistentVolumeClaims.
Group kube-apiserver
KubernetesApiserverDown critical
absent(up{job="apiserver"} == 1)
KubernetesApiserver has disappeared from Prometheus target discovery.
KubeAPILatencyHigh warning
cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 1
The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
KubeAPILatencyHigh critical
cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
KubeAPIErrorsHigh critical
sum(rate(apiserver_request_total{job="apiserver",code=~"^(?:5..)$"}[5m])) without(instance, pod)
/
sum(rate(apiserver_request_total{job="apiserver"}[5m])) without(instance, pod) * 100 > 10
API server is returning errors for {{ $value }}% of requests.
KubeAPIErrorsHigh warning
sum(rate(apiserver_request_total{job="apiserver",code=~"^(?:5..)$"}[5m])) without(instance, pod)
/
sum(rate(apiserver_request_total{job="apiserver"}[5m])) without(instance, pod) * 100 > 5
API server is returning errors for {{ $value }}% of requests.
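The KubeAPIErrorsHigh expressions divide the 5xx request rate by the total request rate and scale the result to a percentage. A minimal Python sketch of that calculation follows; the function names are hypothetical, and returning 0.0 for an idle API server is an assumption of this example (PromQL would yield NaN, which never satisfies the comparison).

```python
from typing import Optional


def api_error_percent(errors_per_sec: float, total_per_sec: float) -> float:
    """Share of 5xx responses among all requests, in percent."""
    if total_per_sec == 0:
        return 0.0  # assumption: treat "no traffic" as "no errors"
    return errors_per_sec / total_per_sec * 100


def api_error_severity(percent: float) -> Optional[str]:
    """Map the percentage to the alert levels used above (> 10 critical, > 5 warning)."""
    if percent > 10:
        return "critical"
    if percent > 5:
        return "warning"
    return None
```

For example, 1 failing request per second out of 16 gives 6.25%, which crosses only the warning threshold.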
KubeClientCertificateExpiration warning
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
and
histogram_quantile(0.01, sum by (job, instance, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800
A client certificate used to authenticate to the apiserver is expiring in less than 7 days.
Remediation steps:
- Check the Kubernetes documentation on how to renew certificates.
- If your certificate has already expired, the steps in the documentation might not work. Check GitHub for hints about fixing your cluster.
KubeClientCertificateExpiration critical
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
and
histogram_quantile(0.01, sum by (job, instance, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.
Remediation steps:
- Urgently renew your certificates. Expired certificates can make fixing the cluster difficult to begin with.
- Check the Kubernetes documentation on how to renew certificates.
- If your certificate has already expired, the steps in the documentation might not work. Check GitHub for hints about fixing your cluster.
Group kube-kubelet
KubeletDown critical
absent(up{job="kubelet"} == 1)
Kubelet has disappeared from Prometheus target discovery.
KubePersistentVolumeUsageCritical critical
100 * kubelet_volume_stats_available_bytes{job="kubelet"}
/
kubelet_volume_stats_capacity_bytes{job="kubelet"}
< 3
The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is only {{ printf "%0.0f" $value }}% free.
KubePersistentVolumeFullInFourDays critical
(
kubelet_volume_stats_used_bytes{job="kubelet"}
/
kubelet_volume_stats_capacity_bytes{job="kubelet"}
) > 0.85
and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0
Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value }} bytes are available.
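To make the prediction above more tangible, here is a small Python sketch of the idea behind `predict_linear`: fit a straight line through recent samples and extrapolate it into the future. It is an approximation for illustration only, not Prometheus's actual implementation, and the helper `pv_full_in_four_days` is a hypothetical name.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]], now: float, horizon: float) -> float:
    """Least-squares line through (timestamp, value) samples, evaluated at now + horizon.

    Requires at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + horizon)


def pv_full_in_four_days(available_samples: List[Tuple[float, float]],
                         used_bytes: float, capacity_bytes: float, now: float) -> bool:
    """Both conditions of the alert: over 85% used AND predicted to hit zero in 4 days."""
    mostly_full = used_bytes / capacity_bytes > 0.85
    predicted = predict_linear(available_samples, now, 4 * 24 * 3600)
    return mostly_full and predicted < 0
```

The usage check on `used_bytes / capacity_bytes` keeps the alert quiet for volumes that are shrinking quickly but still have plenty of room.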
KubeletTooManyPods warning
kubelet_running_pod_count{job="kubelet"} > 110 * 0.9
Kubelet {{ $labels.instance }} is running {{ $value }} pods, close to the limit of 110.
KubeClientErrors warning
(sum(rate(rest_client_requests_total{code=~"(5..|<error>)",job="kubelet"}[5m])) by (instance)
/
sum(rate(rest_client_requests_total{job="kubelet"}[5m])) by (instance))
* 100 > 1
The kubelet on {{ $labels.instance }} is experiencing {{ printf "%0.0f" $value }}% errors.
KubeClientErrors warning
(sum(rate(rest_client_requests_total{code=~"(5..|<error>)",job="pods"}[5m])) by (namespace, pod)
/
sum(rate(rest_client_requests_total{job="pods"}[5m])) by (namespace, pod))
* 100 > 1
The pod {{ $labels.namespace }}/{{ $labels.pod }} is experiencing {{ printf "%0.0f" $value }}% errors.
KubeletRuntimeErrors warning
sum(rate(kubelet_runtime_operations_errors_total{job="kubelet"}[5m])) by (instance) > 0.1
The kubelet on {{ $labels.instance }} is having an elevated error rate for container runtime operations.
KubeletCGroupManagerDurationHigh warning
sum(rate(kubelet_cgroup_manager_duration_seconds{quantile="0.9"}[5m])) by (instance) * 1000 > 1
The kubelet’s cgroup manager duration on {{ $labels.instance }} has been elevated ({{ printf "%0.2f" $value }}ms) for more than 15 minutes.
KubeletPodWorkerDurationHigh warning
sum(rate(kubelet_pod_worker_duration_seconds{quantile="0.9"}[5m])) by (instance, operation_type) * 1000 > 250
The kubelet’s pod worker duration for {{ $labels.operation_type }} operations on {{ $labels.instance }} has been elevated ({{ printf "%0.2f" $value }}ms) for more than 15 minutes.
KubeVersionMismatch warning
count(count(kubernetes_build_info{job!="dns"}) by (gitVersion)) > 1
There are {{ $value }} different versions of Kubernetes components running.
Group kube-state-metrics
KubeStateMetricsDown critical
absent(up{job="kube-state-metrics"} == 1)
KubeStateMetrics has disappeared from Prometheus target discovery.
KubePodCrashLooping critical
rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0
Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
Remediation steps:
KubePodNotReady critical
sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}) > 0
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.
Remediation steps:
- Check the pod via kubectl describe pod [POD] to find out about scheduling issues.
KubeDeploymentGenerationMismatch critical
kube_deployment_status_observed_generation{job="kube-state-metrics"}
!=
kube_deployment_metadata_generation{job="kube-state-metrics"}
Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match; this indicates that the Deployment has failed but has not been rolled back.
KubeDeploymentReplicasMismatch critical
kube_deployment_spec_replicas{job="kube-state-metrics"}
!=
kube_deployment_status_replicas_available{job="kube-state-metrics"}
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than an hour.
KubeStatefulSetReplicasMismatch critical
kube_statefulset_status_replicas_ready{job="kube-state-metrics"}
!=
kube_statefulset_status_replicas{job="kube-state-metrics"}
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.
KubeStatefulSetGenerationMismatch critical
kube_statefulset_status_observed_generation{job="kube-state-metrics"}
!=
kube_statefulset_metadata_generation{job="kube-state-metrics"}
StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match; this indicates that the StatefulSet has failed but has not been rolled back.
KubeStatefulSetUpdateNotRolledOut critical
max without (revision) (
kube_statefulset_status_current_revision{job="kube-state-metrics"}
unless
kube_statefulset_status_update_revision{job="kube-state-metrics"}
)
*
(
kube_statefulset_replicas{job="kube-state-metrics"}
!=
kube_statefulset_status_replicas_updated{job="kube-state-metrics"}
)
StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.
KubeDaemonSetRolloutStuck critical
kube_daemonset_status_number_ready{job="kube-state-metrics"}
/
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} * 100 < 100
Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.
KubeDaemonSetNotScheduled warning
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
-
kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0
{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.
KubeDaemonSetMisScheduled warning
kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.
KubeCronJobRunning warning
time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.
KubeJobCompletion warning
kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0
Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.
KubeJobFailed warning
kube_job_status_failed{job="kube-state-metrics"} > 0
Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
KubeCPUOvercommit warning
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="requests.cpu"})
/
sum(node:node_num_cpu:sum)
> 1.5
Cluster has overcommitted CPU resource requests for namespaces.
KubeCPUOvercommit warning
sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)
/
sum(node:node_num_cpu:sum)
>
(count(node:node_num_cpu:sum)-1) / count(node:node_num_cpu:sum)
Cluster has overcommitted CPU resource requests for pods and cannot tolerate node failure.
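The threshold `(N-1)/N` in the expression above is the share of capacity that remains if one of `N` nodes is lost. The following Python sketch works through that arithmetic; the function name is hypothetical, and reading the threshold as "one average-sized node fails" is an interpretation of the expression rather than something the rule states.

```python
from typing import List


def cpu_overcommitted_for_node_failure(requested_cores: float,
                                       cores_per_node: List[float]) -> bool:
    """True when pod CPU requests exceed what N-1 (average-sized) nodes could hold."""
    total_cores = sum(cores_per_node)       # sum(node:node_num_cpu:sum)
    node_count = len(cores_per_node)        # count(node:node_num_cpu:sum)
    threshold = (node_count - 1) / node_count
    return requested_cores / total_cores > threshold
```

With three 4-core nodes the threshold is 2/3, so 9 requested cores (75%) would trigger the alert while 7 cores (about 58%) would not.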
KubeMemOvercommit warning
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="requests.memory"})
/
sum(node_memory_MemTotal_bytes{app="node-exporter"})
> 1.5
Cluster has overcommitted memory resource requests for namespaces.
KubeMemOvercommit warning
sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)
/
sum(node_memory_MemTotal_bytes)
>
(count(node:node_num_cpu:sum)-1)
/
count(node:node_num_cpu:sum)
Cluster has overcommitted memory resource requests for pods and cannot tolerate node failure.
KubeQuotaExceeded warning
100 * kube_resourcequota{job="kube-state-metrics", type="used"}
/ ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
> 90
Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.
KubePodOOMKilled warning
(kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 30m >= 2)
and
ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[30m]) == 1
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 30 minutes.
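The KubePodOOMKilled expression combines two conditions: the restart counter grew by at least 2 within 30 minutes, and the last termination reason was OOMKilled for the whole window. A simplified Python sketch of that logic, with a hypothetical function name and plain lists standing in for Prometheus samples:

```python
from typing import Sequence


def oom_killed_alert(restarts_now: int, restarts_30m_ago: int,
                     oom_reason_samples: Sequence[int]) -> bool:
    """Evaluate both halves of the KubePodOOMKilled condition for one container."""
    # restarts_total - restarts_total offset 30m >= 2
    restarted_enough = restarts_now - restarts_30m_ago >= 2
    # min_over_time(...{reason="OOMKilled"}[30m]) == 1 holds only if every sample was 1
    always_oom = len(oom_reason_samples) > 0 and min(oom_reason_samples) == 1
    return restarted_enough and always_oom
```

A single sample of 0 in the window is enough to keep the alert from firing, which filters out containers that were also terminated for other reasons.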
KubeNodeNotReady warning
kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
{{ $labels.node }} has been unready for more than an hour.
Group node-exporter
NodeFilesystemSpaceFillingUp warning
predict_linear(node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"}[6h], 24*60*60) < 0
and
node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} < 0.4
and
node_filesystem_readonly_bytes{app="node-exporter",fstype=~"ext.|xfs"} == 0
Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.
NodeFilesystemSpaceFillingUp critical
predict_linear(node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"}[6h], 4*60*60) < 0
and
node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} < 0.2
and
node_filesystem_readonly_bytes{app="node-exporter",fstype=~"ext.|xfs"} == 0
Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 4 hours.
NodeFilesystemOutOfSpace warning
node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 5
and
node_filesystem_readonly_bytes{app="node-exporter",fstype=~"ext.|xfs"} == 0
Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available space left.
NodeFilesystemOutOfSpace critical
node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 3
and
node_filesystem_readonly_bytes{app="node-exporter",fstype=~"ext.|xfs"} == 0
Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available space left.
NodeFilesystemFilesFillingUp warning
predict_linear(node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"}[6h], 24*60*60) < 0
and
node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} < 0.4
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0
Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of files within the next 24 hours.
NodeFilesystemFilesFillingUp warning
predict_linear(node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"}[6h], 4*60*60) < 0
and
node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} < 0.2
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0
Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of files within the next 4 hours.
NodeFilesystemOutOfFiles warning
node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 5
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0
Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available inodes left.
NodeFilesystemOutOfSpace critical
node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 3
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0
Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available space left.
NodeNetworkReceiveErrs critical
increase(node_network_receive_errs_total[2m]) > 10
{{ $labels.instance }} interface {{ $labels.device }} shows errors while receiving packets ({{ $value }} errors in two minutes).
NodeNetworkTransmitErrs critical
increase(node_network_transmit_errs_total[2m]) > 10
{{ $labels.instance }} interface {{ $labels.device }} shows errors while transmitting packets ({{ $value }} errors in two minutes).
Group prometheus
PromScrapeFailed warning
up != 1
Prometheus failed to scrape a target {{ $labels.job }} / {{ $labels.instance }}.
Remediation steps:
- Check the Prometheus Service Discovery page to find out why the target is unreachable.
PromBadConfig critical
prometheus_config_last_reload_successful{job="prometheus"} == 0
Prometheus failed to reload config.
Remediation steps:
- Run kubectl -n monitoring logs prometheus-0 (and likewise for prometheus-1) to check the Prometheus pods’ logs.
- Run kubectl -n monitoring get configmap prometheus-rules -o yaml to check the prometheus-rules configmap.
PromAlertmanagerBadConfig critical
alertmanager_config_last_reload_successful{job="alertmanager"} == 0
Alertmanager failed to reload config.
Remediation steps:
- Run kubectl -n monitoring logs alertmanager-0 (and likewise for alertmanager-1 and alertmanager-2) to check the Alertmanager pods’ logs.
- Run kubectl -n monitoring get secret alertmanager -o yaml to check the alertmanager secret.
PromAlertsFailed critical
sum(increase(alertmanager_notifications_failed_total{job="alertmanager"}[5m])) by (namespace) > 0
Alertmanager failed to send an alert.
Remediation steps:
- Run kubectl -n monitoring logs prometheus-0 (and likewise for prometheus-1) to check the Prometheus pods’ logs.
- Run kubectl -n monitoring get pods to make sure the Alertmanager StatefulSet is running.
PromRemoteStorageFailures critical
(rate(prometheus_remote_storage_failed_samples_total{job="prometheus"}[1m]) * 100)
/
(rate(prometheus_remote_storage_failed_samples_total{job="prometheus"}[1m]) + rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus"}[1m]))
> 1
Prometheus failed to send {{ printf "%.1f" $value }}% samples.
Remediation steps:
- Ensure that the Prometheus volume has not reached capacity.
- Run kubectl -n monitoring logs prometheus-0 (and likewise for prometheus-1) to check the Prometheus pods’ logs.
PromRuleFailures critical
rate(prometheus_rule_evaluation_failures_total{job="prometheus"}[1m]) > 0
Prometheus failed to evaluate {{ printf "%.1f" $value }} rules/sec.
Remediation steps:
- Run kubectl -n monitoring logs prometheus-0 (and likewise for prometheus-1) to check the Prometheus pods’ logs.
- Check CPU/memory pressure on the node.
Group thanos
ThanosSidecarDown warning
thanos_sidecar_prometheus_up != 1
The Thanos sidecar in {{ $labels.namespace }}/{{ $labels.pod }} is down.
ThanosSidecarNoHeartbeat warning
time() - thanos_sidecar_last_heartbeat_success_time_seconds > 60
The Thanos sidecar in {{ $labels.namespace }}/{{ $labels.pod }} didn’t send a heartbeat in {{ $value }} seconds.
ThanosCompactorManyRetries warning
sum(rate(thanos_compact_retries_total[5m])) > 0.01
The Thanos compactor in {{ $labels.namespace }} is experiencing a high retry rate.
Remediation steps:
- Check the thanos-compact pod’s logs.
ThanosShipperManyDirSyncFailures warning
sum(rate(thanos_shipper_dir_sync_failures_total[5m])) > 0.01
The Thanos shipper in {{ $labels.namespace }}/{{ $labels.pod }} is experiencing a high dir-sync failure rate.
Remediation steps:
- Check the thanos container’s logs inside the Prometheus pod.
ThanosManyPanicRecoveries warning
sum(rate(thanos_grpc_req_panics_recovered_total[5m])) > 0.01
The Thanos component in {{ $labels.namespace }}/{{ $labels.pod }} is experiencing a high panic recovery rate.
ThanosManyBlockLoadFailures warning
sum(rate(thanos_bucket_store_block_load_failures_total[5m])) > 0.01
The Thanos store in {{ $labels.namespace }}/{{ $labels.pod }} is experiencing many failed block loads.
ThanosManyBlockDropFailures warning
sum(rate(thanos_bucket_store_block_drop_failures_total[5m])) > 0.01
The Thanos store in {{ $labels.namespace }}/{{ $labels.pod }} is experiencing many failed block drops.
Group velero
time() - velero_backup_last_successful_timestamp{schedule!=""} > 3600
Last backup with schedule {{ $labels.schedule }} has not finished successfully within 60min.
Remediation steps:
- Run velero -n velero backup get to check if a backup is really in “InProgress” state.
- Run velero -n velero backup logs [BACKUP_NAME] to check the backup logs.
- Depending on the backup, find the pod and check the processes inside that pod or any sidecar containers.
VeleroNoRecentBackup critical
time() - velero_backup_last_successful_timestamp{schedule!=""} > 3600*25
There has not been a successful backup for schedule {{ $labels.schedule }} in the last 24 hours.
Remediation steps:
- Run velero -n velero backup get to check whether really no backups happened.
- If a backup failed, run velero -n velero backup logs [BACKUP_NAME] to check its logs.
- If a backup was not even triggered, run kubectl -n velero logs -l 'name=velero-server' to check the Velero server’s logs.
- Make sure the Velero server pod has not been rescheduled and possibly opt to schedule it on a stable node using a node affinity.
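Both Velero expressions compare the age of the last successful backup against a fixed number of seconds. The Python sketch below spells out that comparison; the function and dictionary key names are made up for illustration. Note that `3600*25` is 25 hours, so, reading the expression literally, VeleroNoRecentBackup leaves one hour of slack beyond the 24 hours mentioned in its message.

```python
from typing import Dict

SLOW_BACKUP_AFTER = 3600            # 60 minutes, first Velero rule
NO_RECENT_BACKUP_AFTER = 3600 * 25  # 25 hours, VeleroNoRecentBackup


def backup_age_checks(last_success_ts: float, now_ts: float) -> Dict[str, bool]:
    """Evaluate both thresholds against the age of the last successful backup."""
    age = now_ts - last_success_ts  # time() - velero_backup_last_successful_timestamp
    return {
        "older_than_60min": age > SLOW_BACKUP_AFTER,
        "no_recent_backup": age > NO_RECENT_BACKUP_AFTER,
    }
```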
Group kubermatic
KubermaticAPIDown critical
absent(up{job="pods",namespace="kubermatic",app_kubernetes_io_name="kubermatic-api"} == 1)
KubermaticAPI has disappeared from Prometheus target discovery.
Remediation steps:
- Check the Prometheus Service Discovery page to find out why the target is unreachable.
- Check the API pod’s logs and ensure that it is not crashlooping.
KubermaticAPITooManyErrors warning
sum(rate(http_requests_total{app_kubernetes_io_name="kubermatic-api",code=~"5.."}[5m])) > 0.1
Kubermatic API is returning a high rate of HTTP 5xx responses.
Remediation steps:
- Check the API pod’s logs.
KubermaticAPITooManyInitNodeDeloymentFailures warning
sum(rate(kubermatic_api_init_node_deployment_failures[5m])) > 0.01
Kubermatic API is failing to create too many initial node deployments.
KubermaticMasterControllerManagerDown critical
absent(up{job="pods",namespace="kubermatic",app_kubernetes_io_name="kubermatic-master-controller-manager"} == 1)
Kubermatic Master Controller Manager has disappeared from Prometheus target discovery.
Remediation steps:
- Check the Prometheus Service Discovery page to find out why the target is unreachable.
- Check the master-controller-manager pod’s logs and ensure that it is not crashlooping.
Group kubermatic
KubermaticTooManyUnhandledErrors warning
sum(rate(kubermatic_controller_manager_unhandled_errors_total[5m])) > 0.01
Kubermatic controller manager in {{ $labels.namespace }} is experiencing too many errors.
Remediation steps:
- Check the controller-manager pod’s logs.
(time() - max by (cluster) (kubermatic_cluster_deleted)) > 30*60
Cluster {{ $labels.cluster }} is stuck in deletion for more than 30min.
Remediation steps:
- Check the machine-controller’s logs via kubectl -n cluster-XYZ logs -l 'app=machine-controller' for errors related to cloud provider integrations. Expired credentials or manually deleted cloud provider resources are common reasons for failing deletions.
- Run kubectl describe cluster XYZ to check the cluster’s status itself.
- If all resources have been cleaned up, remove the blocking finalizer (e.g. kubermatic.io/delete-nodes) from the cluster resource.
- If nothing else helps, manually delete the cluster namespace as a last resort.
(time() - max by (cluster,addon) (kubermatic_addon_deleted)) > 30*60
Addon {{ $labels.addon }} in cluster {{ $labels.cluster }} is stuck in deletion for more than 30min.
Remediation steps:
- Check the kubermatic controller-manager’s logs via kubectl -n kubermatic logs -l 'app.kubernetes.io/name=kubermatic-seed-controller-manager' for errors related to deletion of the addon. Manually deleted resources inside the user cluster are a common reason for failing deletions.
- If all resources of the addon inside the user cluster have been cleaned up, remove the blocking finalizer (e.g. cleanup-manifests) from the addon resource.
kubermatic_addon_reconcile_failed * on(cluster) group_left() kubermatic_cluster_created
- kubermatic_addon_reconcile_failed * on(cluster) group_left() kubermatic_cluster_deleted
> 0
Addon {{ $labels.addon }} in cluster {{ $labels.cluster }} has no related resources created for more than 30min.
Remediation steps:
- Check the kubermatic seed controller-manager’s logs via kubectl -n kubermatic logs -l 'app.kubernetes.io/name=kubermatic-seed-controller-manager' for errors related to reconciliation of the addon.
KubermaticSeedControllerManagerDown critical
absent(up{job="pods",namespace="kubermatic",app_kubernetes_io_name="kubermatic-seed-controller-manager"} == 1)
Kubermatic Seed Controller Manager has disappeared from Prometheus target discovery.
Remediation steps:
- Check the Prometheus Service Discovery page to find out why the target is unreachable.
- Check the seed-controller-manager pod’s logs and ensure that it is not crashlooping.
OpenVPNServerDown critical
absent(kube_deployment_status_replicas_available{cluster!="",deployment="openvpn-server"} > 0) and count(kubermatic_cluster_info) > 0
There is no healthy OpenVPN server in cluster {{ $labels.cluster }}.
UserClusterPrometheusAbsent critical
(
kubermatic_cluster_info * on (name) group_left
label_replace(up{job="clusters"}, "name", "$1", "namespace", "cluster-(.+)")
or
kubermatic_cluster_info * 0
) == 0
There is no Prometheus in cluster {{ $labels.name }}.
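The `label_replace` call in the expression above derives a `name` label from the `namespace` label so that the `up` series can be joined with `kubermatic_cluster_info`. The Python sketch below imitates that behaviour for illustration; it uses Python's `re` module rather than the RE2 engine Prometheus uses, and the sample label values are invented.

```python
import re
from typing import Dict


def label_replace(labels: Dict[str, str], dst: str, replacement: str,
                  src: str, regex: str) -> Dict[str, str]:
    """Return a copy of `labels` with `dst` set when `regex` matches the whole `src` value."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: the labels stay unchanged
    # Expand $0, $1, ... in the replacement to the corresponding capture groups.
    value = re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), replacement)
    result = dict(labels)
    result[dst] = value
    return result


# Hypothetical series labels for a user cluster's Prometheus target.
series = {"job": "clusters", "namespace": "cluster-abc123"}
print(label_replace(series, "name", "$1", "namespace", "cluster-(.+)"))
```

Series whose namespace does not start with `cluster-` receive no `name` label and therefore do not take part in the join.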
KubermaticClusterPaused informational
label_replace(kubermatic_cluster_info{pause="true"}, "cluster", "$0", "name", ".+")
Cluster {{ $labels.name }} has been paused and will not be reconciled until the pause flag is reset.
Group kube-controller-manager
KubeControllerManagerDown critical
absent(:ready_kube_controller_managers:sum) or :ready_kube_controller_managers:sum == 0
No healthy controller-manager pods exist inside the cluster.
Group kube-scheduler
KubeSchedulerDown critical
absent(:ready_kube_schedulers:sum) or :ready_kube_schedulers:sum == 0
No healthy scheduler pods exist inside the cluster.