Runbook

This page is a reference for the alerts that a standard Kubermatic Kubernetes Platform (KKP) monitoring setup can fire, each with a short description and steps to debug it.

Group blackbox-exporter

HttpProbeFailed warning

probe_success != 1

Probing the blackbox-exporter target {{ $labels.instance }} failed.

HttpProbeSlow warning

sum by (instance) (probe_http_duration_seconds) > 3

{{ $labels.instance }} takes {{ $value }} seconds to respond.

Remediation steps:

Check the target system’s resource usage for anomalies.
Check if the target application has been recently rescheduled and is still settling.

HttpCertExpiresSoon warning

probe_ssl_earliest_cert_expiry - time() < 3*24*3600

The certificate for {{ $labels.instance }} expires in less than 3 days.

HttpCertExpiresVerySoon critical

probe_ssl_earliest_cert_expiry - time() < 24*3600

The certificate for {{ $labels.instance }} expires in less than 24 hours.
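
The two certificate alerts differ only in the threshold applied to the remaining lifetime (expiry timestamp minus current time). A minimal Python sketch of that comparison follows; the function name and the constants' names are illustrative and not part of KKP or the blackbox-exporter.

```python
# Thresholds taken from the two alert expressions above.
WARNING_SECONDS = 3 * 24 * 3600   # 259200 seconds, HttpCertExpiresSoon
CRITICAL_SECONDS = 24 * 3600      # 86400 seconds, HttpCertExpiresVerySoon


def cert_alert_severity(expiry_timestamp, now):
    """Return the most severe alert matching the remaining lifetime, or None.

    Both arguments are Unix timestamps in seconds, mirroring
    probe_ssl_earliest_cert_expiry and time().
    """
    remaining = expiry_timestamp - now
    if remaining < CRITICAL_SECONDS:
        return "critical"
    if remaining < WARNING_SECONDS:
        return "warning"
    return None
```

Note that a certificate with less than 24 hours left satisfies both expressions, so both alerts fire; the sketch only reports the more severe one.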

Group cadvisor

CadvisorDown critical

absent(up{job="cadvisor"} == 1)

Cadvisor has disappeared from Prometheus target discovery.

Group cert-manager

CertManagerCertExpiresSoon warning

certmanager_certificate_expiration_timestamp_seconds - time() < 3*24*3600

The certificate {{ $labels.name }} expires in less than 3 days.

CertManagerCertExpiresVerySoon critical

certmanager_certificate_expiration_timestamp_seconds - time() < 24*3600

The certificate {{ $labels.name }} expires in less than 24 hours.

Group elasticsearch

ElasticsearchHeapTooHigh warning

elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9

The heap usage of Elasticsearch node {{ $labels.name }} is over 90%.

Remediation steps:

Check the pod’s logs for anomalies.
If it is a data node, check the shard allocation via http://es-data:9200/_cat/shards?v.

ElasticsearchClusterUnavailable warning

elasticsearch_cluster_health_up == 0

The Elasticsearch cluster health endpoint does not respond to scrapes.

ElasticsearchClusterUnhealthy critical

elasticsearch_cluster_health_status{color="green"} == 0

The Elasticsearch cluster is not healthy.

ElasticsearchUnassignedShards critical

elasticsearch_cluster_health_unassigned_shards > 0

There are {{ $value }} unassigned shards in the Elasticsearch cluster.

Remediation steps:

Check the shard allocation via http://es-data:9200/_cat/shards?v.

Group fluentbit

FluentbitManyFailedRetries warning

sum by (namespace, pod, node) (kube_pod_info) *
  on (namespace, pod)
  group_right (node)
  rate(fluentbit_output_retries_failed_total[1m]) > 0

Fluentbit pod {{ $labels.pod }} on {{ $labels.node }} is experiencing an elevated failed retry rate.

Remediation steps:

Ensure the target Elasticsearch cluster is healthy and accepts new documents (in certain conditions Elasticsearch clusters become read-only).
Ensure that Retry_Limit is not set to False (infinite) to prevent unprocessable logs from stopping the ingestion of new logs.

FluentbitManyOutputErrors warning

sum by (namespace, pod, node) (kube_pod_info) *
  on (namespace, pod)
  group_right (node)
  rate(fluentbit_output_errors_total[1m]) > 0

Fluentbit pod {{ $labels.pod }} on {{ $labels.node }} is experiencing an elevated output error rate.

Remediation steps:

Ensure the target Elasticsearch cluster is healthy and accepts new documents (in certain conditions Elasticsearch clusters become read-only).
Ensure that Retry_Limit is not set to False (infinite) to prevent unprocessable logs from stopping the ingestion of new logs.

FluentbitNotProcessingNewLogs warning

sum by (namespace, pod, node) (kube_pod_info) *
  on (namespace, pod)
  group_right (node)
  rate(fluentbit_output_proc_records_total[1m]) == 0

Fluentbit pod {{ $labels.pod }} on {{ $labels.node }} has not processed any new logs for the last 30 minutes.

Remediation steps:

Check whether any other log-generating pods are running on the same node.

Group helm-exporter

HelmReleaseNotDeployed warning

helm_chart_info != 1

The Helm release {{ $labels.release }} ({{ $labels.chart }} chart in namespace {{ $labels.exported_namespace }}) in version {{ $labels.version }} has not been ready for more than 15 minutes.

Remediation steps:

Check the installed Helm releases via helm --tiller-namespace kubermatic-installer ls.
If all releases have the status DEPLOYED, make sure the helm-exporter is looking at the correct Tiller by checking the values.yaml flag helmExporter.tillerNamespace.
If Helm cannot repair the chart automatically, delete/purge the chart (helm delete --purge [RELEASE]) and re-install it. Re-installing charts will not affect any existing data in existing PersistentVolumeClaims.

Group kube-apiserver

KubernetesApiserverDown critical

absent(up{job="apiserver"} == 1)

KubernetesApiserver has disappeared from Prometheus target discovery.

KubeAPILatencyHigh warning

cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 1

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.

KubeAPILatencyHigh critical

cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.

KubeAPIErrorsHigh critical

sum(rate(apiserver_request_total{job="apiserver",code=~"^(?:5..)$"}[5m])) without(instance, pod)
  /
sum(rate(apiserver_request_total{job="apiserver"}[5m])) without(instance, pod) * 100 > 10

API server is returning errors for {{ $value }}% of requests.

KubeAPIErrorsHigh warning

sum(rate(apiserver_request_total{job="apiserver",code=~"^(?:5..)$"}[5m])) without(instance, pod)
  /
sum(rate(apiserver_request_total{job="apiserver"}[5m])) without(instance, pod) * 100 > 5

API server is returning errors for {{ $value }}% of requests.
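
The two KubeAPIErrorsHigh rules compute the same percentage and differ only in the threshold (10 for critical, 5 for warning). As a rough sketch, with hypothetical function names and with the per-second request rates passed in as plain numbers, the arithmetic looks like this:

```python
def api_error_percent(server_error_rate, total_rate):
    """Share of requests answered with a 5xx code, in percent.

    Returning 0.0 for an idle API server is an assumption of this sketch;
    in PromQL the division would yield NaN, which never passes the comparison.
    """
    if total_rate == 0:
        return 0.0
    return server_error_rate / total_rate * 100


def api_error_severity(percent):
    """Map the percentage to the severity of the rule that would fire."""
    if percent > 10:
        return "critical"
    if percent > 5:
        return "warning"
    return None
```

For example, 2 failing requests per second out of 32 in total is 6.25%, which exceeds the warning threshold but not the critical one.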

KubeClientCertificateExpiration warning

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
and
histogram_quantile(0.01, sum by (job, instance, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800

A client certificate used to authenticate to the apiserver is expiring in less than 7 days.

Remediation steps:

Check the Kubernetes documentation on how to renew certificates.
If your certificate has already expired, the steps in the documentation might not work. Check GitHub for hints about fixing your cluster.

KubeClientCertificateExpiration critical

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
and
histogram_quantile(0.01, sum by (job, instance, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400

A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.

Remediation steps:

Urgently renew your certificates. Expired certificates can make fixing the cluster difficult to begin with.
Check the Kubernetes documentation on how to renew certificates.
If your certificate has already expired, the steps in the documentation might not work. Check GitHub for hints about fixing your cluster.

Group kube-kubelet

KubeletDown critical

absent(up{job="kubelet"} == 1)

Kubelet has disappeared from Prometheus target discovery.

KubePersistentVolumeUsageCritical critical

100 * kubelet_volume_stats_available_bytes{job="kubelet"}
  /
kubelet_volume_stats_capacity_bytes{job="kubelet"}
  < 3

The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is only {{ printf "%0.0f" $value }}% free.

KubePersistentVolumeFullInFourDays critical

(
  kubelet_volume_stats_used_bytes{job="kubelet"}
    /
  kubelet_volume_stats_capacity_bytes{job="kubelet"}
) > 0.85
and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0

Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value }} bytes are available.
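
The second half of this expression relies on predict_linear, which fits a simple linear regression to the samples in the range and extrapolates it into the future. The following Python sketch illustrates the idea under simplifying assumptions; it is not Prometheus' implementation, and the sample data is hypothetical.

```python
def predict_linear(samples, now, horizon_seconds):
    """Fit a least-squares line to (timestamp, value) pairs and
    evaluate it horizon_seconds after now."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (now + horizon_seconds) + intercept


# Hypothetical volume: 20 GB free six hours ago, losing 1 GB per hour since.
samples = [(hour * 3600, 20e9 - hour * 1e9) for hour in range(7)]
predicted = predict_linear(samples, now=6 * 3600, horizon_seconds=4 * 24 * 3600)
print(predicted < 0)  # whether the predict_linear(...) < 0 condition holds
```

The alert only fires if, in addition to a negative prediction, the volume is already more than 85% full, which filters out volumes that are shrinking quickly but still have plenty of space.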

KubeletTooManyPods warning

kubelet_running_pod_count{job="kubelet"} > 110 * 0.9

Kubelet {{ $labels.instance }} is running {{ $value }} pods, close to the limit of 110.

KubeClientErrors warning

(sum(rate(rest_client_requests_total{code=~"(5..|<error>)",job="kubelet"}[5m])) by (instance)
  /
sum(rate(rest_client_requests_total{job="kubelet"}[5m])) by (instance))
* 100 > 1

The kubelet on {{ $labels.instance }} is experiencing {{ printf "%0.0f" $value }}% errors.

KubeClientErrors warning

(sum(rate(rest_client_requests_total{code=~"(5..|<error>)",job="pods"}[5m])) by (namespace, pod)
  /
sum(rate(rest_client_requests_total{job="pods"}[5m])) by (namespace, pod))
* 100 > 1

The pod {{ $labels.namespace }}/{{ $labels.pod }} is experiencing {{ printf "%0.0f" $value }}% errors.

KubeletRuntimeErrors warning

sum(rate(kubelet_runtime_operations_errors_total{job="kubelet"}[5m])) by (instance) > 0.1

The kubelet on {{ $labels.instance }} is having an elevated error rate for container runtime operations.

KubeletCGroupManagerDurationHigh warning

sum(rate(kubelet_cgroup_manager_duration_seconds{quantile="0.9"}[5m])) by (instance) * 1000 > 1

The kubelet’s cgroup manager duration on {{ $labels.instance }} has been elevated ({{ printf "%0.2f" $value }}ms) for more than 15 minutes.

KubeletPodWorkerDurationHigh warning

sum(rate(kubelet_pod_worker_duration_seconds{quantile="0.9"}[5m])) by (instance, operation_type) * 1000 > 250

The kubelet’s pod worker duration for {{ $labels.operation_type }} operations on {{ $labels.instance }} has been elevated ({{ printf "%0.2f" $value }}ms) for more than 15 minutes.

KubeVersionMismatch warning

count(count(kubernetes_build_info{job!="dns"}) by (gitVersion)) > 1

There are {{ $value }} different versions of Kubernetes components running.

Group kube-state-metrics

KubeStateMetricsDown critical

absent(up{job="kube-state-metrics"} == 1)

KubeStateMetrics has disappeared from Prometheus target discovery.

KubePodCrashLooping critical

rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0

Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.

Remediation steps:

Check the pod’s logs.
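
The KubePodCrashLooping expression multiplies a per-second rate by 60 * 5 so that the reported value reads as restarts per five minutes. A small Python sketch of that unit conversion, using a hypothetical helper name and ignoring the extrapolation that Prometheus' rate() applies at the window boundaries:

```python
def restarts_per_five_minutes(restart_delta, window_seconds=15 * 60):
    """Approximate rate(...[15m]) * 60 * 5 from a counter increase.

    restart_delta is how much the restart counter grew inside the window.
    """
    per_second = restart_delta / window_seconds
    return per_second * 60 * 5
```

Three restarts within the 15-minute window therefore show up as roughly one restart per five minutes, and any non-zero value satisfies the > 0 condition.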

KubePodNotReady critical

sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}) > 0

Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.

Remediation steps:

Check the pod via kubectl describe pod [POD] to find out about scheduling issues.

KubeDeploymentGenerationMismatch critical

kube_deployment_status_observed_generation{job="kube-state-metrics"}
  !=
kube_deployment_metadata_generation{job="kube-state-metrics"}

Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match; this indicates that the Deployment has failed but has not been rolled back.

KubeDeploymentReplicasMismatch critical

kube_deployment_spec_replicas{job="kube-state-metrics"}
  !=
kube_deployment_status_replicas_available{job="kube-state-metrics"}

Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than an hour.

KubeStatefulSetReplicasMismatch critical

kube_statefulset_status_replicas_ready{job="kube-state-metrics"}
  !=
kube_statefulset_status_replicas{job="kube-state-metrics"}

StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.

KubeStatefulSetGenerationMismatch critical

kube_statefulset_status_observed_generation{job="kube-state-metrics"}
  !=
kube_statefulset_metadata_generation{job="kube-state-metrics"}

StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match; this indicates that the StatefulSet has failed but has not been rolled back.

KubeStatefulSetUpdateNotRolledOut critical

max without (revision) (
  kube_statefulset_status_current_revision{job="kube-state-metrics"}
    unless
  kube_statefulset_status_update_revision{job="kube-state-metrics"}
)
  *
(
  kube_statefulset_replicas{job="kube-state-metrics"}
    !=
  kube_statefulset_status_replicas_updated{job="kube-state-metrics"}
)

StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.

KubeDaemonSetRolloutStuck critical

kube_daemonset_status_number_ready{job="kube-state-metrics"}
  /
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} * 100 < 100

Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.

KubeDaemonSetNotScheduled warning

kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
  -
kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0

{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.

KubeDaemonSetMisScheduled warning

kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0

{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.

KubeCronJobRunning warning

time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600

CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.

KubeJobCompletion warning

kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0

Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.

KubeJobFailed warning

kube_job_status_failed{job="kube-state-metrics"} > 0

Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.

KubeCPUOvercommit warning

sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="requests.cpu"})
  /
sum(node:node_num_cpu:sum)
  > 1.5

Cluster has overcommitted CPU resource requests for namespaces.

KubeCPUOvercommit warning

sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)
  /
sum(node:node_num_cpu:sum)
  >
(count(node:node_num_cpu:sum)-1) / count(node:node_num_cpu:sum)

Cluster has overcommitted CPU resource requests for pods and cannot tolerate node failure.

KubeMemOvercommit warning

sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="requests.memory"})
  /
sum(node_memory_MemTotal_bytes{app="node-exporter"})
  > 1.5

Cluster has overcommitted memory resource requests for namespaces.

KubeMemOvercommit warning

sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)
  /
sum(node_memory_MemTotal_bytes)
  >
(count(node:node_num_cpu:sum)-1)
  /
count(node:node_num_cpu:sum)

Cluster has overcommitted memory resource requests for pods and cannot tolerate node failure.
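
Both "cannot tolerate node failure" rules (CPU and memory) compare the ratio of requested to available resources against (n-1)/n, where n is the number of nodes. The Python sketch below spells out that threshold; the function names are illustrative, and like the rules themselves it implicitly assumes nodes of roughly equal size.

```python
def node_failure_threshold(node_count):
    """Fraction of total capacity that remains if one of node_count nodes is lost."""
    return (node_count - 1) / node_count


def is_overcommitted(total_requests, total_capacity, node_count):
    """True if the requests would no longer fit after losing one node."""
    return total_requests / total_capacity > node_failure_threshold(node_count)
```

With four nodes the threshold is 0.75: requesting 13 of 16 cores trips the alert, while requesting exactly 12 of 16 does not.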

KubeQuotaExceeded warning

100 * kube_resourcequota{job="kube-state-metrics", type="used"}
  / ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
  > 90

Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.

KubePodOOMKilled warning

(kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 30m >= 2)
and
ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[30m]) == 1

Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 30 minutes.

KubeNodeNotReady warning

kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0

{{ $labels.node }} has been unready for more than an hour.

Group node-exporter

NodeFilesystemSpaceFillingUp warning

predict_linear(node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"}[6h], 24*60*60) < 0
and
node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} < 0.4
and
node_filesystem_readonly_bytes{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.

NodeFilesystemSpaceFillingUp critical

predict_linear(node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"}[6h], 4*60*60) < 0
and
node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} < 0.2
and
node_filesystem_readonly_bytes{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 4 hours.

NodeFilesystemOutOfSpace warning

node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 5
and
node_filesystem_readonly_bytes{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available space left.

NodeFilesystemOutOfSpace critical

node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 3
and
node_filesystem_readonly_bytes{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available space left.

NodeFilesystemFilesFillingUp warning

predict_linear(node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"}[6h], 24*60*60) < 0
and
node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} < 0.4
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of files within the next 24 hours.

NodeFilesystemFilesFillingUp warning

predict_linear(node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"}[6h], 4*60*60) < 0
and
node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} < 0.2
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of files within the next 4 hours.

NodeFilesystemOutOfFiles warning

node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 5
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available inodes left.

NodeFilesystemOutOfSpace critical

node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 3
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available space left.

NodeNetworkReceiveErrs critical

increase(node_network_receive_errs_total[2m]) > 10

{{ $labels.instance }} interface {{ $labels.device }} shows errors while receiving packets ({{ $value }} errors in two minutes).

NodeNetworkTransmitErrs critical

increase(node_network_transmit_errs_total[2m]) > 10

{{ $labels.instance }} interface {{ $labels.device }} shows errors while transmitting packets ({{ $value }} errors in two minutes).

Group prometheus

PromScrapeFailed warning

up != 1

Prometheus failed to scrape a target {{ $labels.job }} / {{ $labels.instance }}.

Remediation steps:

Check the Prometheus Service Discovery page to find out why the target is unreachable.

PromBadConfig critical

prometheus_config_last_reload_successful{job="prometheus"} == 0

Prometheus failed to reload config.

Remediation steps:

Check the Prometheus pods’ logs via kubectl -n monitoring logs prometheus-0 (and likewise for prometheus-1).
Check the prometheus-rules configmap via kubectl -n monitoring get configmap prometheus-rules -o yaml.

PromAlertmanagerBadConfig critical

alertmanager_config_last_reload_successful{job="alertmanager"} == 0

Alertmanager failed to reload config.

Remediation steps:

Check the Alertmanager pods’ logs via kubectl -n monitoring logs alertmanager-0 (and likewise for alertmanager-1 and alertmanager-2).
Check the alertmanager secret via kubectl -n monitoring get secret alertmanager -o yaml.

PromAlertsFailed critical

sum(increase(alertmanager_notifications_failed_total{job="alertmanager"}[5m])) by (namespace) > 0

Alertmanager failed to send an alert.

Remediation steps:

Check the Prometheus pods’ logs via kubectl -n monitoring logs prometheus-0 (and likewise for prometheus-1).
Make sure the Alertmanager StatefulSet is running: kubectl -n monitoring get pods.

PromRemoteStorageFailures critical

(rate(prometheus_remote_storage_failed_samples_total{job="prometheus"}[1m]) * 100)
  /
(rate(prometheus_remote_storage_failed_samples_total{job="prometheus"}[1m]) + rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus"}[1m]))
  > 1

Prometheus failed to send {{ printf "%.1f" $value }}% of samples.

Remediation steps:

Ensure that the Prometheus volume has not reached capacity.
Check the Prometheus pods’ logs via kubectl -n monitoring logs prometheus-0 (and likewise for prometheus-1).

PromRuleFailures critical

rate(prometheus_rule_evaluation_failures_total{job="prometheus"}[1m]) > 0

Prometheus failed to evaluate {{ printf "%.1f" $value }} rules/sec.

Remediation steps:

Check the Prometheus pods’ logs via kubectl -n monitoring logs prometheus-0 (and likewise for prometheus-1).
Check CPU/memory pressure on the node.

Group thanos

ThanosSidecarDown warning

thanos_sidecar_prometheus_up != 1

The Thanos sidecar in {{ $labels.namespace }}/{{ $labels.pod }} is down.

ThanosSidecarNoHeartbeat warning

time() - thanos_sidecar_last_heartbeat_success_time_seconds > 60

The Thanos sidecar in {{ $labels.namespace }}/{{ $labels.pod }} didn’t send a heartbeat in {{ $value }} seconds.

ThanosCompactorManyRetries warning

sum(rate(thanos_compactor_retries_total[5m])) > 0.01

The Thanos compactor in {{ $labels.namespace }} is experiencing a high retry rate.

Remediation steps:

Check the thanos-compact pod’s logs.

ThanosShipperManyDirSyncFailures warning

sum(rate(thanos_shipper_dir_sync_failures_total[5m])) > 0.01

The Thanos shipper in {{ $labels.namespace }}/{{ $labels.pod }} is experiencing a high dir-sync failure rate.

Remediation steps:

Check the thanos container’s logs inside the Prometheus pod.

ThanosManyPanicRecoveries warning

sum(rate(thanos_grpc_req_panics_recovered_total[5m])) > 0.01

The Thanos component in {{ $labels.namespace }}/{{ $labels.pod }} is experiencing a high panic recovery rate.

ThanosManyBlockLoadFailures warning

sum(rate(thanos_bucket_store_block_load_failures_total[5m])) > 0.01

The Thanos store in {{ $labels.namespace }}/{{ $labels.pod }} is experiencing many failed block loads.

ThanosManyBlockDropFailures warning

sum(rate(thanos_bucket_store_block_drop_failures_total[5m])) > 0.01

The Thanos store in {{ $labels.namespace }}/{{ $labels.pod }} is experiencing many failed block drops.

Group velero

VeleroBackupTakesTooLong warning

(velero_backup_attempt_total - velero_backup_success_total) > 0

Backup schedule {{ $labels.schedule }} has been running for more than 60 minutes.

Remediation steps:

Check if a backup is really in “InProgress” state via velero -n velero backup get.
Check the backup logs via velero -n velero backup logs [BACKUP_NAME].
Depending on the backup, find the pod and check the processes inside that pod or any sidecar containers.

VeleroNoRecentBackup warning

time() - velero_backup_last_successful_timestamp{schedule!=""} > 3600*25

There has not been a successful backup for schedule {{ $labels.schedule }} in the last 24 hours.

Remediation steps:

Check whether backups really did not run via velero -n velero backup get.
If a backup failed, check its logs via velero -n velero backup logs [BACKUP_NAME].
If a backup was not even triggered, check the Velero server’s logs via kubectl -n velero logs -l 'name=velero-server'.
Make sure the Velero server pod has not been rescheduled and possibly opt to schedule it on a stable node using a node affinity.

Group kubermatic

KubermaticAPIDown critical

absent(up{job="pods",namespace="kubermatic",role="kubermatic-api"} == 1)

KubermaticAPI has disappeared from Prometheus target discovery.

Remediation steps:

Check the Prometheus Service Discovery page to find out why the target is unreachable.
Check the API pod’s logs and ensure that it is not crashlooping.

KubermaticAPITooManyErrors warning

sum(rate(http_requests_total{role="kubermatic-api",code=~"5.."}[5m])) > 0.1

Kubermatic API is returning a high rate of HTTP 5xx responses.

Remediation steps:

Check the API pod’s logs.

KubermaticAPITooManyInitNodeDeloymentFailures warning

sum(rate(kubermatic_api_init_node_deployment_failures[5m])) > 0.01

Kubermatic API is failing to create initial node deployments too often.

Group kubermatic

KubermaticTooManyUnhandledErrors warning

sum(rate(kubermatic_controller_manager_unhandled_errors_total[5m])) > 0.01

Kubermatic controller manager in {{ $labels.namespace }} is experiencing too many errors.

Remediation steps:

Check the controller-manager pod’s logs.

KubermaticClusterDeletionTakesTooLong warning

(time() - max by (cluster) (kubermatic_cluster_deleted)) > 30*60

Cluster {{ $labels.cluster }} is stuck in deletion for more than 30min.

Remediation steps:

Check the machine-controller’s logs via kubectl -n cluster-XYZ logs -l 'app=machine-controller' for errors related to cloud provider integrations. Expired credentials or manually deleted cloud provider resources are common reasons for failing deletions.
Check the cluster’s status itself via kubectl describe cluster XYZ.
If all resources have been cleaned up, remove the blocking finalizer (e.g. kubermatic.io/delete-nodes) from the cluster resource.
If nothing else helps, manually delete the cluster namespace as a last resort.

KubermaticAddonDeletionTakesTooLong warning

(time() - max by (cluster,addon) (kubermatic_addon_deleted)) > 30*60

Addon {{ $labels.addon }} in cluster {{ $labels.cluster }} is stuck in deletion for more than 30min.

Remediation steps:

Check the kubermatic controller-manager’s logs via kubectl -n kubermatic logs -l 'role=controller-manager' for errors related to deletion of the addon. Manually deleted resources inside of the user cluster are a common reason for failing deletions.
If all resources of the addon inside the user cluster have been cleaned up, remove the blocking finalizer (e.g. cleanup-manifests) from the addon resource.

KubermaticAddonTakesTooLongToReconcile warning

kubermatic_addon_reconcile_failed * on(cluster) group_left() kubermatic_cluster_created - kubermatic_addon_reconcile_failed * on(cluster) group_left() kubermatic_cluster_deleted > 0

Addon {{ $labels.addon }} in cluster {{ $labels.cluster }} has no related resources created for more than 30min.

Remediation steps:

Check the kubermatic controller-manager’s logs via kubectl -n kubermatic logs -l 'role=controller-manager' for errors related to reconciliation of the addon.

KubermaticControllerManagerDown critical

absent(up{job="pods",namespace="kubermatic",role="controller-manager"} == 1)

KubermaticControllerManager has disappeared from Prometheus target discovery.

Remediation steps:

Check the Prometheus Service Discovery page to find out why the target is unreachable.
Check the controller-manager pod’s logs and ensure that it is not crashlooping.

OpenVPNServerDown critical

absent(kube_deployment_status_replicas_available{cluster!="",deployment="openvpn-server"} > 0) and count(kubermatic_cluster_info) > 0

There is no healthy OpenVPN server in cluster {{ $labels.cluster }}.

UserClusterPrometheusAbsent critical

(
  kubermatic_cluster_info * on (name) group_left
  label_replace(up{job="clusters"}, "name", "$1", "namespace", "cluster-(.+)")
  or
  kubermatic_cluster_info * 0
) == 0

There is no Prometheus in cluster {{ $labels.name }}.
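
The UserClusterPrometheusAbsent expression uses label_replace to derive a name label from the namespace label, so that the up series can be joined with kubermatic_cluster_info on name. The following Python sketch approximates what label_replace does to a single label set; it is an illustration rather than Prometheus' implementation, and the label values are made up.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Return a copy of labels with dst_label set from a regex match on src_label.

    The regex must match the whole source value; $N in the replacement
    refers to the N-th capture group ($0 is the full match). If there is
    no match, the labels are returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src_label, ""))
    if match is None:
        return dict(labels)
    value = re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))) or "",
        replacement,
    )
    result = dict(labels)
    result[dst_label] = value
    return result


series = {"job": "clusters", "namespace": "cluster-abc123"}
print(label_replace(series, "name", "$1", "namespace", "cluster-(.+)"))
```

Here the hypothetical namespace cluster-abc123 yields name="abc123", while series from namespaces that do not start with cluster- keep their labels untouched and therefore do not take part in the join.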

KubermaticClusterPaused none

label_replace(kubermatic_cluster_info{pause="true"}, "cluster", "$0", "name", ".+")

Cluster {{ $labels.name }} has been paused and will not be reconciled until the pause flag is reset.

Group kube-controller-manager

KubeControllerManagerDown critical

absent(:ready_kube_controller_managers:sum) or :ready_kube_controller_managers:sum == 0

No healthy controller-manager pods exist inside the cluster.

Group kube-scheduler

KubeSchedulerDown critical

absent(:ready_kube_schedulers:sum) or :ready_kube_schedulers:sum == 0

No healthy scheduler pods exist inside the cluster.