Alerting Runbook

This page serves as a reference to the alerts that a standard Kubermatic Kubernetes Platform (KKP) monitoring setup can fire, alongside a short description and steps to debug.

Under Development

Group blackbox-exporter

HttpProbeFailed warning

probe_success != 1

Probing the blackbox-exporter target {{ $labels.instance }} failed.

HttpProbeSlow warning

sum by (instance) (probe_http_duration_seconds) > 3

{{ $labels.instance }} takes {{ $value }} seconds to respond.

Remediation steps:

  • Check the target system’s resource usage for anomalies (see the sketch below).
  • Check if the target application has been recently rescheduled and is still settling.
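
A minimal sketch for the first step, assuming the probed target runs inside the cluster and that metrics-server is available; [NAMESPACE] and [POD] are placeholders:

# CPU/memory usage of the target's pods (requires metrics-server)
kubectl -n [NAMESPACE] top pods
# check whether pods were recently rescheduled or restarted
kubectl -n [NAMESPACE] get pods -o wide
kubectl -n [NAMESPACE] describe pod [POD]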

HttpCertExpiresSoon warning

probe_ssl_earliest_cert_expiry - time() < 3*24*3600

The certificate for {{ $labels.instance }} expires in less than 3 days.

HttpCertExpiresVerySoon critical

probe_ssl_earliest_cert_expiry - time() < 24*3600

The certificate for {{ $labels.instance }} expires in less than 24 hours.

Group cert-manager

CertManagerCertExpiresSoon warning

certmanager_certificate_expiration_timestamp_seconds - time() < 3*24*3600

The certificate {{ $labels.name }} expires in less than 3 days.

CertManagerCertExpiresVerySoon critical

certmanager_certificate_expiration_timestamp_seconds - time() < 24*3600

The certificate {{ $labels.name }} expires in less than 24 hours.

Group helm-exporter

HelmReleaseNotDeployed warning

helm_chart_info != 1

The Helm release {{ $labels.release }} ({{ $labels.chart }} chart in namespace {{ $labels.exported_namespace }}) in version {{ $labels.version }} has not been ready for more than 15 minutes.

Remediation steps:

  • Check the installed Helm releases via helm --namespace monitoring ls --all.
  • If Helm cannot repair the release automatically, delete/purge it (helm delete --purge [RELEASE]) and re-install the chart (see the sketch below for Helm 3 equivalents).
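
The purge syntax above is Helm 2; with Helm 3 the --purge flag no longer exists and uninstalling removes the release history by default. A rough equivalent, with [RELEASE] as a placeholder:

# inspect the release state and its revision history
helm --namespace monitoring status [RELEASE]
helm --namespace monitoring history [RELEASE]
# Helm 3 equivalent of "helm delete --purge"
helm --namespace monitoring uninstall [RELEASE]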

Group kube-apiserver

KubernetesApiserverDown critical

absent(up{job="apiserver"} == 1)

KubernetesApiserver has disappeared from Prometheus target discovery.

KubeAPITerminatedRequests warning

sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))
  /
(sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) ) > 0.20

The kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.

KubeAPITerminatedRequests critical

sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))
  /
(sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) ) > 0.20

The kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.

KubeClientCertificateExpiration warning

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
and
histogram_quantile(0.01, sum by (job, instance, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800

A client certificate used to authenticate to the apiserver is expiring in less than 7 days.

Remediation steps:

  • Check the Kubernetes documentation on how to renew certificates.
  • If your certificate has already expired, the steps in the documentation might not work. Check GitHub for hints about fixing your cluster.
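
Before renewing, it can help to confirm when the client credential actually expires. A minimal sketch, assuming the certificate is embedded in a kubeconfig as client-certificate-data (adjust the user index for your kubeconfig):

# print the expiry date of the first user's client certificate
kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' \
  | base64 -d | openssl x509 -noout -enddate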

KubeClientCertificateExpiration critical

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
and
histogram_quantile(0.01, sum by (job, instance, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400

A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.

Remediation steps:

  • Urgently renew your certificates. Once certificates have expired, even beginning to fix the cluster becomes much more difficult.
  • Check the Kubernetes documentation on how to renew certificates.
  • If your certificate has already expired, the steps in the documentation might not work. Check GitHub for hints about fixing your cluster.
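
If the control plane was set up with kubeadm (an assumption; adjust for your setup and kubeadm version), the expiry of all managed certificates can be listed and renewed directly on a control-plane node:

# list expiration dates of all kubeadm-managed certificates
sudo kubeadm certs check-expiration
# renew all of them at once
sudo kubeadm certs renew all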

Group kube-kubelet

KubeletDown critical

absent(up{job="kubelet"} == 1)

Kubelet has disappeared from Prometheus target discovery.

KubePersistentVolumeFillingUp critical

(
  kubelet_volume_stats_available_bytes{job="kubelet"}
    /
  kubelet_volume_stats_capacity_bytes{job="kubelet"}
) < 0.05
and
kubelet_volume_stats_used_bytes{job="kubelet"} > 0
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free.

KubePersistentVolumeFillingUp warning

(
  kubelet_volume_stats_available_bytes{job="kubelet"}
    /
  kubelet_volume_stats_capacity_bytes{job="kubelet"}
) < 0.07
and
kubelet_volume_stats_used_bytes{job="kubelet"} > 0
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free.

KubePersistentVolumeFillingUp warning

(
  kubelet_volume_stats_available_bytes{job="kubelet"}
    /
  kubelet_volume_stats_capacity_bytes{job="kubelet"}
) < 0.15
and
kubelet_volume_stats_used_bytes{job="kubelet"} > 0
and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available.

KubePersistentVolumeInodesFillingUp critical

(
  kubelet_volume_stats_inodes_free{job="kubelet"}
    /
  kubelet_volume_stats_inodes{job="kubelet"}
) < 0.03
and
kubelet_volume_stats_inodes_used{job="kubelet"} > 0
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} only has {{ $value | humanizePercentage }} free inodes.

KubePersistentVolumeInodesFillingUp warning

(
  kubelet_volume_stats_inodes_free{job="kubelet"}
    /
  kubelet_volume_stats_inodes{job="kubelet"}
) < 0.15
and
kubelet_volume_stats_inodes_used{job="kubelet"} > 0
and
predict_linear(kubelet_volume_stats_inodes_free{job="kubelet"}[6h], 4 * 24 * 3600) < 0
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on(namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to run out of inodes within four days. Currently {{ $value | humanizePercentage }} of its inodes are free.

KubePersistentVolumeErrors critical

kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0

The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.

KubeletTooManyPods warning

kubelet_running_pod_count{job="kubelet"} > 110 * 0.9

Kubelet {{ $labels.instance }} is running {{ $value }} pods, close to the limit of 110.

KubeletClientErrors warning

(sum(rate(rest_client_requests_total{code=~"(5..|<error>)",job="kubelet"}[5m])) by (instance)
  /
sum(rate(rest_client_requests_total{job="kubelet"}[5m])) by (instance))
* 100 > 1

The kubelet on {{ $labels.instance }} is experiencing {{ printf "%0.0f" $value }}% errors.

KubeClientErrors warning

(sum(rate(rest_client_requests_total{code=~"(5..|<error>)",job="pods"}[5m])) by (namespace, pod)
  /
sum(rate(rest_client_requests_total{job="pods"}[5m])) by (namespace, pod))
* 100 > 1

The pod {{ $labels.namespace }}/{{ $labels.pod }} is experiencing {{ printf "%0.0f" $value }}% errors.

KubeletRuntimeErrors warning

sum(rate(kubelet_runtime_operations_errors_total{job="kubelet"}[5m])) by (instance) > 0.1

The kubelet on {{ $labels.instance }} is having an elevated error rate for container runtime operations.

KubeletCGroupManagerDurationHigh warning

sum(rate(kubelet_cgroup_manager_duration_seconds{quantile="0.9"}[5m])) by (instance) * 1000 > 1

The kubelet’s cgroup manager duration on {{ $labels.instance }} has been elevated ({{ printf "%0.2f" $value }}ms) for more than 15 minutes.

KubeletPodWorkerDurationHigh warning

sum(rate(kubelet_pod_worker_duration_seconds{quantile="0.9"}[5m])) by (instance, operation_type) * 1000 > 250

The kubelet’s pod worker duration for {{ $labels.operation_type }} operations on {{ $labels.instance }} has been elevated ({{ printf "%0.2f" $value }}ms) for more than 15 minutes.

KubeVersionMismatch warning

count(count(kubernetes_build_info{job!="dns"}) by (gitVersion)) > 1

There are {{ $value }} different versions of Kubernetes components running.

Group kube-state-metrics

KubeStateMetricsDown critical

absent(up{job="kube-state-metrics"} == 1)

KubeStateMetrics has disappeared from Prometheus target discovery.

KubePodCrashLooping critical

max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1

Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.

Remediation steps:

  • Check the pod’s logs.
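
A short sketch for the step above; [NAMESPACE], [POD] and [CONTAINER] are placeholders taken from the alert labels:

# logs of the currently running container
kubectl -n [NAMESPACE] logs [POD] -c [CONTAINER]
# logs of the previous, crashed instance -- this usually contains the actual error
kubectl -n [NAMESPACE] logs [POD] -c [CONTAINER] --previous
# exit code and last termination reason
kubectl -n [NAMESPACE] describe pod [POD]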

KubePodNotReady critical

sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}) > 0

Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.

Remediation steps:

  • Check the pod via kubectl describe pod [POD] to find out about scheduling issues.
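
Expanding on the step above, a minimal sketch with [NAMESPACE] and [POD] as placeholders:

# scheduling and readiness details, including the Events section
kubectl -n [NAMESPACE] describe pod [POD]
# recent events in the namespace, oldest first
kubectl -n [NAMESPACE] get events --sort-by=.lastTimestamp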

KubeDeploymentGenerationMismatch critical

kube_deployment_status_observed_generation{job="kube-state-metrics"}
  !=
kube_deployment_metadata_generation{job="kube-state-metrics"}

Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match its metadata generation; this indicates that the Deployment has failed but has not been rolled back.

KubeDeploymentReplicasMismatch critical

kube_deployment_spec_replicas{job="kube-state-metrics"}
  !=
kube_deployment_status_replicas_available{job="kube-state-metrics"}

Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than an hour.

KubeStatefulSetReplicasMismatch critical

kube_statefulset_status_replicas_ready{job="kube-state-metrics"}
  !=
kube_statefulset_status_replicas{job="kube-state-metrics"}

StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.

KubeStatefulSetGenerationMismatch critical

kube_statefulset_status_observed_generation{job="kube-state-metrics"}
  !=
kube_statefulset_metadata_generation{job="kube-state-metrics"}

StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match its metadata generation; this indicates that the StatefulSet has failed but has not been rolled back.

KubeStatefulSetUpdateNotRolledOut critical

max without (revision) (
  kube_statefulset_status_current_revision{job="kube-state-metrics"}
    unless
  kube_statefulset_status_update_revision{job="kube-state-metrics"}
)
  *
(
  kube_statefulset_replicas{job="kube-state-metrics"}
    !=
  kube_statefulset_status_replicas_updated{job="kube-state-metrics"}
)

StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.

KubeDaemonSetRolloutStuck critical

kube_daemonset_status_number_ready{job="kube-state-metrics"}
  /
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} * 100 < 100

Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.

KubeDaemonSetNotScheduled warning

kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
  -
kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0

{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.

KubeDaemonSetMisScheduled warning

kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0

{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.

KubeCronJobRunning warning

time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600

CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.

KubeJobCompletion warning

time() - max by(namespace, job_name, cluster) (kube_job_status_start_time{job="kube-state-metrics"}
  and
kube_job_status_active{job="kube-state-metrics"} > 0) > 43200

Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 12 hours to complete.

KubeJobFailed warning

kube_job_status_failed{job="kube-state-metrics"} > 0

Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.

KubeCPUOvercommit warning

sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="requests.cpu"})
  /
sum(node:node_num_cpu:sum)
  > 1.5

Cluster has overcommitted CPU resource requests for namespaces.

KubeCPUOvercommit critical

sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)
  -
(sum(kube_node_status_allocatable{resource="cpu"}) - max(kube_node_status_allocatable{resource="cpu"}))
  > 0
and
(sum(kube_node_status_allocatable{resource="cpu"})
  -
max(kube_node_status_allocatable{resource="cpu"}))
  > 0

Cluster has overcommitted CPU resource requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.

KubeMemOvercommit warning

sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="requests.memory"})
  /
sum(node_memory_MemTotal_bytes{app="node-exporter"})
  > 1.5

Cluster has overcommitted memory resource requests for namespaces.

KubeMemOvercommit critical

sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)
  -
(sum(kube_node_status_allocatable{resource="memory"}) - max(kube_node_status_allocatable{resource="memory"}))
  > 0
and
(sum(kube_node_status_allocatable{resource="memory"})
  -
max(kube_node_status_allocatable{resource="memory"}))
  > 0

Cluster has overcommitted memory resource requests for Pods by {{ $value }} bytes and cannot tolerate node failure.

KubeQuotaExceeded warning

100 * kube_resourcequota{job="kube-state-metrics", type="used"}
  / ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
  > 90

Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.

KubePodOOMKilled warning

(kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 30m >= 2)
and
ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[30m]) == 1

Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 30 minutes.

KubeNodeNotReady warning

kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0

{{ $labels.node }} has been unready for more than an hour.

Group node-exporter

NodeFilesystemSpaceFillingUp warning

predict_linear(node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"}[6h], 24*60*60) < 0
and
node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} < 0.4
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.

NodeFilesystemSpaceFillingUp critical

predict_linear(node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"}[6h], 4*60*60) < 0
and
node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} < 0.2
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 4 hours.

NodeFilesystemOutOfSpace warning

node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 10
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on node {{ $labels.node_name }} having IP {{ $labels.instance }} has only {{ $value }}% available space left on drive {{ $labels.device }}.

NodeFilesystemOutOfSpace critical

node_filesystem_avail_bytes{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_size_bytes{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 5
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on node {{ $labels.node_name }} having IP {{ $labels.instance }} has only {{ $value }}% available space left on drive {{ $labels.device }}.

NodeFilesystemFilesOutOfSpace critical

node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 10
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on node {{ $labels.node_name }} having IP {{ $labels.instance }} has only {{ $value }}% inodes available on drive {{ $labels.device }}.

NodeFilesystemFilesFillingUp warning

predict_linear(node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"}[6h], 24*60*60) < 0
and
node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} < 0.4
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of files within the next 24 hours.

NodeFilesystemFilesFillingUp critical

predict_linear(node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"}[6h], 4*60*60) < 0
and
node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} < 0.2
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of files within the next 4 hours.

NodeFilesystemOutOfFiles warning

node_filesystem_files_free{app="node-exporter",fstype=~"ext.|xfs"} / node_filesystem_files{app="node-exporter",fstype=~"ext.|xfs"} * 100 < 5
and
node_filesystem_readonly{app="node-exporter",fstype=~"ext.|xfs"} == 0

Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ $value }}% available inodes left.

NodeNetworkReceiveErrs critical

increase(node_network_receive_errs_total[2m]) > 10

{{ $labels.instance }} interface {{ $labels.device }} shows errors while receiving packets ({{ $value }} errors in two minutes).

NodeNetworkTransmitErrs critical

increase(node_network_transmit_errs_total[2m]) > 10

{{ $labels.instance }} interface {{ $labels.device }} shows errors while transmitting packets ({{ $value }} errors in two minutes).

NodeTimeDrift critical

abs(timestamp(node_time_seconds) - node_time_seconds) > 1

Time on Node {{ $labels.node_name }} drifts by {{ $value }} seconds.

Group prometheus

PromScrapeFailed warning

up != 1

Prometheus failed to scrape a target {{ $labels.job }} / {{ $labels.instance }}.

Remediation steps:

  • Check the Prometheus Service Discovery page to find out why the target is unreachable.
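
The Service Discovery and Targets pages can be reached by port-forwarding to the Prometheus pod; the pod name below follows the naming used elsewhere in this runbook and may differ in your setup:

# forward the Prometheus UI to localhost
kubectl -n monitoring port-forward prometheus-0 9090
# then open http://localhost:9090/service-discovery and http://localhost:9090/targets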

PromBadConfig critical

prometheus_config_last_reload_successful{job="prometheus"} == 0

Prometheus failed to reload config.

Remediation steps:

  • Check Prometheus pod’s logs via kubectl -n monitoring logs prometheus-0 and -1.
  • Check the prometheus-rules configmap via kubectl -n monitoring get configmap prometheus-rules -o yaml.

PromAlertmanagerBadConfig critical

alertmanager_config_last_reload_successful{job="alertmanager"} == 0

Alertmanager failed to reload config.

Remediation steps:

  • Check Alertmanager pod’s logs via kubectl -n monitoring logs alertmanager-0, -1 and -2.
  • Check the alertmanager secret via kubectl -n monitoring get secret alertmanager -o yaml.

PromAlertsFailed critical

sum(increase(alertmanager_notifications_failed_total{job="alertmanager"}[5m])) by (namespace) > 0

Alertmanager failed to send an alert.

Remediation steps:

  • Check Prometheus pod’s logs via kubectl -n monitoring logs prometheus-0 and -1.
  • Make sure the Alertmanager StatefulSet is running: kubectl -n monitoring get pods.
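
To verify that the Alertmanager replicas themselves are healthy, a small sketch (pod name and default port 9093 are assumptions matching the setup described above):

# in one terminal: forward the Alertmanager API to localhost
kubectl -n monitoring port-forward alertmanager-0 9093
# in a second terminal: health endpoint and current status (config, cluster peers)
curl -s http://localhost:9093/-/healthy
curl -s http://localhost:9093/api/v2/status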

PromRemoteStorageFailures critical

(rate(prometheus_remote_storage_failed_samples_total{job="prometheus"}[1m]) * 100)
  /
(rate(prometheus_remote_storage_failed_samples_total{job="prometheus"}[1m]) + rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus"}[1m]))
  > 1

Prometheus failed to send {{ printf "%.1f" $value }}% samples.

Remediation steps:

  • Ensure that the Prometheus volume has not reached capacity.
  • Check Prometheus pod’s logs via kubectl -n monitoring logs prometheus-0 and -1.

PromRuleFailures critical

rate(prometheus_rule_evaluation_failures_total{job="prometheus"}[1m]) > 0

Prometheus failed to evaluate {{ printf "%.1f" $value }} rules/sec.

Remediation steps:

  • Check Prometheus pod’s logs via kubectl -n monitoring logs prometheus-0 and -1.
  • Check CPU/memory pressure on the node.
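
To narrow down which rule group is failing, the failure metric can be broken down by group via the Prometheus HTTP API; pod name and port are assumptions as above:

# in one terminal: expose the Prometheus API
kubectl -n monitoring port-forward prometheus-0 9090
# in a second terminal: failure rate per rule group
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (rule_group) (rate(prometheus_rule_evaluation_failures_total[5m])) > 0'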

Group velero

VeleroBackupTakesTooLong warning

time() - velero_backup_last_successful_timestamp{schedule!=""} > 3600

Last backup with schedule {{ $labels.schedule }} has not finished successfully within 60min.

Remediation steps:

  • Check if a backup is really in “InProgress” state via velero -n velero backup get.
  • Check the backup logs via velero -n velero backup logs [BACKUP_NAME].
  • Depending on the backup, find the pod and check the processes inside that pod or any sidecar containers.
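
A short sketch for inspecting a long-running backup; [BACKUP_NAME] is a placeholder:

# phase, per-resource progress and errors of the backup
velero -n velero backup describe [BACKUP_NAME] --details
# full backup logs (on some versions only available once the backup has finished or failed)
velero -n velero backup logs [BACKUP_NAME]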

VeleroNoRecentBackup critical

time() - velero_backup_last_successful_timestamp{schedule!=""} > 3600*25

There has not been a successful backup for schedule {{ $labels.schedule }} in the last 24 hours.

Remediation steps:

  • Check if really no backups happened via velero -n velero backup get.
  • If a backup failed, check its logs via velero -n velero backup logs [BACKUP_NAME].
  • If a backup was not even triggered, check the Velero server’s logs via kubectl -n velero logs -l 'name=velero-server'.
  • Make sure the Velero server pod has not been rescheduled and possibly opt to schedule it on a stable node using a node affinity.
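
A small sketch for the first steps; the Deployment name "velero" is an assumption, the label selector matches the one above:

# schedules and the age of their last backup
velero -n velero schedule get
velero -n velero backup get
# make sure the server itself is running
kubectl -n velero get deployment velero
kubectl -n velero logs -l 'name=velero-server' --tail=100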

Group kubermatic

KubermaticAPIDown critical

absent(up{job="pods",namespace="kubermatic",app_kubernetes_io_name="kubermatic-api"} == 1)

KubermaticAPI has disappeared from Prometheus target discovery.

Remediation steps:

  • Check the Prometheus Service Discovery page to find out why the target is unreachable.
  • Check the API pod’s logs and ensure that it is not crashlooping (see the sketch below).
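
A minimal sketch for the step above, assuming the pods carry the app.kubernetes.io/name=kubermatic-api label that the alert expression selects on:

# pod status and recent logs of the Kubermatic API
kubectl -n kubermatic get pods -l 'app.kubernetes.io/name=kubermatic-api'
kubectl -n kubermatic logs -l 'app.kubernetes.io/name=kubermatic-api' --tail=100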

KubermaticAPITooManyErrors warning

sum(rate(http_requests_total{app_kubernetes_io_name="kubermatic-api",code=~"5.."}[5m])) > 0.1

Kubermatic API is returning a high rate of HTTP 5xx responses.

Remediation steps:

  • Check the API pod’s logs.

KubermaticAPITooManyInitNodeDeloymentFailures warning

sum(rate(kubermatic_api_failed_init_node_deployment_total[5m])) > 0.01

The Kubermatic API is failing to create initial node deployments at an elevated rate.

KubermaticMasterControllerManagerDown critical

absent(up{job="pods",namespace="kubermatic",app_kubernetes_io_name="kubermatic-master-controller-manager"} == 1)

Kubermatic Master Controller Manager has disappeared from Prometheus target discovery.

Remediation steps:

  • Check the Prometheus Service Discovery page to find out why the target is unreachable.
  • Check the master-controller-manager pod’s logs and ensure that it is not crashlooping.

KubermaticSeedNotHealthy warning

kubermatic_seed_info{phase!="Healthy"}

The Seed cluster {{ $labels.seed_name }} cannot be reached or reconciled properly.

Remediation steps:

  • Check the conditions on the Seed object to learn more about the issues.
  • Check the kubermatic-operator logs for additional information.
  • Ensure that a valid kubeconfig Secret exists for the Seed.
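
A rough sketch, assuming Seeds and their kubeconfig Secrets live in the kubermatic namespace and that the operator pods carry an app.kubernetes.io/name=kubermatic-operator label; [SEED] is a placeholder:

# phase and conditions of the Seed
kubectl -n kubermatic get seed [SEED] -o yaml
# operator logs
kubectl -n kubermatic logs -l 'app.kubernetes.io/name=kubermatic-operator' --tail=100
# confirm the referenced kubeconfig Secret exists
kubectl -n kubermatic get secrets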

Group kubermatic

KubermaticTooManyUnhandledErrors warning

sum(rate(kubermatic_controller_manager_unhandled_errors_total[5m])) > 0.01

Kubermatic controller manager in {{ $labels.namespace }} is experiencing too many errors.

Remediation steps:

  • Check the controller-manager pod’s logs.

KubermaticClusterDeletionTakesTooLong warning

(time() - max by (cluster) (kubermatic_cluster_deleted)) > 30*60

Cluster {{ $labels.cluster }} is stuck in deletion for more than 30min.

Remediation steps:

  • Check the machine-controller’s logs via kubectl -n cluster-XYZ logs -l 'app=machine-controller' for errors related to cloud provider integrations. Expired credentials or manually deleted cloud provider resources are common reasons for failing deletions.
  • Check the cluster’s status itself via kubectl describe cluster XYZ.
  • If all resources have been cleaned up, remove the blocking finalizer (e.g. kubermatic.io/delete-nodes) from the cluster resource.
  • If nothing else helps, manually delete the cluster namespace as a last resort (see the sketch below for removing a finalizer).
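
Only after verifying that all cloud resources are gone should the blocking finalizer be removed. A sketch, reusing the XYZ placeholder from above; the /0 index must match the position of the blocking finalizer in the list:

# list the finalizers still present on the cluster object
kubectl get cluster XYZ -o jsonpath='{.metadata.finalizers}'
# remove a single finalizer by its index in that list
kubectl patch cluster XYZ --type=json -p='[{"op":"remove","path":"/metadata/finalizers/0"}]'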

KubermaticAddonDeletionTakesTooLong warning

(time() - max by (cluster,addon) (kubermatic_addon_deleted)) > 30*60

Addon {{ $labels.addon }} in cluster {{ $labels.cluster }} is stuck in deletion for more than 30min.

Remediation steps:

  • Check the kubermatic controller-manager’s logs via kubectl -n kubermatic logs -l 'app.kubernetes.io/name=kubermatic-seed-controller-manager' for errors related to deletion of the addon. Manually deleted resources inside the user cluster are a common reason for failing deletions.
  • If all resources of the addon inside the user cluster have been cleaned up, remove the blocking finalizer (e.g. cleanup-manifests) from the addon resource.

KubermaticAddonTakesTooLongToReconcile warning

kubermatic_addon_reconcile_failed * on(cluster) group_left() kubermatic_cluster_created
- kubermatic_addon_reconcile_failed * on(cluster) group_left() kubermatic_cluster_deleted
> 0

Addon {{ $labels.addon }} in cluster {{ $labels.cluster }} has no related resources created for more than 30min.

Remediation steps:

  • Check the kubermatic seed controller-manager’s logs via kubectl -n kubermatic logs -l 'app.kubernetes.io/name=kubermatic-seed-controller-manager' for errors related to reconciliation of the addon.

KubermaticSeedControllerManagerDown critical

absent(up{job="pods",namespace="kubermatic",app_kubernetes_io_name="kubermatic-seed-controller-manager"} == 1)

Kubermatic Seed Controller Manager has disappeared from Prometheus target discovery.

Remediation steps:

  • Check the Prometheus Service Discovery page to find out why the target is unreachable.
  • Check the seed-controller-manager pod’s logs and ensure that it is not crashlooping.

OpenVPNServerDown critical

(kube_deployment_status_replicas_available{cluster!="",deployment="openvpn-server"} != kube_deployment_spec_replicas{cluster!="",deployment="openvpn-server"}) and count(kubermatic_cluster_info) > 0

There is no healthy OpenVPN server in cluster {{ $labels.cluster }}.

UserClusterPrometheusAbsent critical

(
  kubermatic_cluster_info * on (name) group_left
  label_replace(up{job="clusters"}, "name", "$1", "namespace", "cluster-(.+)")
  or
  kubermatic_cluster_info * 0
) == 0

There is no Prometheus in cluster {{ $labels.name }}.

KubermaticClusterPaused informational

label_replace(kubermatic_cluster_info{pause="true"}, "cluster", "$0", "name", ".+")

Cluster {{ $labels.name }} has been paused and will not be reconciled until the pause flag is reset.

Group kube-controller-manager

KubeControllerManagerDown critical

absent(:ready_kube_controller_managers:sum) or :ready_kube_controller_managers:sum == 0

No healthy controller-manager pods exist inside the cluster.

Group kube-scheduler

KubeSchedulerDown critical

absent(:ready_kube_schedulers:sum) or :ready_kube_schedulers:sum == 0

No healthy scheduler pods exist inside the cluster.

Group cortex

CortexDistributorDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="distributor",app_kubernetes_io_name="cortex"} == 1)

Cortex-distributor has disappeared from Prometheus target discovery.

CortexQuerierDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="querier",app_kubernetes_io_name="cortex"} == 1)

Cortex-querier has disappeared from Prometheus target discovery.

CortexQueryFrontendDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="query-frontend",app_kubernetes_io_name="cortex"} == 1)

Cortex-query-frontend has disappeared from Prometheus target discovery.

CortexRulerDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="ruler",app_kubernetes_io_name="cortex"} == 1)

Cortex-ruler has disappeared from Prometheus target discovery.

CortexMemcachedBlocksDown warning

absent(up{job="pods",namespace="mla", app_kubernetes_io_instance="cortex",app_kubernetes_io_name="memcached-blocks"} == 1)

Cortex-memcached-blocks has disappeared from Prometheus target discovery.

CortexMemcachedBlocksMetadataDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_instance="cortex",app_kubernetes_io_name="memcached-blocks-metadata"} == 1)

Cortex-memcached-blocks-metadata has disappeared from Prometheus target discovery.

CortexMemcachedBlocksIndexDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_instance="cortex",app_kubernetes_io_name="memcached-blocks-index"} == 1)

Cortex-memcached-blocks-index has disappeared from Prometheus target discovery.

CortexAlertmanagerDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="alertmanager",app_kubernetes_io_name="cortex"} == 1)

Cortex-alertmanager has disappeared from Prometheus target discovery.

CortexCompactorDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="compactor",app_kubernetes_io_name="cortex"} == 1)

Cortex-compactor has disappeared from Prometheus target discovery.

CortexIngesterDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="ingester",app_kubernetes_io_name="cortex"} == 1)

Cortex-ingester has disappeared from Prometheus target discovery.

CortexStoreGatewayDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="store-gateway",app_kubernetes_io_name="cortex"} == 1)

Cortex-store-gateway has disappeared from Prometheus target discovery.

Group loki-distributed

LokiIngesterDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="ingester",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-ingester has disappeared from Prometheus target discovery.

LokiDistributorDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="distributor",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-distributor has disappeared from Prometheus target discovery.

LokiQuerierDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="querier",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-querier has disappeared from Prometheus target discovery.

LokiQueryFrontendDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="query-frontend",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-query-frontend has disappeared from Prometheus target discovery.

LokiTableManagerDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="table-manager",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-table-manager has disappeared from Prometheus target discovery.

LokiCompactorDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="compactor",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-compactor has disappeared from Prometheus target discovery.

LokiRulerDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="ruler",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-ruler has disappeared from Prometheus target discovery.

LokiMemcachedChunksDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="memcached-chunks",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-memcached-chunks has disappeared from Prometheus target discovery.

LokiMemcachedFrontendDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="memcached-frontend",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-memcached-frontend has disappeared from Prometheus target discovery.

LokiMemcachedIndexQueriesDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="memcached-index-queries",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-memcached-index-queries has disappeared from Prometheus target discovery.

LokiMemcachedIndexWritesDown warning

absent(up{job="pods",namespace="mla",app_kubernetes_io_component="memcached-index-writes",app_kubernetes_io_name="loki-distributed"} == 1)

Loki-memcached-index-writes has disappeared from Prometheus target discovery.