Restoring from Backup

This page documents a manual restore procedure in case the legacy backup controllers were used to create the backup object in question. KKP v2.24 will remove the legacy backup controllers. The current implementation supports automated restores, so this procedure should not be used. Use the restore functionality in KKP directly instead.

Intro

The etcd of each user cluster is backed up at a configured interval. This document walks through the process of restoring a complete etcd StatefulSet from a single snapshot.

Pausing the Cluster

Restoring an etcd requires manual intervention. As the StatefulSet needs to be modified, the affected cluster must be removed from the controller's management:

# set cluster.spec.pause=true
kubectl edit cluster xxxxxxxxxx
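
The same change can also be applied non-interactively. The following is a minimal sketch using kubectl patch; the cluster name xxxxxxxxxx is a placeholder for the actual cluster ID:

# set cluster.spec.pause=true without opening an editor
kubectl patch cluster xxxxxxxxxx --type=merge -p '{"spec":{"pause":true}}'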

Pausing the StatefulSet

To restore an etcd, the etcd must not be running. Therefore, the etcd StatefulSet must be configured to just execute exec /bin/sleep 86400.

# change command to run 'exec /bin/sleep 86400'
kubectl -n cluster-xxxxxxxxxx edit statefulset etcd
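
Instead of editing the StatefulSet interactively, the command can also be overridden with a JSON patch. The snippet below is only a sketch and assumes the etcd container is the first container in the pod template; verify the index and the existing command before applying it:

# override the etcd container command with a sleep
kubectl -n cluster-xxxxxxxxxx patch statefulset etcd --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/command","value":["/bin/sh","-c","exec /bin/sleep 86400"]}]'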

Deleting All PVCs

To ensure that each pod starts with an empty disk, delete all PVCs. The StatefulSet will automatically create new ones with empty PVs.

kubectl -n cluster-xxxxxxxxxx delete pvc -l app=etcd
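
To confirm the old claims are gone, list the PVCs afterwards:

kubectl -n cluster-xxxxxxxxxx get pvc -l app=etcd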

Deleting All Pods

To ensure all Pods start with the sleep command and with new PVs, all etcd pods must be deleted.

kubectl -n cluster-xxxxxxxxxx delete pod -l app=etcd
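
The pods should come back up running only the sleep command. Watching them is an easy way to confirm they are recreated with fresh volumes:

kubectl -n cluster-xxxxxxxxxx get pods -l app=etcd -w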

Restoring etcd (Must Be Executed on All etcd Pods)

The restore command is different for each member. Make sure it is adjusted for each member before it gets executed.

# Copy snapshot into pod
kubectl cp snapshot.db cluster-xxxxxxxxxx/etcd-0:/var/run/etcd/
# Exec into the pod
kubectl -n cluster-xxxxxxxxxx exec -ti etcd-0 -- sh

cd /var/run/etcd/
# Inside the pod, restore from the snapshot
# This command is specific to each member.
export MEMBER=etcd-0
export CLUSTER_ID=xxxxxxxxxx
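
Before restoring, the integrity of the copied snapshot can optionally be checked from inside the pod; etcdctl snapshot status prints the hash, revision count, and size of the snapshot:

# Inside the pod, optionally inspect the snapshot first
etcdctl snapshot status snapshot.db -w table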

With etcd-launcher Enabled

If etcd-launcher is enabled (which it is by default since KKP v2.22), the restore command needs to use TLS-enabled endpoints:

etcdctl snapshot restore snapshot.db \
  --name ${MEMBER} \
  --initial-cluster etcd-0=https://etcd-0.etcd.cluster-${CLUSTER_ID}.svc.cluster.local:2381,etcd-1=https://etcd-1.etcd.cluster-${CLUSTER_ID}.svc.cluster.local:2381,etcd-2=https://etcd-2.etcd.cluster-${CLUSTER_ID}.svc.cluster.local:2381 \
  --initial-cluster-token ${CLUSTER_ID} \
  --initial-advertise-peer-urls https://${MEMBER}.etcd.cluster-${CLUSTER_ID}.svc.cluster.local:2381 \
  --data-dir /var/run/etcd/pod_${MEMBER}/

With etcd-launcher Disabled

If etcd-launcher is disabled (which is not recommended), the restore command needs to use plain HTTP networking:

etcdctl snapshot restore snapshot.db \
  --name ${MEMBER} \
  --initial-cluster etcd-0=http://etcd-0.etcd.cluster-${CLUSTER_ID}.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd.cluster-${CLUSTER_ID}.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd.cluster-${CLUSTER_ID}.svc.cluster.local:2380 \
  --initial-cluster-token ${CLUSTER_ID} \
  --initial-advertise-peer-urls http://${MEMBER}.etcd.cluster-${CLUSTER_ID}.svc.cluster.local:2380 \
  --data-dir /var/run/etcd/pod_${MEMBER}/
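
The steps above only cover the first member. Repeat the copy, exec, and restore for every member, adjusting the member name each time; for example, a sketch for the second member:

# On the host: copy the snapshot into the next member and exec into it
kubectl cp snapshot.db cluster-xxxxxxxxxx/etcd-1:/var/run/etcd/
kubectl -n cluster-xxxxxxxxxx exec -ti etcd-1 -- sh

# Inside the pod
cd /var/run/etcd/
export MEMBER=etcd-1
export CLUSTER_ID=xxxxxxxxxx
# ...then run the matching etcdctl snapshot restore command from above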

Un-Pausing the Cluster

To let the Kubermatic Kubernetes Platform (KKP) controller manager bring the etcd back to its normal state, un-pause the cluster.

# set cluster.spec.pause=false
kubectl edit cluster xxxxxxxxxx
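
As with pausing, this can also be done non-interactively (a sketch; the cluster name is a placeholder):

# set cluster.spec.pause=false without opening an editor
kubectl patch cluster xxxxxxxxxx --type=merge -p '{"spec":{"pause":false}}'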

Delete etcd-Pods

As the rolling update of the etcd won't finish on its own, all etcd pods must be deleted manually.

kubectl -n cluster-xxxxxxxxxx delete pod -l app=etcd
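
After deleting the pods, wait for the recreated etcd members to become Ready again; the timeout below is only an example value:

kubectl -n cluster-xxxxxxxxxx wait --for=condition=Ready pod -l app=etcd --timeout=10m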