Monitoring Etcd Ring and Replacing Corrupted Members

The etcd maintainers no longer recommend running etcd v3.5 in production. They have found that if the etcd process is killed under high load, some committed transactions are occasionally not reflected on all members. The problem affects etcd versions v3.5.0, v3.5.1, and v3.5.2, and is planned to be fixed in v3.5.3 (release date is TBD). You can check out the email from the etcd maintainers for more details.

We’re deploying etcd v3.5 by default for all Kubernetes 1.22 and newer clusters. We strongly advise taking the following actions:

  • If you are already running Kubernetes 1.22 or newer
    • Follow the Enabling Etcd Corruption Checks section of this document to enable the etcd corruption checks. Those checks will not fix the data consistency issues, but they’ll prevent corrupted etcd members from joining or staying in the etcd ring
    • Make sure your cluster has sufficient CPU, memory, and storage
    • Monitor your cluster etcd ring to make sure there’s no corruption
    • Frequently back up your etcd ring. You can do that by setting up the backups-restic addon; an example of taking a manual snapshot is shown after this list
  • If you are NOT running Kubernetes 1.22 or newer
    • Postpone upgrading existing clusters to, or deploying new clusters with, Kubernetes 1.22 or newer until a fixed etcd version is available from the etcd maintainers
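
If you want to take an ad-hoc backup in addition to the backups-restic addon, the following sketch takes a snapshot with etcdctl from inside one of the etcd pods. It assumes the default kubeadm certificate and data paths used by KubeOne; replace <etcd-pod-name> with one of your etcd pods:

kubectl exec -n kube-system <etcd-pod-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/lib/etcd/snapshot.db

Because /var/lib/etcd is mounted from the node, the snapshot also ends up on the node’s filesystem; copy it to a safe location afterwards.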

Enabling Etcd Corruption Checks

The etcd corruption checks are enabled by default starting with KubeOne 1.4.1. Before proceeding, make sure that you’re running KubeOne 1.4.1 or newer. You can do that by running the version command:

kubeone version

The gitVersion should be 1.4.1 or newer:

{
  "kubeone": {
    "major": "1",
    "minor": "4",
    "gitVersion": "1.4.1",
    "gitCommit": "d44b1a474a3894f1cf685b299fae1c725c1ccb1f",
    "gitTreeState": "",
    "buildDate": "2022-04-04T08:49:52Z",
    "goVersion": "go1.17.5",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "43",
    "gitVersion": "v1.43.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

To enable the corruption checks, you need to force-upgrade your cluster. This means running the upgrade process without changing the Kubernetes version, in order to trigger regeneration of the etcd manifests.

kubeone apply -m kubeone.yaml -t tf.json --force-upgrade

This process might take up to 10 minutes. After it’s done, you can use the following command to validate that all etcd pods have the required flags:

kubectl get pods -n kube-system -l component=etcd -o jsonpath='{range .items[*]}{.metadata.name}: {range .spec.containers[0].command[*]}{}{"\n"}{end}{"\n"}{end}'

Each etcd pod should have the following two flags:

--experimental-corrupt-check-time=240m
--experimental-initial-corrupt-check=true

If you run into any issues, create an issue in the KubeOne repository.

Monitoring The Etcd Ring

We strongly recommend setting up a monitoring and alerting stack that allows you to automatically receive alerts if an etcd member becomes corrupted.

On top of that, check the status of the etcd ring frequently to make sure there are no corrupted members.
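
For a quick manual check, you can query the health of all members with etcdctl from inside one of the etcd pods. This is a minimal sketch assuming the default kubeadm certificate paths used by KubeOne:

kubectl exec -n kube-system <etcd-pod-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster

The --cluster flag makes etcdctl discover all members from the cluster membership, so this single command checks the whole ring.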

Checking the etcd pod status

First, ensure that all etcd pods are Running.

kubectl get pods -n kube-system -l component=etcd
NAME                                                READY   STATUS    RESTARTS   AGE
etcd-ip-172-31-195-53.eu-west-3.compute.internal    1/1     Running   0          7m19s
etcd-ip-172-31-196-114.eu-west-3.compute.internal   1/1     Running   0          6m36s
etcd-ip-172-31-197-44.eu-west-3.compute.internal    1/1     Running   0          5m33s

If you see any pod that is restarting or not Running, you should check the logs and then replace the affected etcd member if needed.

Checking the etcd logs

Check the logs for each etcd pod and make sure there are no messages related to etcd corruption.

You might use the following commands:

kubectl logs -n kube-system <etcd-pod-name>
kubectl logs -n kube-system <etcd-pod-name> | grep -i corrupt
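
If you have multiple etcd members, you can also check all of them in one go with a small loop such as the following sketch:

for pod in $(kubectl get pods -n kube-system -l component=etcd -o name); do
  echo "### ${pod}"
  kubectl logs -n kube-system "${pod}" | grep -i corrupt || echo "no corruption-related messages"
done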

You should see the following log message on all etcd members:

{"level":"info","ts":"2022-04-05T11:16:26.368Z","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"f39a5c54fd589f35"}

The periodic corruption checks (every 4 hours) are done only on the leader etcd member, where you should see a log message such as the following one:

{"level":"info","ts":"2022-04-05T13:29:52.601Z","caller":"etcdserver/corrupt.go:244","msg":"finished peer corruption check","number-of-peers-checked":2}

If you see any logs mentioning that your etcd member is corrupted, you MUST follow the Replacing a Corrupted Etcd Member section of this document to replace it.

Replacing a Corrupted Etcd Member

If you have found that you have a corrupted etcd member, you MUST replace it as soon as possible. Replacement is done by resetting the node where the corrupted member is running, and then letting KubeOne join it to the cluster again.

This guide assumes that only one etcd member is affected, i.e. that the etcd quorum is still satisfied. If your etcd ring lost the quorum, it might not be possible to recover it by following this guide.

First, determine the node where the corrupted etcd member is running. You can do that by running the following command:

kubectl get pods -o wide -n kube-system -l component=etcd

The node name can be found in the NODE column. Write it down as you’ll need it for other commands.

NAME                                                READY   STATUS    RESTARTS   AGE    IP               NODE                                           NOMINATED NODE   READINESS GATES
etcd-ip-172-31-195-53.eu-west-3.compute.internal    1/1     Running   0          108m   172.31.195.53    ip-172-31-195-53.eu-west-3.compute.internal    <none>           <none>
etcd-ip-172-31-196-114.eu-west-3.compute.internal   1/1     Running   0          92m    172.31.196.114   ip-172-31-196-114.eu-west-3.compute.internal   <none>           <none>
etcd-ip-172-31-197-44.eu-west-3.compute.internal    1/1     Running   0          89m    172.31.197.44    ip-172-31-197-44.eu-west-3.compute.internal    <none>           <none>

For the purpose of this guide, we’ll assume that etcd-ip-172-31-196-114.eu-west-3.compute.internal is the corrupted etcd member, and the node where this member is running is ip-172-31-196-114.eu-west-3.compute.internal.

You’ll also need the node’s IP address so you can SSH to it. You can find the IP address by checking the Terraform state or with kubectl. Depending on your setup, you might need to use a bastion host to access the node (in which case you can find the bastion IP address in the Terraform state).
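
For example, the following commands (a sketch; which address types are present depends on your provider and setup) show the node’s addresses with kubectl. Swap ExternalIP for InternalIP if you reach the node over a bastion host:

kubectl get node <node-name> -o wide
kubectl get node <node-name> -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}'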

Drain the node so that all pods get rescheduled to other nodes.

kubectl drain --ignore-daemonsets --delete-emptydir-data <node-name>

You should see output such as the following:

node/ip-172-31-196-114.eu-west-3.compute.internal cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/canal-dg7bk, kube-system/ebs-csi-node-ldjq9, kube-system/kube-proxy-k9gkv, kube-system/node-local-dns-2cqxm
node/ip-172-31-196-114.eu-west-3.compute.internal drained

Once done, SSH to the node:

ssh <username>@<ip-address>
ssh -J <bastion-username>@<bastion-ip> <username>@<ip-address> # if running behind a bastion host (jumphost)

Reset the node by running the kubeadm reset command:

sudo kubeadm reset --force

After that is done, you can close the SSH session. Before proceeding, you’ll need to manually remove the Node object:

kubectl delete node <node-name>
node "ip-172-31-196-114.eu-west-3.compute.internal" deleted

Finally, you can run kubeone apply to rejoin the node:

kubeone apply -m kubeone.yaml -t tf.json

KubeOne should confirm that the node will be joined to the cluster:

The following actions will be taken:
Run with --verbose flag for more information.
        + join control plane node "ip-172-31-196-114.eu-west-3.compute.internal" (172.31.196.114) using 1.23.5
        + ensure machinedeployment "<cluster-name>-eu-west-3a" with 1 replica(s) exists
        + ensure machinedeployment "<cluster-name>-eu-west-3b" with 1 replica(s) exists
        + ensure machinedeployment "<cluster-name>-eu-west-3c" with 1 replica(s) exists

If that’s the case, type yes to proceed. Once KubeOne is done, run kubectl get nodes to confirm that the node has joined the cluster.
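
You can also verify that the rejoined etcd member is healthy and passed its initial corruption check (the pod name will be different in your cluster):

kubectl get pods -n kube-system -l component=etcd
kubectl logs -n kube-system <etcd-pod-name> | grep -i "initial corruption checking passed"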

With that done, your cluster is recovered. Since it’s still running etcd v3.5, you should continue monitoring your etcd ring. If you encounter any issues along the way, please create an issue in the KubeOne repository.