
Known Issues

This page documents the list of known issues in Kubermatic KubeOne along with possible workarounds and recommendations.

This list applies to KubeOne 1.8 release. For KubeOne 1.7, please consider the v1.7 version of this document. For earlier releases, please consult the appropriate changelog.

Invalid cluster name set on OpenStack CCM and OpenStack Cinder CSI

Status: Fixed in KubeOne 1.7.2
Severity: Critical
GitHub issue: https://github.com/kubermatic/kubeone/issues/2976

Who’s affected by this issue?

This issue affects only OpenStack clusters. The following OpenStack users are affected by this issue:

  • Users who provisioned their OpenStack cluster with KubeOne 1.6 or earlier, then upgraded to KubeOne 1.7.0 or 1.7.1 and ran kubeone apply two or more times
  • Users who used KubeOne 1.7.0 or 1.7.1 to provision their OpenStack cluster

Description

The OpenStack CCM and Cinder CSI take a cluster name property that is used when creating OpenStack Load Balancers and Volumes. The cluster name is provided as a flag on the OpenStack CCM DaemonSet and the Cinder CSI Controller Deployment. This property is used:

  • for naming Octavia Load Balancers and Load Balancer Listeners
  • for tagging Volumes with the cinder.csi.openstack.org/cluster tag
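For illustration, the cluster name typically ends up embedded in the Octavia load balancer name (the exact pattern, commonly kube_service_<cluster-name>_<namespace>_<service-name>, may vary by CCM version), so you can often spot it with the openstack CLI:

```shell
# List Octavia load balancer names; the segment after "kube_service_"
# should match the cluster name you configured in KubeOne
openstack loadbalancer list -c name
```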

Due to a bug introduced in KubeOne 1.7.0, the cluster name property is unconditionally set to kubernetes instead of the desired cluster’s name. As a result:

  • Existing Octavia Load Balancers will fail to reconcile
  • Newly created Load Balancers will have an incorrect name
  • Volumes should not be affected, apart from the cinder.csi.openstack.org/cluster tag having a wrong value

What’s considered a valid/desired cluster name?

In general, the cluster name property must be equal to the cluster name provided to KubeOne (either via the KubeOneCluster manifest (kubeone.yaml by default) or via the cluster_name Terraform variable). This is especially important if you have multiple Kubernetes clusters in the same OpenStack project.
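For reference, the cluster name is the top-level name field in the KubeOneCluster manifest; a minimal sketch (values are illustrative):

```yaml
apiVersion: kubeone.k8c.io/v1beta2
kind: KubeOneCluster
name: test-1        # this is the cluster name the CCM and CSI should receive
versions:
  kubernetes: "1.27.5"
```

If you use the Terraform integration instead, the same value comes from the cluster_name variable.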

How to check if you’re affected?

You might be affected only if you’re using KubeOne 1.7.

Run the following kubectl command, with kubectl pointing to your potentially affected cluster:

kubectl get daemonset \
    --namespace kube-system \
    openstack-cloud-controller-manager \
    --output=jsonpath='{.spec.template.spec.containers[?(@.name=="openstack-cloud-controller-manager")].env[?(@.name=="CLUSTER_NAME")].value}'

If you get the following output:

  • kubernetes: you’re affected by this issue
  • a valid cluster name (as described in the previous section): you’re NOT affected by this issue
  • if you don’t get anything, you’re most likely not running KubeOne 1.7 yet
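The same check can be run against the Cinder CSI Controller Deployment (the container name matches the one used by the patch commands in the Mitigation section):

```shell
kubectl get deployment \
    --namespace kube-system \
    openstack-cinder-csi-controllerplugin \
    --output=jsonpath='{.spec.template.spec.containers[?(@.name=="cinder-csi-plugin")].env[?(@.name=="CLUSTER_NAME")].value}'
```

Interpret the output the same way as for the CCM DaemonSet.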

Regardless of whether you’re affected, we strongly recommend upgrading to KubeOne 1.7.2 or newer as soon as possible!

Mitigation

If you’re affected by this issue, we strongly recommend taking the mitigation steps.

Please be aware that changing the cluster name might make some Octavia Load Balancers fail to reconcile. Volumes shouldn’t be affected.

First, determine your desired cluster name. The safest way is to dump the whole KubeOneCluster manifest using the kubeone config dump command (make sure to replace tf.json and kubeone.yaml with valid files before running the command):

kubeone config dump -t tf.json -m kubeone.yaml | grep "name:"

You’ll get output such as:

  - name: default-storage-class
    hostname: test-1-cp-0
    sshUsername: ubuntu
    hostname: test-1-cp-1
    sshUsername: ubuntu
    hostname: test-1-cp-2
    sshUsername: ubuntu
- name: test-1-pool1
name: test-1

Note the top-level name value, in this case test-1 – this is your desired cluster name.

The next step is to patch the OpenStack CCM DaemonSet and Cinder CSI Deployment (replace <<REPLACE_ME>> with your cluster’s name in the following two commands):

kubectl patch --namespace kube-system daemonset openstack-cloud-controller-manager --type='strategic' --patch='
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "openstack-cloud-controller-manager",
            "env": [
              {
                "name": "CLUSTER_NAME",
                "value": "<<REPLACE_ME>>"
              }
            ]
          }
        ]
      }
    }
  }
}'
kubectl patch --namespace kube-system deployment openstack-cinder-csi-controllerplugin --type='strategic' --patch='
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "cinder-csi-plugin",
            "env": [
              {
                "name": "CLUSTER_NAME",
                "value": "<<REPLACE_ME>>"
              }
            ]
          }
        ]
      }
    }
  }
}'

You should see the following output from these two commands:

daemonset.apps/openstack-cloud-controller-manager patched
deployment.apps/openstack-cinder-csi-controllerplugin patched

At this point, you need to remediate errors and failed reconciliations that might be caused by this change. As mentioned earlier, Volumes are not affected by this change, but Octavia Load Balancers might be.

The easiest way to determine if you have Load Balancers affected by this change is to look for SyncLoadBalancerFailed events. You can do that using the following command:

kubectl get events --all-namespaces --field-selector reason=SyncLoadBalancerFailed

You might get output like this:

NAMESPACE   LAST SEEN   TYPE      REASON                   OBJECT            MESSAGE
default     2s          Warning   SyncLoadBalancerFailed   service/nginx-2   Error syncing load balancer: failed to ensure load balancer: the listener port 80 already exists
default     4h49m       Warning   SyncLoadBalancerFailed   service/nginx     Error syncing load balancer: failed to ensure load balancer: the listener port 80 already exists
default     3h7m        Warning   SyncLoadBalancerFailed   service/nginx     Error syncing load balancer: failed to ensure load balancer: the listener port 80 already exists
default     89m         Warning   SyncLoadBalancerFailed   service/nginx     Error syncing load balancer: failed to ensure load balancer: the listener port 80 already exists
default     22m         Warning   SyncLoadBalancerFailed   service/nginx     Error syncing load balancer: failed to ensure load balancer: the listener port 80 already exists
default     3m1s        Warning   SyncLoadBalancerFailed   service/nginx     Error syncing load balancer: failed to ensure load balancer: the listener port 80 already exists

Only events last seen after you made the cluster name change are relevant. Other events can be ignored, although you might want to describe those Services and ensure that you see an EnsuredLoadBalancer event.

For Services that are showing SyncLoadBalancerFailed, you will need to take steps depending on the error message. For example, if the error message is the listener port 80 already exists, you can manually delete the listener and OpenStack CCM will create a valid one again after some time.
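As a sketch using the openstack CLI (the IDs are placeholders; double-check which listener is the stale one before deleting anything):

```shell
# Find the affected load balancer and its listeners
openstack loadbalancer list
openstack loadbalancer listener list --loadbalancer <load-balancer-id>

# Delete the conflicting listener; the CCM should recreate it with the
# correct cluster name on the next reconciliation
openstack loadbalancer listener delete <listener-id>
```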

KubeOne is unable to upgrade AzureDisk CSI driver upon upgrading KubeOne from 1.6 to 1.7

Status: Fixed in KubeOne 1.7.2
Severity: Low
GitHub issue: https://github.com/kubermatic/kubeone/issues/2971

Who’s affected by this issue?

Users who used KubeOne 1.6 or earlier to provision a cluster running on Microsoft Azure are affected by this issue.

Description

The AzureDisk CSI driver got updated to a newer version in KubeOne 1.7. This upgrade accidentally changed the csi-azuredisk-node-secret-binding ClusterRoleBinding object so that the referenced role (roleRef) is csi-azuredisk-node-role instead of csi-azuredisk-node-secret-role. Given that the referenced role is immutable, KubeOne wasn’t able to upgrade the AzureDisk CSI driver when upgrading KubeOne from 1.6 to 1.7.0 or 1.7.1.
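To check which role your cluster’s binding currently references, you can inspect the roleRef directly:

```shell
# Prints the referenced ClusterRole. csi-azuredisk-node-secret-role is the
# pre-1.7 binding that blocks the upgrade (and that KubeOne 1.7.2 removes);
# csi-azuredisk-node-role means the new binding is already in place.
kubectl get clusterrolebinding csi-azuredisk-node-secret-binding \
    --output=jsonpath='{.roleRef.name}'
```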

Recommendation

If you’re affected by this issue, it’s recommended to upgrade to KubeOne 1.7.2 or newer. KubeOne 1.7.2 removes the csi-azuredisk-node-secret-binding ClusterRoleBinding object if the referenced role is csi-azuredisk-node-secret-role to allow the upgrade process to proceed.

The issue can also be mitigated manually by removing the ClusterRoleBinding object if KubeOne is stuck trying to upgrade the AzureDisk CSI driver:

kubectl delete clusterrolebinding csi-azuredisk-node-secret-binding

node-role.kubernetes.io/master taint not removed on upgrade when using KubeOne 1.6.0-rc.1

Status: Fixed in KubeOne 1.6.0
Severity: Critical
GitHub issue: https://github.com/kubermatic/kubeone/pull/2688

Users who:

  • used KubeOne 1.6.0-rc.1 or built KubeOne manually on commit up to 8291a9f, AND
  • provisioned clusters running Kubernetes 1.25 OR upgraded clusters running Kubernetes 1.24 to Kubernetes 1.25

are affected by this issue.

Description

Kubernetes removed the node-role.kubernetes.io/master taint in 1.25. However, a bug in KubeOne kept enforcing this taint up until Kubernetes 1.26. Although KubeOne no longer applies that taint to 1.26 clusters, kubeadm will not remove it upon upgrading to 1.26, because the migration logic that removed the taint was itself removed in 1.26.

Recommendation

If you’re affected by this issue, you have to manually untaint affected control plane nodes. You can do that by using the following command:

kubectl taint nodes node-role.kubernetes.io/master- --all

Not doing so might cause a major outage, as both KubeOne and kubeadm stop tolerating the node-role.kubernetes.io/master taint.
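To verify that the taint is gone from all control plane nodes:

```shell
# Show each node together with its taint keys;
# node-role.kubernetes.io/master should no longer appear
kubectl get nodes \
    --output=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
```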

Cilium CNI is not working on clusters running CentOS 7

Status: Known Issue
Severity: Low
GitHub issue: N/A

Description

Cilium CNI is not supported on CentOS 7 because CentOS 7 ships a kernel that is too old for Cilium. For more details, consult the official Cilium documentation.

Recommendation

Please consider using an operating system with a newer kernel, such as Ubuntu, Rocky Linux, or Flatcar. See the official Cilium documentation for a list of operating systems and kernel versions supported by Cilium.

Pod connectivity is broken for Calico VXLAN clusters

Status: Being Investigated
Severity: High for clusters using the Calico VXLAN addon
GitHub issue: https://github.com/kubermatic/kubeone/issues/2192

Description

Clusters running Calico VXLAN might not be able to reach ClusterIP Services from the node where a pod is running.

Recommendation

We do NOT recommend upgrading to KubeOne 1.5 or 1.6 at this time if you’re using Calico VXLAN. Follow the linked GitHub issue and this page for updates.

KubeOne is failing to provision a cluster on upgraded Flatcar VMs

Status: Workaround available
Severity: Low
GitHub issue: https://github.com/kubermatic/kubeone/issues/2318

Description

KubeOne is failing to provision a cluster on Flatcar VMs that are upgraded from a version prior to 2969.0.0 to a newer version. This only affects VMs that were never used with KubeOne; existing KubeOne clusters are not affected by this issue.

Recommendation

If you’re affected by this issue, we recommend creating VMs with a newer Flatcar version or following the cgroups v2 migration instructions.
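To check which cgroup hierarchy a Flatcar VM is running, a quick heuristic (not KubeOne-specific) is:

```shell
# cgroup2fs => unified cgroups v2 hierarchy; tmpfs => legacy cgroups v1
stat -fc %T /sys/fs/cgroup
```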

Internal Kubernetes endpoints unreachable on vSphere with Cilium/Canal

Status: Workaround available
Severity: Low
GitHub issue: https://github.com/cilium/cilium/issues/21801

Description

Symptoms

  • Unable to perform CRUD operations on resources governed by webhooks (e.g. ValidatingWebhookConfiguration, MutatingWebhookConfiguration). The following error is observed:
Internal error occurred: failed calling webhook "webhook-name": failed to call webhook: Post "https://webhook-service-name.namespace.svc:443/webhook-endpoint": context deadline exceeded
  • Unable to reach internal Kubernetes endpoints from pods/nodes.
  • ICMP is working but TCP/UDP is not.

Cause

On a recent enough VMware hardware compatibility version (i.e. >=15, possibly >=14), CNI connectivity breaks because of hardware segmentation offload. cilium-health status shows ICMP connectivity working, but not TCP connectivity. cilium-health status may also fail completely.

Recommendation

Disable the relevant segmentation offload features on the affected interface (ens192 in this example):

sudo ethtool -K ens192 tx-udp_tnl-segmentation off
sudo ethtool -K ens192 tx-udp_tnl-csum-segmentation off

These flags are related to the hardware segmentation offload done by the vSphere driver VMXNET3. We have observed this issue for both Cilium and Canal CNI running on Ubuntu 22.04.
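Note that ethtool settings do not survive a reboot on their own. One common way to persist them, sketched here with an illustrative unit name and interface (this is generic systemd practice, not a KubeOne-managed mechanism), is a small oneshot unit:

```ini
# /etc/systemd/system/disable-vmxnet3-offload.service (illustrative name)
[Unit]
Description=Disable UDP tunnel segmentation offload on ens192
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K ens192 tx-udp_tnl-segmentation off
ExecStart=/usr/sbin/ethtool -K ens192 tx-udp_tnl-csum-segmentation off

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now disable-vmxnet3-offload.service.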

We have two options to configure these flags for KubeOne installations:

References