Troubleshooting

This guide helps you diagnose and resolve common issues with machine-controller.

General Debugging

Check Machine-Controller Logs

View the machine-controller logs to identify errors:

kubectl logs -n kube-system deployment/machine-controller -f

For more verbose logging, increase the log level by editing the deployment:

kubectl edit deployment machine-controller -n kube-system

Change the -v flag to a higher value (e.g., -v=6 for debug level).

Inspect Machine Status

Check the status of a specific machine:

kubectl describe machine <machine-name> -n kube-system

Look for:

- Status conditions: Indicate the provisioning state
- Events: Show recent operations and errors
- Provider status: Contains cloud provider-specific information
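
To see only the events for one machine rather than the full describe output, the event list can be filtered by involved object. The helper below is a minimal sketch: the function name `machine_events` is illustrative, and it assumes machine-controller records its events against objects of kind `Machine` in the machine's own namespace.

```shell
# Print events for a single Machine, oldest first.
# Assumption: events are recorded with involvedObject.kind=Machine.
# Usage: machine_events <machine-name> [namespace]
machine_events() {
  local machine="$1" namespace="${2:-kube-system}"
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.kind=Machine,involvedObject.name=${machine}" \
    --sort-by=.lastTimestamp
}
```

If nothing is returned, the events may have expired already; in that case the machine-controller logs are the better source.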

List All Machines

kubectl get machines -n kube-system -o wide

Check for machines stuck in provisioning or with error states.
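
With many machines, scanning the table by eye gets tedious. The following sketch prints only the machines that have no node attached yet or that report an error. It assumes the Machine objects expose `.status.nodeRef.name` and `.status.errorReason`; both field paths and the function name `list_problem_machines` are assumptions to adjust for your CRD version.

```shell
# List Machines without a node reference or with an error reason set.
# Assumption: the status fields below exist on your Machine objects.
# kubectl prints "<none>" in a custom column when the field is missing.
# Usage: list_problem_machines [namespace]
list_problem_machines() {
  local namespace="${1:-kube-system}"
  kubectl get machines -n "$namespace" --no-headers \
    -o custom-columns='NAME:.metadata.name,NODE:.status.nodeRef.name,ERROR:.status.errorReason' |
    awk '$2 == "<none>" || $3 != "<none>" { print $1 }'
}
```

A machine that was created moments ago will also show up here, so treat the output as a list of candidates to inspect rather than a list of confirmed failures.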

Common Issues and Solutions

Machine Stuck in “Provisioning” State

Symptoms:

- Machine remains in the provisioning state for an extended period
- Node doesn’t appear in kubectl get nodes

Possible Causes and Solutions:

Cloud Provider Credentials Invalid
```
kubectl logs -n kube-system deployment/machine-controller | grep -i auth
```
Solution: Verify credentials are correct and have necessary permissions
Instance Creation Failure
```
kubectl describe machine <machine-name> -n kube-system
```
Check events for cloud provider errors (quota limits, invalid instance type, etc.)
Network Connectivity Issues
- Ensure security groups/firewall rules allow required traffic
- Verify subnet has internet access for downloading packages
- Check if cloud-init can reach necessary endpoints
User Data Script Errors
Access the instance via the cloud provider console and check the cloud-init output log:
```
sudo cat /var/log/cloud-init-output.log
```

Machine Creation Fails Immediately

Symptoms:

- Machine enters error state quickly
- Events show validation or creation errors

Common Solutions:

Invalid Configuration
```
kubectl get machine <machine-name> -n kube-system -o yaml
```
Verify all required fields are present and valid
Unsupported Operating System
Check the operating system support matrix
Cloud Provider Quota Exceeded
- Check cloud provider dashboard for quota limits
- Request quota increase if needed

Node Not Joining Cluster

Symptoms:

- Cloud instance is created successfully
- Instance appears in cloud provider console
- Node doesn’t appear in kubectl get nodes

Debugging Steps:

Check kubelet status on the instance
SSH into the instance and run:
```
systemctl status kubelet
journalctl -u kubelet -f
```

Verify bootstrap token
Check if the token is valid:

kubectl get secrets -n kube-system | grep bootstrap-token

Check network connectivity
From the instance, test connectivity to the API server:
```
curl -k https://<api-server>:6443
```

Review cloud-init logs

sudo cat /var/log/cloud-init.log
sudo cat /var/log/cloud-init-output.log
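
The cloud-init logs can be long, so it often helps to pull out only the suspicious lines first. The helper below is a rough sketch: the function name `scan_cloud_init_log` and the search patterns are illustrative heuristics, not an exhaustive list of failure messages.

```shell
# Show the last 20 lines of a log that mention an error, failure, or traceback.
# The patterns are a heuristic starting point; extend them as needed.
# Usage: scan_cloud_init_log [log-file]
scan_cloud_init_log() {
  local log_file="${1:-/var/log/cloud-init-output.log}"
  grep -inE 'error|fail|traceback' "$log_file" | tail -n 20
}
```

Each match is prefixed with its line number, which makes it easier to jump to the surrounding context in the full log. Reading the files under /var/log may require root privileges.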

Machine Stuck in “Deleting” State

Symptoms:

- Machine remains in deleting state
- Cloud instance may or may not exist

Solutions:

Check for finalizers

kubectl get machine <machine-name> -n kube-system -o yaml | grep finalizers

Force delete if necessary (use with caution)

kubectl patch machine <machine-name> -n kube-system -p '{"metadata":{"finalizers":[]}}' --type=merge

Manually delete cloud resources
If the cloud instance still exists, delete it via the cloud provider console or CLI
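
Before removing finalizers, it is worth confirming that a deletion was actually requested and which finalizers are still pending. The sketch below prints both fields side by side; the function name `show_deletion_state` is illustrative.

```shell
# Print the deletion timestamp and remaining finalizers of a Machine.
# An empty deletionTimestamp means no delete request has been issued yet.
# Usage: show_deletion_state <machine-name> [namespace]
show_deletion_state() {
  local machine="$1" namespace="${2:-kube-system}"
  kubectl get machine "$machine" -n "$namespace" \
    -o jsonpath='{"deletionTimestamp: "}{.metadata.deletionTimestamp}{"\n"}{"finalizers: "}{.metadata.finalizers}{"\n"}'
}
```

Clearing the finalizers skips whatever cleanup they were guarding, which is why the cloud instance may need to be removed by hand afterwards.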

MachineDeployment Not Creating Machines

Symptoms:

- MachineDeployment exists but no MachineSets or Machines are created

Solutions:

Check MachineDeployment events

kubectl describe machinedeployment <name> -n kube-system

Verify selector matches template labels

spec:
  selector:
    matchLabels:
      name: my-workers  # Must match template labels
  template:
    metadata:
      labels:
        name: my-workers

Check for validation errors
Look for events indicating schema validation failures
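
To compare the selector with the template labels without reading the whole manifest, both maps can be printed next to each other. This is a minimal sketch using the field paths from the YAML excerpt above; the function name `show_selector_and_labels` is illustrative.

```shell
# Print the selector and the template labels of a MachineDeployment.
# Every key/value pair in the selector must also appear in the template labels.
# Usage: show_selector_and_labels <machinedeployment-name> [namespace]
show_selector_and_labels() {
  local name="$1" namespace="${2:-kube-system}"
  kubectl get machinedeployment "$name" -n "$namespace" \
    -o jsonpath='{"selector: "}{.spec.selector.matchLabels}{"\n"}{"template: "}{.spec.template.metadata.labels}{"\n"}'
}
```

The template may carry extra labels beyond those in the selector; only a selector key that is missing or different in the template is a problem.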

Rolling Update Stuck

Symptoms:

- MachineDeployment update doesn’t complete
- Some old machines remain running

Solutions:

Check update strategy

kubectl get machinedeployment <name> -n kube-system -o yaml

Verify maxSurge and maxUnavailable settings

Check machine creation errors
New machines might be failing to provision:
```
kubectl get machines -n kube-system | grep <deployment-name>
```
Manually delete problematic machines
If machines are stuck, delete them to allow new ones to be created

Cloud Provider Specific Issues

AWS

Issue: Instance creation fails with “unauthorized” error

- Verify that the IAM user or role has the correct permissions
- Check if the instance profile is properly configured
- Ensure AWS credentials are correctly set

Issue: Instances created in wrong subnet

- Verify subnetId in the cloud provider spec
- Check if the subnet exists in the specified availability zone

Azure

Issue: Authentication failures

- Verify tenantID, clientID, clientSecret, and subscriptionID
- Ensure the service principal has the Contributor role on the resource group

Issue: VM size not available

- Check VM size availability in the specified region
- Use az vm list-sizes --location <region> to see available sizes

DigitalOcean

Issue: Rate limiting errors

- The DigitalOcean API enforces rate limits
- Reduce the machine-controller worker count if you are hitting those limits

Issue: Droplet creation fails with “region not available”

- Verify the region slug is correct
- Check if the desired droplet size is available in that region

Google Cloud Platform

Issue: Service account decoding errors

- Ensure the service account JSON is properly base64 encoded
- Use cat sa.json | base64 -w0 (Linux) or cat sa.json | base64 (macOS)
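
If the same command needs to work on both Linux and macOS, one option is to strip newlines from the encoder output instead of relying on a platform-specific flag. The helper below is a sketch of that approach; the function name `encode_service_account` is illustrative.

```shell
# Base64-encode a file as a single line, regardless of whether the local
# base64 implementation wraps its output.
# Usage: encode_service_account sa.json
encode_service_account() {
  local file="$1"
  base64 < "$file" | tr -d '\n'
}
```

Decoding the result and comparing it with the original file is a quick way to confirm that the value was not truncated or wrapped when it was copied.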

Issue: Quota exceeded errors

- Check GCP quotas in the console
- Request quota increase if needed

Hetzner Cloud

Issue: Location or server type not found

- Verify location and server type names are correct
- Use the Hetzner Cloud API or CLI to list available options

Issue: Network attachment fails

- Ensure the network exists in the same location as the server
- Verify the network ID is correct

OpenStack

Issue: Authentication failures

- Verify all OpenStack credentials are correct
- Check if domain and project/tenant names match

Issue: Flavor or image not found

- Ensure flavor and image IDs/names are valid in your OpenStack deployment
- Check if the user has permissions to access these resources

VMware vSphere

Issue: VM creation fails

- Verify vSphere credentials and datacenter configuration
- Check if the template/image exists and is accessible
- Ensure sufficient resources (CPU, memory, storage) are available

Issue: Network configuration errors

- Verify the network name matches the vSphere configuration
- Check if the IP address pool (if using static IPs) has available addresses

Performance Issues

Slow Machine Provisioning

Possible Causes:

Cloud Provider API Rate Limits
- Reduce machine-controller worker count
- Implement backoff strategies
Low Worker Count
Increase the number of workers in the machine-controller deployment:
```
kubectl edit deployment machine-controller -n kube-system
```
Change the -worker-count flag to a higher value (e.g., -worker-count=20)
Slow Image Downloads
- Use images closer to your cloud provider region
- Consider pre-baking images with required packages

High Memory or CPU Usage

Solutions:

- Reduce the worker count if machine-controller is handling too many concurrent operations
- Increase resource limits on the machine-controller deployment
- Check for leaked resources in the cloud provider

Debugging Techniques

Enable Debug Logging

Edit machine-controller deployment:

kubectl edit deployment machine-controller -n kube-system

Change logging level:

args:
- -logtostderr
- -v=6  # Debug level

Collect Diagnostic Information

Create a diagnostic bundle:

# Machine-controller logs
kubectl logs -n kube-system deployment/machine-controller --tail=1000 > mc-logs.txt

# All machines
kubectl get machines -n kube-system -o yaml > machines.yaml

# All machinesets
kubectl get machinesets -n kube-system -o yaml > machinesets.yaml

# All machinedeployments
kubectl get machinedeployments -n kube-system -o yaml > machinedeployments.yaml

# Events
kubectl get events -n kube-system --sort-by='.lastTimestamp' > events.txt

Getting Help

If you’re still experiencing issues:

Check GitHub Issues: Search existing issues
Open a New Issue: Include the following:
- Machine-controller version
- Kubernetes version
- Cloud provider and version
- Relevant logs and error messages
- Machine YAML (sanitized)
Community Support: Join the Kubermatic Slack

Preventive Measures

- Test in staging before production deployments
- Monitor cloud provider quotas and limits
- Set up alerts for machine provisioning failures
- Keep machine-controller updated to the latest stable version
- Document custom configurations for team reference
- Regularly audit MachineDeployments and cloud resources
- Implement proper RBAC to control machine creation
