Upgrading AKS using Terraform
Summary: In this post I'll show you how to upgrade your AKS cluster using Terraform.
Date: 23 February 2025
Before we start an upgrade, it's always good to gather some information about the current state of the cluster, as well as about the new version and possible problems we might run into. Let's start with the release notes and related links, and then continue with some commands to gather information.
It is also important to know that, when following this post, we will upgrade the following components in this order (a quick way to check the current version of each is shown right after this list):
- Control plane
- System nodepool
- User nodepool
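If you want to record the current version of each of these components before changing anything, a couple of Azure CLI queries will do. This is a minimal sketch, assuming the cluster and resource group names used later in this post:

# Current control plane version
az aks show --resource-group rg-privatecluster --name aks-privatecluster \
  --query kubernetesVersion --output tsv

# Current orchestrator version per node pool (system and user)
az aks nodepool list --resource-group rg-privatecluster --cluster-name aks-privatecluster \
  --query "[].{name:name, mode:mode, version:orchestratorVersion}" --output table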
Getting Info
Once you've checked the versions and release notes and you're sure you want to continue, you can start with the following checks:
First we check the available versions in our region. The output of the following command can be shown as a table, so we can easily see which version we want to upgrade to.
az aks get-versions --location westeurope --output table
KubernetesVersion    Upgrades                                                                          SupportPlan
-------------------  --------------------------------------------------------------------------------  --------------------------------------
1.31.3               None available                                                                    KubernetesOfficial
1.31.2               1.31.3                                                                            KubernetesOfficial
1.31.1               1.31.2, 1.31.3                                                                    KubernetesOfficial
1.30.7               1.31.1, 1.31.2, 1.31.3                                                            KubernetesOfficial, AKSLongTermSupport
1.30.6               1.30.7, 1.31.1, 1.31.2, 1.31.3                                                    KubernetesOfficial, AKSLongTermSupport
1.30.5               1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3                                            KubernetesOfficial, AKSLongTermSupport
1.30.4               1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3                                    KubernetesOfficial, AKSLongTermSupport
1.30.3               1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3                            KubernetesOfficial, AKSLongTermSupport
1.30.2               1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3                    KubernetesOfficial, AKSLongTermSupport
1.30.1               1.30.2, 1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3            KubernetesOfficial, AKSLongTermSupport
1.30.0               1.30.1, 1.30.2, 1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3    KubernetesOfficial, AKSLongTermSupport
Note that the output is shortened for readability.
Now that we know which versions are available, we can check the possible upgrades for our specific cluster:
az aks get-upgrades --resource-group rg-privatecluster --name aks-privatecluster --output table
Name     ResourceGroup      MasterVersion    Upgrades
-------  -----------------  ---------------  ------------------------------------------------------------------------------------
default  rg-privatecluster  1.27.7           1.28.0, 1.28.3, 1.28.5, 1.28.9, 1.28.10, 1.28.11, 1.28.12, 1.28.13, 1.28.14, 1.28.15
Check the available upgrades. Our current version is 1.27, so we can only go to 1.28; it's not possible to skip minor versions.
Now we need one more check: pod disruption budgets (PDBs). These can block the upgrade process, so it's good to review them before starting the upgrade:
azadmin@vm-jumpbox:~$ kubectl get pdb -A
NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
postgresql           1               N/A               1                     199d
postgresql-primary   1               N/A               0                     199d
The second PDB will block the upgrade, as it allows no disruptions.
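If the affected workload can tolerate a brief disruption, one alternative to forcing the upgrade (which is what we ended up doing below) is to temporarily relax the blocking PDB. This is a hedged sketch, assuming the dev namespace that shows up in the error message later in this post:

# Inspect the blocking PDB first (namespace assumed from the error shown later)
kubectl get pdb postgresql-primary -n dev -o yaml

# Temporarily allow a disruption by lowering minAvailable...
kubectl patch pdb postgresql-primary -n dev --type merge -p '{"spec":{"minAvailable":0}}'

# ...and restore the original value once the upgrade has finished
kubectl patch pdb postgresql-primary -n dev --type merge -p '{"spec":{"minAvailable":1}}'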
As a last check, we will note the current node status:
azadmin@vm-jumpbox:~$ kubectl get nodes -o wide
NAME                             STATUS   ROLES   AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-system-34974014-vmss000000   Ready    agent   271d   v1.27.7   172.16.48.103   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-system-34974014-vmss000001   Ready    agent   271d   v1.27.7   172.16.48.5     <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-system-34974014-vmss000002   Ready    agent   270d   v1.27.7   172.16.48.54    <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss000001     Ready    agent   270d   v1.27.7   172.16.64.4     <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss000002     Ready    agent   270d   v1.27.7   172.16.64.102   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001r     Ready    agent   200d   v1.27.7   172.16.64.53    <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001t     Ready    agent   179d   v1.27.7   172.16.64.249   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001u     Ready    agent   176d   v1.27.7   172.16.65.42    <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001v     Ready    agent   172d   v1.27.7   172.16.64.151   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001x     Ready    agent   136d   v1.27.7   172.16.65.91    <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001y     Ready    agent   31d    v1.27.7   172.16.65.189   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001z     Ready    agent   31d    v1.27.7   172.16.65.140   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss000020     Ready    agent   8d     v1.27.7   172.16.64.200   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
As you can see, it's been a while since the last upgrade. Now we can start with the actual upgrade.
Upgrade Using Terraform
We have some modules in place which manage the AKS cluster. Both the cluster and the node pools use the same variable for the Kubernetes version, so we only need to change it in one place.
# Variables file
# Azure Kubernetes Service (AKS)
kubernetes_version = "1.27.7"

# Module aks, shortened for readability
resource "azurerm_kubernetes_cluster" "aks_cluster" {
  name                = var.name
  location            = var.location
  resource_group_name = var.resource_group_name
  node_resource_group = var.node_resource_group_name
  kubernetes_version  = var.kubernetes_version

  default_node_pool {
    name                 = var.default_node_pool_name
    orchestrator_version = var.kubernetes_version
  }
}

# Module node_pool, shortened for readability
resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
  kubernetes_cluster_id = var.kubernetes_cluster_id
  name                  = var.name
  orchestrator_version  = var.orchestrator_version
}
Now we can change the version in the variables file:
# Variables file
# Azure Kubernetes Service (AKS)
kubernetes_version = "1.28.15"
Timeout Considerations
Upgrading an AKS cluster can be time-consuming, especially on larger clusters. When doing all upgrades at once as shown above, an upgrade of a cluster with about 10 nodes can take anywhere from one to several hours. In our case we were running the Terraform AKS upgrade from an Azure DevOps pipeline, so we had to change multiple timeouts to ensure a smooth upgrade.
Azure DevOps Pipeline Timeouts
Note that raising the timeouts in an Azure DevOps pipeline requires either a paid offering or a self-hosted agent.
There are two timeouts to set: the timeout of the job and the timeout of the task. I prefer a timeout of 0 (no limit) for the job, and then one that's appropriate for the task. In the example below I've set it to 3 hours, which should be enough for most upgrades.
- stage: terraform_plan_apply
  displayName: 'Terraform Plan or Apply'
  jobs:
    - job: terraform_plan_apply
      displayName: 'Terraform Plan or Apply'
      timeoutInMinutes: 0
      steps:
        - task: AzureCLI@2
          displayName: 'Terraform Apply'
          timeoutInMinutes: 180
          inputs:
            azureSubscription: '$(backendServiceArm)'
            scriptType: 'bash'
            scriptLocation: 'inlineScript'
            inlineScript: |
              terraform apply \
                -var-file=env/dev.tfvars \
                -compact-warnings \
                -input=false \
                -auto-approve
            workingDirectory: $(workingDirectory)
Note that the pipeline example has tasks removed for readability. See Terraform in Azure DevOps for various working examples of Azure DevOps pipelines for terraform.
Terraform Timeout
Terraform also has timeouts, which can be changed for some resources. The Terraform Registry shows whether a resource's timeouts can be configured. In our case, the user node pool could take a long time, so we've set the timeouts there as well.
# Module node_pool, shortened for readability
resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
  kubernetes_cluster_id = var.kubernetes_cluster_id
  name                  = var.name
  orchestrator_version  = var.orchestrator_version

  timeouts {
    create = "2h"
    update = "2h"
  }
}
This prevents the Terraform error "polling after CreateOrUpdate: context deadline exceeded".
Pod Disruption Budgets
As mentioned before, pod disruption budgets can block the upgrade process. If you have a PDB that blocks the upgrade, you can either delete it or change its settings. In our case we had a PDB that blocked the upgrade, which resulted in the following Terraform error:
"message": "Upgrade is blocked due to invalid Pod Disruption Budgets (PDBs). Please review the PDB spec to allow disruptions during upgrades. To bypass this error, set forceUpgrade in upgradeSettings.overrideSettings. Bypassing this error without updating the PDB may result in drain failures during upgrade process. Invalid PDBs details: 1 error occurred:\n\t* PDB dev/postgresql-primary has minAvailable(1) \u003e= expectedPods(1) can't proceed with put operation\n\n",
In our case, we decided to force the upgrade. This is done by setting a temporary upgrade override, using the Azure CLI:
az aks update --name aks-privatecluster --resource-group rg-privatecluster --enable-force-upgrade --upgrade-override-until 2025-02-24T18:00:00Z
You can check these settings by querying the cluster using the Azure CLI:
azadmin@vm-jumpbox:~$ az aks show --resource-group rg-privatecluster --name aks-privatecluster --query upgradeSettings
{
  "overrideSettings": {
    "forceUpgrade": true,
    "until": "2025-02-24T18:00:00+00:00"
  }
}
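The override expires automatically at the until timestamp, but if you want to clear it explicitly once the upgrade has finished, recent Azure CLI versions also have a flag for that. Treat this as a sketch and check that your CLI version supports it:

# Optional cleanup after the upgrade (flag availability depends on your Azure CLI version)
az aks update --name aks-privatecluster --resource-group rg-privatecluster --disable-force-upgrade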
Terraform Plan
The output of terraform plan will depend on your environment. Ideally, it will only show the upgrade of the cluster and the node pools. However, some Azure resources might depend on the cluster and will need a refresh or update, which can cause a lot of changes in the plan:
Plan: 19 to add, 5 to change, 19 to destroy.
In our case, this was because we had enabled workload identity on the cluster, and all the extra updates were due to the workload identities and the federated credentials. Normally this all goes well, so don't worry and continue. We did have one issue with a federated credential, which is covered below under Troubleshooting.
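When the plan is that noisy, it can help to write it to a file and review it before anything is applied. A small sketch, using the same var file as the pipeline example above and assuming you run Terraform locally rather than from the pipeline:

# Write the plan to a file and review it before applying
terraform plan -var-file=env/dev.tfvars -out=tfplan
terraform show tfplan

# Applying the saved plan guarantees you apply exactly what you reviewed
terraform apply tfplan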
Terraform Apply
During the terraform apply, the control plane will be upgraded first, followed by the system node pool and then the user node pool. You can monitor the upgrade by checking the node status:
azadmin@vm-jumpbox:~$ kubectl get nodes -o wide
NAME                             STATUS                     ROLES   AGE     VERSION    INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-system-34974014-vmss000000   Ready                      agent   76m     v1.28.15   172.16.48.103   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-system-34974014-vmss000001   Ready                      agent   73m     v1.28.15   172.16.48.5     <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-system-34974014-vmss000002   Ready                      agent   67m     v1.28.15   172.16.48.54    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss000001     Ready                      agent   54m     v1.28.15   172.16.64.4     <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss000002     Ready                      agent   43m     v1.28.15   172.16.64.102   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001r     Ready                      agent   38m     v1.28.15   172.16.64.53    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001t     Ready                      agent   33m     v1.28.15   172.16.64.249   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001u     Ready                      agent   27m     v1.28.15   172.16.65.42    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001v     Ready                      agent   12m     v1.28.15   172.16.64.151   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001x     Ready                      agent   7m51s   v1.28.15   172.16.65.91    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001y     Ready                      agent   4m2s    v1.28.15   172.16.65.189   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001z     Ready                      agent   25s     v1.28.15   172.16.65.140   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss000020     Ready,SchedulingDisabled   agent   8d      v1.27.7    172.16.64.200   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss000022     Ready                      agent   62m     v1.28.15   172.16.65.238   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
You can also check the status of the upgrade in the Azure CLI:
azadmin@vm-jumpbox:~$ az aks nodepool show --resource-group rg-privatecluster --cluster-name aks-privatecluster --name user --output table
Name    OsType    KubernetesVersion    VmSize           Count    MaxPods    ProvisioningState    Mode
------  --------  -------------------  ---------------  -------  ---------  -------------------  ------
user    Linux     1.28.15              Standard_D4s_v3  11       50         Upgrading            User
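If you prefer a single view of both node pools while the upgrade runs, a query like the one below (same cluster names as above) can be re-run every few minutes:

# Re-run periodically to follow the upgrade across all node pools
az aks nodepool list --resource-group rg-privatecluster --cluster-name aks-privatecluster \
  --query "[].{name:name, state:provisioningState, version:orchestratorVersion}" --output table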
Once the upgrade is finished, all nodes have their version upgraded and the nodepool ProvisioningState is Succeeded:
azadmin@vm-jumpbox:~$ az aks nodepool show --resource-group rg-privatecluster --cluster-name aks-privatecluster --name user --output table
Name    OsType    KubernetesVersion    VmSize           Count    MaxPods    ProvisioningState    Mode
------  --------  -------------------  ---------------  -------  ---------  -------------------  ------
user    Linux     1.28.15              Standard_D4s_v3  10       50         Succeeded            User
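As a final check, verify the control plane version and, since we forced the upgrade past a blocking PDB, have a look at the affected namespace. A short sketch, assuming the dev namespace from the error above:

# Confirm the control plane is on the new version and the cluster is healthy
az aks show --resource-group rg-privatecluster --name aks-privatecluster \
  --query "{controlPlane:kubernetesVersion, state:provisioningState}" --output table

# Because we bypassed a PDB, check that the protected workload came back up
kubectl get pdb,pods -n dev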
Troubleshooting
As mentioned before, we had an issue with a federated credential, probably caused by the timeouts we encountered. Once the upgrade was done, we checked whether everything was OK by running a terraform plan, which reported that one of the federated credentials was missing; when running the apply, however, it said the resource already existed. We fixed this by importing the resource:
# Import federated credential
terraform import -var-file=env/dev.tfvars \
  'module.federated_identity_credentials["fc-grafana"].azurerm_federated_identity_credential.federated_identity_credential' \
  /subscriptions/30b3c71d-a123-a123-a123-abcd12345678/resourceGroups/rg-privatecluster/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-workload-grafana/federatedIdentityCredentials/fc-grafana
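After the import, a plan with the same var file should come back clean, confirming the Terraform state and the Azure resources are in sync again:

terraform plan -var-file=env/dev.tfvars
# Expected output ends with: No changes. Your infrastructure matches the configuration.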