
Upgrading AKS using Terraform

Summary: In this post I'll show you how to upgrade your AKS cluster using Terraform.
Date: 23 February 2025

Before we start the upgrade, it's always good to gather some information about the current state of the cluster, as well as about the new version and possible problems we might run into. Let's start with some links to the release notes and then continue with some commands to gather information.

It is also important to know that when following this post, we will upgrade the following components in this order:

  1. Control plane
  2. System nodepool
  3. User nodepool

Getting Info

Once you've checked the versions and release notes and you're sure you want to continue, you can start with the following checks:

First, we check the available versions in our region. The output of the following command can be shown as a nice table, so we can easily see which version we want to upgrade to.

az aks get-versions --location westeurope --output table
 
KubernetesVersion    Upgrades                                                                           SupportPlan
-------------------  ---------------------------------------------------------------------------------  --------------------------------------
1.31.3               None available                                                                     KubernetesOfficial
1.31.2               1.31.3                                                                             KubernetesOfficial
1.31.1               1.31.2, 1.31.3                                                                     KubernetesOfficial
1.30.7               1.31.1, 1.31.2, 1.31.3                                                             KubernetesOfficial, AKSLongTermSupport
1.30.6               1.30.7, 1.31.1, 1.31.2, 1.31.3                                                     KubernetesOfficial, AKSLongTermSupport
1.30.5               1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3                                             KubernetesOfficial, AKSLongTermSupport
1.30.4               1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3                                     KubernetesOfficial, AKSLongTermSupport
1.30.3               1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3                             KubernetesOfficial, AKSLongTermSupport
1.30.2               1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3                     KubernetesOfficial, AKSLongTermSupport
1.30.1               1.30.2, 1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3             KubernetesOfficial, AKSLongTermSupport
1.30.0               1.30.1, 1.30.2, 1.30.3, 1.30.4, 1.30.5, 1.30.6, 1.30.7, 1.31.1, 1.31.2, 1.31.3     KubernetesOfficial, AKSLongTermSupport
Note that the output is shortened for readability.

Now that we know which versions are available, we can check the possible upgrades for our specific cluster:

az aks get-upgrades --resource-group rg-privatecluster --name aks-privatecluster --output table
 
Name     ResourceGroup      MasterVersion    Upgrades
-------  -----------------  ---------------  ------------------------------------------------------------------------------------
default  rg-privatecluster  1.27.7           1.28.0, 1.28.3, 1.28.5, 1.28.9, 1.28.10, 1.28.11, 1.28.12, 1.28.13, 1.28.14, 1.28.15

Check the available upgrades. Our current version is 1.27, so we can only go to 1.28; it's not possible to skip minor versions.
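
If you also want to see which version each node pool is currently running, you can list the node pools as well. A minimal sketch (the --query expression is just an illustration, adjust it to your own needs):

az aks nodepool list --resource-group rg-privatecluster --cluster-name aks-privatecluster --query "[].{Name:name, Version:orchestratorVersion}" --output table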

Now we need one more check: pod disruption budgets. These can block the upgrade process, so it's good to review them before starting the upgrade.

Check for pod disruption budgets:

azadmin@vm-jumpbox:~$ kubectl get pdb -A
NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
postgresql           1               N/A               1                     199d
postgresql-primary   1               N/A               0                     199d

Check the pod disruption budgets: the second PDB will block the upgrade, as it allows no disruptions.
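
If you prefer to resolve this before the upgrade instead of forcing the upgrade later, you can temporarily relax the blocking PDB and revert it afterwards. A minimal sketch (the dev namespace comes from the error message shown further down):

# Temporarily allow disruptions for the blocking PDB
kubectl patch pdb postgresql-primary -n dev --type merge -p '{"spec":{"minAvailable":0}}'
# Revert after the upgrade has finished
kubectl patch pdb postgresql-primary -n dev --type merge -p '{"spec":{"minAvailable":1}}'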

As a last check, we will note the current node status:

azadmin@vm-jumpbox:~$ kubectl get nodes -o wide
NAME                             STATUS   ROLES   AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-system-34974014-vmss000000   Ready    agent   271d   v1.27.7   172.16.48.103   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-system-34974014-vmss000001   Ready    agent   271d   v1.27.7   172.16.48.5     <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-system-34974014-vmss000002   Ready    agent   270d   v1.27.7   172.16.48.54    <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss000001     Ready    agent   270d   v1.27.7   172.16.64.4     <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss000002     Ready    agent   270d   v1.27.7   172.16.64.102   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001r     Ready    agent   200d   v1.27.7   172.16.64.53    <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001t     Ready    agent   179d   v1.27.7   172.16.64.249   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001u     Ready    agent   176d   v1.27.7   172.16.65.42    <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001v     Ready    agent   172d   v1.27.7   172.16.64.151   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001x     Ready    agent   136d   v1.27.7   172.16.65.91    <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001y     Ready    agent   31d    v1.27.7   172.16.65.189   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss00001z     Ready    agent   31d    v1.27.7   172.16.65.140   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss000020     Ready    agent   8d     v1.27.7   172.16.64.200   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1

As you can see, it's been a while since the last upgrade. Now we can start with the actual upgrade.

Upgrade Using Terraform

We have some modules in place that manage the AKS cluster. Both the cluster and the node pools use the same variable for the version, so we only need to change it in one place.

# Variables file
# Azure Kubernetes Service (AKS)
kubernetes_version                          = "1.27.7"
 
# Module aks, shortened for readability
resource "azurerm_kubernetes_cluster" "aks_cluster" {
  name                             = var.name
  location                         = var.location
  resource_group_name              = var.resource_group_name
  node_resource_group              = var.node_resource_group_name
  kubernetes_version               = var.kubernetes_version
 
  default_node_pool {
    name                         = var.default_node_pool_name
    orchestrator_version         = var.kubernetes_version
  }
}
 
# Module node_pool, shortened for readability
resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
  kubernetes_cluster_id        = var.kubernetes_cluster_id
  name                         = var.name
  orchestrator_version         = var.orchestrator_version
}

Now we can change the version in the variables file:

# Variables file
# Azure Kubernetes Service (AKS)
kubernetes_version                          = "1.28.15"

Timeout Considerations

Upgrading an AKS cluster can be time consuming, especially for larger clusters. When doing all upgrades at once as shown above, an upgrade of a cluster with about 10 nodes can take anywhere from one to several hours. In our case we were running the Terraform AKS upgrade from an Azure DevOps pipeline, so we had to change multiple timeouts to ensure a smooth upgrade.

Azure DevOps Pipeline Timeouts

Note that changing the timeouts in an Azure DevOps pipeline either requires a paid offering or a self-hosted agent.

There are two timeouts to set: the timeout of the job and the timeout of the task. I prefer a timeout of 0 (unlimited) for the job, and then one that's appropriate for the task. In the example below I've set the task timeout to 3 hours, which should be enough for most upgrades.

  - stage: terraform_plan_apply
    displayName: 'Terraform Plan or Apply'
    jobs:
      - job: terraform_plan_apply
        displayName: 'Terraform Plan or Apply'
        timeoutInMinutes: 0
        steps:
          - task: AzureCLI@2
            displayName:  'Terraform Apply'
            timeoutInMinutes: 180
            inputs:
              azureSubscription: '$(backendServiceArm)'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                terraform apply \
                  -var-file=env/dev.tfvars \
                  -compact-warnings \
                  -input=false \
                  -auto-approve
              workingDirectory: $(workingDirectory)
Note that the pipeline example has tasks removed for readability. See Terraform in Azure DevOps for various working examples of Azure DevOps pipelines for Terraform.

Terraform Timeout

Terraform also has timeouts, which can be changed for some resources. The Terraform registry shows if a resource timeout can be configured. In our case, the user node pool could take a long time, so we've set the timeout here as well.

# Module node_pool, shortened for readability
resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
  kubernetes_cluster_id        = var.kubernetes_cluster_id
  name                         = var.name
  orchestrator_version         = var.orchestrator_version
 
  timeouts {
    create = "2h"
    update = "2h"
  }
}
This prevents the Terraform error "polling after CreateOrUpdate: context deadline exceeded".

Pod Disruption Budgets

As mentioned before, pod disruption budgets can block the upgrade process. If you have a PDB that blocks the upgrade, you can either delete it or change its settings. In our case, we had a PDB that blocked the upgrade, so we got the following Terraform error:

"message": "Upgrade is blocked due to invalid Pod Disruption Budgets (PDBs). Please review the PDB spec to allow disruptions during upgrades. To bypass this error, set forceUpgrade in upgradeSettings.overrideSettings. Bypassing this error without updating the PDB may result in drain failures during upgrade process. Invalid PDBs details: 1 error occurred:\n\t* PDB dev/postgresql-primary has minAvailable(1) \u003e= expectedPods(1)  can't proceed with put operation\n\n",

In our case, we decided to force the upgrade. This is done by setting a temporary upgrade override using the Azure CLI:

az aks update --name aks-privatecluster --resource-group rg-privatecluster --enable-force-upgrade --upgrade-override-until 2025-02-24T18:00:00Z

You can check these settings by querying the cluster using the Azure CLI:

azadmin@vm-jumpbox:~$ az aks show --resource-group rg-privatecluster --name aks-privatecluster --query upgradeSettings
{
  "overrideSettings": {
    "forceUpgrade": true,
    "until": "2025-02-24T18:00:00+00:00"
  }
}
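
The override expires automatically at the until timestamp. If you want to remove it earlier, newer versions of the Azure CLI offer a disable flag; a sketch, assuming your CLI version supports it:

az aks update --name aks-privatecluster --resource-group rg-privatecluster --disable-force-upgrade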
Note that the Microsoft documentation is a bit confusing and mixes up two commands. The docs show that you need to set the upgrade override using the az aks update command, after which you can continue with the upgrade itself, which we're doing with Terraform.

Terraform Plan

The output of terraform plan will depend on your environment. Ideally, it only shows the upgrade of the cluster and the node pools. However, some Azure resources might depend on the cluster and will need a refresh or update, which can cause a lot of changes in the plan:

Plan: 19 to add, 5 to change, 19 to destroy.

In our case, this was because we've enabled workload identity on the cluster, and all the extra changes relate to the workload identities and the federated credentials. Normally, this all goes well, so you can continue. We did have one issue with a federated credential, which is covered below under Troubleshooting.
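
Before letting the pipeline apply the change, it can help to review and save the plan locally and verify that the version bump is actually in there. A minimal sketch using the same var file as the pipeline (upgrade.tfplan is just an illustrative file name):

terraform plan -var-file=env/dev.tfvars -out=upgrade.tfplan
terraform show upgrade.tfplan | grep -E 'kubernetes_version|orchestrator_version'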

Terraform Apply

During the terraform apply, the control plane is upgraded first, followed by the system node pool and then the user node pool. You can monitor the upgrade by checking the node status:

azadmin@vm-jumpbox:~$ kubectl get nodes -o wide
NAME                             STATUS                     ROLES   AGE     VERSION    INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-system-34974014-vmss000000   Ready                      agent   76m     v1.28.15   172.16.48.103   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-system-34974014-vmss000001   Ready                      agent   73m     v1.28.15   172.16.48.5     <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-system-34974014-vmss000002   Ready                      agent   67m     v1.28.15   172.16.48.54    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss000001     Ready                      agent   54m     v1.28.15   172.16.64.4     <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss000002     Ready                      agent   43m     v1.28.15   172.16.64.102   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001r     Ready                      agent   38m     v1.28.15   172.16.64.53    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001t     Ready                      agent   33m     v1.28.15   172.16.64.249   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001u     Ready                      agent   27m     v1.28.15   172.16.65.42    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001v     Ready                      agent   12m     v1.28.15   172.16.64.151   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001x     Ready                      agent   7m51s   v1.28.15   172.16.65.91    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001y     Ready                      agent   4m2s    v1.28.15   172.16.65.189   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss00001z     Ready                      agent   25s     v1.28.15   172.16.65.140   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1
aks-user-17016665-vmss000020     Ready,SchedulingDisabled   agent   8d      v1.27.7    172.16.64.200   <none>        Ubuntu 22.04.4 LTS   5.15.0-1061-azure   containerd://1.7.15-1
aks-user-17016665-vmss000022     Ready                      agent   62m     v1.28.15   172.16.65.238   <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1

You can also check the status of the upgrade in the Azure CLI:

azadmin@vm-jumpbox:~$ az aks nodepool show --resource-group rg-privatecluster --cluster-name aks-privatecluster --name user --output table
Name    OsType    KubernetesVersion    VmSize           Count    MaxPods    ProvisioningState    Mode
------  --------  -------------------  ---------------  -------  ---------  -------------------  ------
user    Linux     1.28.15              Standard_D4s_v3  11       50         Upgrading            User

Once the upgrade is finished, all nodes have their version upgraded and the nodepool ProvisioningState is Succeeded:

azadmin@vm-jumpbox:~$ az aks nodepool show --resource-group rg-privatecluster --cluster-name aks-privatecluster --name user --output table
Name    OsType    KubernetesVersion    VmSize           Count    MaxPods    ProvisioningState    Mode
------  --------  -------------------  ---------------  -------  ---------  -------------------  ------
user    Linux     1.28.15              Standard_D4s_v3  10       50         Succeeded            User
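
A quick way to confirm that every node reports the new kubelet version is kubectl's custom columns output; a sketch:

kubectl get nodes -o custom-columns='NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion'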

Troubleshooting

As mentioned before, we had an issue with a federated credential, probably caused by the timeouts we encountered. Once the upgrade was done, we verified everything by running a terraform plan, which reported that one of the federated credentials was missing; however, when running the apply, it said the resource already existed. We fixed this by importing the resource:

# Import federated credential
terraform import -var-file=env/dev.tfvars \
  'module.federated_identity_credentials["fc-grafana"].azurerm_federated_identity_credential.federated_identity_credential' \
  /subscriptions/30b3c71d-a123-a123-a123-abcd12345678/resourceGroups/rg-privatecluster/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-workload-grafana/federatedIdentityCredentials/fc-grafana
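
After the import, the state should contain the credential and a follow-up plan should be clean again. A quick sanity check, using the same module address as the import above:

# Verify the imported resource is now in state
terraform state show 'module.federated_identity_credentials["fc-grafana"].azurerm_federated_identity_credential.federated_identity_credential'
# And confirm the plan no longer wants to create it
terraform plan -var-file=env/dev.tfvars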