
Support for toggle to force delete azurerm_kubernetes_cluster_node_pool #10411

Closed
t3mi opened this issue Feb 1, 2021 · 10 comments
Labels
enhancement · sdk/not-yet-supported (Support for this does not exist in the upstream SDK at this time) · service/kubernetes-cluster · upstream/microsoft/needs-support-on-azure-api (This label is applicable when support for a feature is not currently available on the Azure API.)
Milestone
Blocked

Comments

@t3mi
Contributor

t3mi commented Feb 1, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

In the case of a small node pool, or one with taints/tolerations that prevent rescheduling of pods protected by a pod disruption budget, the following error occurs during node pool removal:

Error: waiting for the deletion of Node Pool "vmss" (Managed Kubernetes Cluster "aks-test" / Resource Group "aks-test-rg"): Code="DeleteVMSSAgentPoolFailed" Message="VMSSAgentPoolReconciler retry failed: {\n  \"code\": \"PodDrainFailure\",\n  \"message\": \"Node 'aks-vmss-17400673-vmss000001' failed to be drained with error: 'Drain did not complete pods [<pod-name>] within 10m0s'\"\n }"
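
For context, a drain fails whenever evicting a pod would violate its pod disruption budget. As a minimal sketch (using the hashicorp/kubernetes provider; names and labels are illustrative), a budget like the following on a single-replica workload disallows every eviction, so draining a pool too small to reschedule the pod elsewhere times out exactly as above:

resource "kubernetes_pod_disruption_budget_v1" "example" {
  metadata {
    name      = "example-pdb"
    namespace = "default"
  }

  spec {
    # With a single replica and min_available = 1, no voluntary eviction is
    # ever permitted, so the drain (and hence the node pool delete) times out.
    min_available = "1"

    selector {
      match_labels = {
        app = "example"
      }
    }
  }
}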

New or Affected Resource(s)

  • azurerm_kubernetes_cluster_node_pool

Potential Terraform Configuration

# Configure the Azure Provider
provider "azurerm" {
  features {
    azurerm_kubernetes_cluster_node_pool {
      disable_eviction = true
    }
  }
}

References

@tombuildsstuff
Contributor

hi @t3mi

Thanks for opening this issue.

Unfortunately the Azure API doesn't currently expose a means of draining a Node Pool - as such, the Provider attempts to destroy the Node Pool using the Delete API and waits for that to complete.

Whilst it's possible that the AKS API may provide an option to configure this in the future, it doesn't today - and since there's no API to support this behaviour, the AKS Service Team would need to expose a means of controlling it. As such I'm going to suggest opening an issue on the AKS Repository instead, where someone from that team should be able to take a look - once it's available we can look at integrating this into Terraform.

Thanks!

@tombuildsstuff added the enhancement, sdk/not-yet-supported, service/kubernetes-cluster, and upstream/microsoft (indicates that there's an upstream issue blocking this issue/PR) labels Feb 2, 2021
@tombuildsstuff added this to the Blocked milestone Feb 2, 2021

@treyhendon

Would it be possible to issue a delete on the AKS instance first (before the pools) and then have the pools refresh their TF status before issuing their delete (if they still exist)?

My struggle is with completely tearing down an environment, where I hit this same error. My end goal is for all of the assets to go away. In the UI, I can just delete AKS and it deletes the custom node pools for me. It seems like Terraform is issuing the delete command on my custom node pools before issuing the delete command to the AKS instance.
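
One way to approximate the portal behaviour in the meantime (a sketch, untested; the resource address is illustrative) is to drop the node pools from state so that Terraform only deletes the cluster, and Azure removes the pools along with it:

terraform state rm 'azurerm_kubernetes_cluster_node_pool.custom'
terraform destroy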

@r3mattia

I can confirm this, as I am encountering the same issue when destroying a module with:

  • AKS resource with default node pool
  • 2x additional node pools associated with the cluster

It tries to destroy the node pools and the cluster simultaneously for a few minutes before eventually returning the same error.

@r3mattia

However, it turns out that after re-running terraform destroy shortly after this error, the cluster was destroyed, whilst the node pools remained in the "Still destroying..." state.

@treyhendon

As a workaround for others who find this: we added several kubectl commands to delete all non-system/AKS namespaces in the pipeline, prior to executing the Terraform step that destroys the environment. It slows the process down, but ensures a successful TF destroy.
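
A sketch of that cleanup step (the allow-list of namespaces to keep is an assumption and will differ per cluster):

kubectl get namespaces -o name \
  | grep -vE '^namespace/(kube-system|kube-public|kube-node-lease|default|gatekeeper-system)$' \
  | xargs -r kubectl delete --wait=true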

@aslafy-z
Contributor

aslafy-z commented Oct 23, 2023

resource "azurerm_kubernetes_cluster_node_pool" "infra" {
[...]

  node_labels = {
    "example.com/cluster-name" = azurerm_kubernetes_cluster.main.name,
    "example.com/resource-group" = azurerm_kubernetes_cluster.main.resource_group_name,
  }

  provisioner "local-exec" {
    when        = destroy
    command     = <<-EOF
      az aks command invoke -g "${self.node_labels["example.com/resource-group"]}" -n "${self.node_labels["example.com/cluster-name"]}" -c "kubectl drain -l 'kubernetes.azure.com/agentpool=${self.name}' --ignore-daemonsets --delete-emptydir-data --timeout=10m || kubectl drain -l 'kubernetes.azure.com/agentpool=${self.name}' --ignore-daemonsets --delete-emptydir-data --timeout=10m --disable-eviction=true"
    EOF

    interpreter = ["/usr/bin/env", "bash", "-c"]
  }

Here's a workaround that uses the AKS Run Command primitive, which triggers jobs inside the cluster. Before the AKS API is called to destroy the pool, it executes a drain; if after 10 minutes some pods are still blocked by a PDB, it drains again with forced eviction.
In order for the run command itself not to fail, these jobs must run on the default node pool rather than the pool being drained, which I configured by annotating the namespace as follows:

resource "azurerm_kubernetes_cluster" "main" {
[...]

  run_command_enabled = true

  provisioner "local-exec" {
    command     = <<-EOF
      az aks command invoke -g "${self.resource_group_name}" -n "${self.name}" -c "kubectl create ns aks-command && kubectl annotate ns/aks-command scheduler.alpha.kubernetes.io/node-selector=kubernetes.azure.com/agentpool=${self.default_node_pool[0].name}"
    EOF

    interpreter = ["/usr/bin/env", "bash", "-c"]
  }
}

This is clearly not ideal but will do the trick until AKS supports it natively (Azure/AKS#2090).

@rcskosir rcskosir added upstream/microsoft/needs-support-on-azure-api This label is applicable when support for a feature is not currently available on the Azure API. and removed upstream/microsoft Indicates that there's an upstream issue blocking this issue/PR labels Mar 15, 2024
@rcskosir
Contributor

Thanks for taking the time to open this issue. It looks like the behavior you requested is not supported by the underlying Azure API so I am going to label this issue as such and close it for now. When it gets added, we can reopen this request or you can create a new one.
Azure/AKS#2090

@rcskosir closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 15, 2024
@TomBillietKlarrio

I'm a bit confused here:

  • when I delete a nodepool via the Azure UI, it respects the PDB
  • when I delete a nodepool via Terraform, it ignores the PDB

When I turn on debug logging in terraform, I see this happening on a delete of a nodepool:

2024-04-10T16:22:31.863+0200 [DEBUG] provider.terraform-provider-azurerm_v3.95.0_x5: DELETE https://management.azure.com/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx/agentPools/tom1c50?api-version=2023-06-02-preview&ignore-pod-disruption-budget=true: timestamp=2024-04-10T16:22:31.863+0200

So that API seems to indicate there is a parameter for this (ignore-pod-disruption-budget=true)?
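
For reference, the same DELETE can be reproduced outside Terraform (a sketch assembled only from the URL in the debug log above; the placeholders and the preview api-version come from that log, not from an official interface description):

az rest --method delete \
  --url "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerService/managedClusters/<cluster-name>/agentPools/<pool-name>?api-version=2023-06-02-preview&ignore-pod-disruption-budget=true"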

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions locked as resolved and limited conversation to collaborators May 11, 2024