
Support for toggle to force delete azurerm_kubernetes_cluster_node_pool #10411

Closed
t3mi opened this issue Feb 1, 2021 · 10 comments
Labels
enhancement · sdk/not-yet-supported (Support for this does not exist in the upstream SDK at this time) · service/kubernetes-cluster · upstream/microsoft/needs-support-on-azure-api (This label is applicable when support for a feature is not currently available on the Azure API.)
Milestone
Blocked

Comments

@t3mi
Contributor

t3mi commented Feb 1, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

In the case of a small node pool, or one with taints/tolerations that prevent rescheduling of pods protected by a pod disruption budget, the following error occurs during node pool removal:

Error: waiting for the deletion of Node Pool "vmss" (Managed Kubernetes Cluster "aks-test" / Resource Group "aks-test-rg"): Code="DeleteVMSSAgentPoolFailed" Message="VMSSAgentPoolReconciler retry failed: {\n  \"code\": \"PodDrainFailure\",\n  \"message\": \"Node 'aks-vmss-17400673-vmss000001' failed to be drained with error: 'Drain did not complete pods [<pod-name>] within 10m0s'\"\n }"
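
For context, a drain fails whenever evicting a pod would violate its pod disruption budget. As a minimal sketch (using the hashicorp/kubernetes provider; names and labels are illustrative), a budget like the following on a single-replica workload disallows every eviction, so draining a pool too small to reschedule the pod elsewhere times out exactly as above:

resource "kubernetes_pod_disruption_budget_v1" "example" {
  metadata {
    name      = "example-pdb"
    namespace = "default"
  }

  spec {
    # With a single replica and min_available = 1, no voluntary eviction is
    # ever permitted, so the drain (and hence the node pool delete) times out.
    min_available = "1"

    selector {
      match_labels = {
        app = "example"
      }
    }
  }
}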

New or Affected Resource(s)

  • azurerm_kubernetes_cluster_node_pool

Potential Terraform Configuration

# Configure the Azure Provider
provider "azurerm" {
  features {
    azurerm_kubernetes_cluster_node_pool {
      disable_eviction = true
    }
  }
}

References

@tombuildsstuff
Contributor

hi @t3mi

Thanks for opening this issue.

Unfortunately the Azure API doesn't currently expose a means of draining a Node Pool - as such, the Provider attempts to destroy the Node Pool using the Delete API and waits for that to complete.

Whilst it's possible that the AKS API may provide an option to configure this in the future, it doesn't today - and since there's no API to support this behaviour, the AKS Service Team would need to expose a means of controlling it. As such I'm going to suggest opening an issue on the AKS Repository instead, where someone from that team should be able to take a look - once it's available we can look at integrating this into Terraform.

Thanks!

@tombuildsstuff added the enhancement, sdk/not-yet-supported, service/kubernetes-cluster, and upstream/microsoft (indicates that there's an upstream issue blocking this issue/PR) labels Feb 2, 2021
@tombuildsstuff added this to the Blocked milestone Feb 2, 2021

@treyhendon

Would it be possible to issue a delete on the AKS instance first (before the pools) and then have the pools refresh their TF status before issuing their delete (if they still exist)?

My struggle is with completely tearing down an environment, where I hit this same error. My end goal is for all of the assets to go away. In the UI, I can just delete AKS and it deletes the custom node pools for me. It seems like Terraform is issuing the delete command on my custom node pools before issuing the delete command to the AKS instance.
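
One way to approximate the portal behaviour in the meantime (a sketch, untested; the resource address is illustrative) is to drop the node pools from state so that Terraform only deletes the cluster, and Azure removes the pools along with it:

terraform state rm 'azurerm_kubernetes_cluster_node_pool.custom'
terraform destroy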

@r3mattia

I can confirm this, as I am encountering the same issue when destroying a module with:

  • AKS resource with default node pool
  • 2x additional node pools associated with the cluster

It tries to destroy the node pools and the cluster simultaneously for a few minutes before eventually returning the same error.

@r3mattia

However, it turns out that after re-running terraform destroy shortly after this error, the cluster was destroyed, whilst the node pools remained in the "Still destroying..." state.

@treyhendon

As a workaround for others who find this: we added several kubectl commands to delete all non-system/AKS namespaces in the pipeline, prior to executing the Terraform step that destroys the environment. It slows the process down, but ensures a successful TF destroy.
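
A sketch of that cleanup step (the allow-list of namespaces to keep is an assumption and will differ per cluster):

kubectl get namespaces -o name \
  | grep -vE '^namespace/(kube-system|kube-public|kube-node-lease|default|gatekeeper-system)$' \
  | xargs -r kubectl delete --wait=true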

@aslafy-z
Contributor

aslafy-z commented Oct 23, 2023

resource "azurerm_kubernetes_cluster_node_pool" "infra" {
[...]

  node_labels = {
    "example.com/cluster-name" = azurerm_kubernetes_cluster.main.name,
    "example.com/resource-group" = azurerm_kubernetes_cluster.main.resource_group_name,
  }

  provisioner "local-exec" {
    when        = destroy
    command     = <<-EOF
      az aks command invoke -g "${self.node_labels["example.com/resource-group"]}" -n "${self.node_labels["example.com/cluster-name"]}" -c "kubectl drain -l 'kubernetes.azure.com/agentpool=${self.name}' --ignore-daemonsets --delete-emptydir-data --timeout=10m || kubectl drain -l 'kubernetes.azure.com/agentpool=${self.name}' --ignore-daemonsets --delete-emptydir-data --timeout=10m --disable-eviction=true"
    EOF

    interpreter = ["/usr/bin/env", "bash", "-c"]
  }

Here's a workaround that uses the AKS Run Command primitive, which triggers jobs inside the cluster. Before the AKS API is called to destroy the pool, it executes a drain; if after 10 minutes some pods are still blocked by a PDB, it drains again with forced eviction.
In order for the run command itself not to fail, these jobs must run on the default node pool rather than the pool being drained, which I configured by annotating the namespace as follows:

resource "azurerm_kubernetes_cluster" "main" {
[...]

  run_command_enabled = true

  provisioner "local-exec" {
    command     = <<-EOF
      az aks command invoke -g "${self.resource_group_name}" -n "${self.name}" -c "kubectl create ns aks-command && kubectl annotate ns/aks-command scheduler.alpha.kubernetes.io/node-selector=kubernetes.azure.com/agentpool=${self.default_node_pool[0].name}"
    EOF

    interpreter = ["/usr/bin/env", "bash", "-c"]
  }
}

This is clearly not ideal but will do the trick until AKS supports it natively (Azure/AKS#2090).

@rcskosir rcskosir added upstream/microsoft/needs-support-on-azure-api This label is applicable when support for a feature is not currently available on the Azure API. and removed upstream/microsoft Indicates that there's an upstream issue blocking this issue/PR labels Mar 15, 2024
@rcskosir
Contributor

Thanks for taking the time to open this issue. It looks like the behavior you requested is not supported by the underlying Azure API so I am going to label this issue as such and close it for now. When it gets added, we can reopen this request or you can create a new one.
Azure/AKS#2090

@rcskosir closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 15, 2024
@TomBillietKlarrio

I'm a bit confused here:

  • when I delete a nodepool via the Azure UI, it respects the PDB
  • when I delete a nodepool via Terraform, it ignores the PDB

When I turn on debug logging in terraform, I see this happening on a delete of a nodepool:

2024-04-10T16:22:31.863+0200 [DEBUG] provider.terraform-provider-azurerm_v3.95.0_x5: DELETE https://management.azure.com/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx/agentPools/tom1c50?api-version=2023-06-02-preview&ignore-pod-disruption-budget=true: timestamp=2024-04-10T16:22:31.863+0200

So that API seems to indicate there is a parameter for this (ignore-pod-disruption-budget=true)?
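
For reference, the same DELETE can be reproduced outside Terraform (a sketch assembled only from the URL in the debug log above; the placeholders and the preview api-version come from that log, not from an official interface description):

az rest --method delete \
  --url "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerService/managedClusters/<cluster-name>/agentPools/<pool-name>?api-version=2023-06-02-preview&ignore-pod-disruption-budget=true"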

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions locked as resolved and limited conversation to collaborators May 11, 2024