contrib/aws: Fix cluster resource leaks for superseded jobs
If a user opens a PR and then pushes to it while AWS CI is still running, Jenkins uses a feature called milestones to cancel the old build. Jenkins first tries to abort the stage gracefully with SIGTERM-like behavior, waits 10 seconds and attempts a second graceful abort, then waits 10 more seconds before killing the stage with SIGKILL-like behavior. This creates a race condition: clusters are sometimes leaked if the cleanup has not finished before the stage is forcefully terminated.

After Jenkins terminates the stages, it always runs the post-build actions. Currently, the post-build actions only call cleanup in one region, but the CI creates clusters in multiple regions. When resources are leaked, the resource reaper in the account cleans them up automatically once they have been alive for a fixed time.

This PR fixes the cleanup race by calling the cleanup logic for every region in the ALWAYS post-build stage. If the cluster has already been cleaned up (successful run), the cleanup call is a no-op. Cleaning up correctly, instead of relying on the resource reaper, is more frugal and also prevents us from hitting our tight account-level AWS EC2 limits.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
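
For illustration only, the "clean up every region, no-op if already clean" idea can be sketched in Python. This is not the code from this commit: the function name, the region list, and the `cluster_exists` / `delete_cluster` callables are hypothetical stand-ins for whatever the CI actually uses to query and tear down a cluster.

```python
from typing import Callable, Dict, Iterable


def cleanup_all_regions(
    regions: Iterable[str],
    cluster_exists: Callable[[str], bool],   # hypothetical: does this build's cluster exist in the region?
    delete_cluster: Callable[[str], object],  # hypothetical: tear down this build's cluster in the region
) -> Dict[str, str]:
    """Attempt cleanup in every region and report what happened per region.

    A region whose cluster is already gone is skipped (no-op), and a failure
    in one region does not stop cleanup from being attempted in the others.
    """
    results: Dict[str, str] = {}
    for region in regions:
        try:
            if not cluster_exists(region):
                # Successful runs have already cleaned up: nothing to do.
                results[region] = "absent"
                continue
            delete_cluster(region)
            results[region] = "deleted"
        except Exception as exc:
            # Record the failure and keep going so other regions still get cleaned.
            results[region] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    # In-memory stand-in for "regions that still have a leaked cluster".
    leaked = {"us-east-1", "eu-north-1"}
    outcome = cleanup_all_regions(
        ["us-east-1", "us-west-2", "eu-north-1"],
        cluster_exists=lambda region: region in leaked,
        delete_cluster=leaked.discard,
    )
    print(outcome)
```

Because the function checks for existence before deleting and isolates errors per region, calling it unconditionally from an always-run post-build step is safe under these assumptions, whether the build succeeded, failed, or was superseded.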