🚚 Migrate Airflow workloads to APC #4490

Closed
11 tasks done
jacobwoffenden opened this issue Jun 6, 2024 · 24 comments

Comments

@jacobwoffenden
Member

jacobwoffenden commented Jun 6, 2024

User Story

As an Analytical Platform engineer
I want (current) Airflow jobs to schedule on APC
So that we can fully retire the Airflow EKS clusters

Value / Purpose

Airflow EKS clusters are partially managed in Terraform, pinned to IMDSv1, use kube2iam, and have no observability 😭

Migrating these workloads to APC will allow us to retire more clusters and make use of the newer capabilities in EKS and the supported tooling.

Useful Contacts

@jacobwoffenden

User Types

Platform Engineering

Hypothesis

If we... [do a thing]
Then... [this will happen]

Proposal

Migrate Airflow workloads to APC

Additional Information

This was started in DPAT #2843 but never completed

Blocked by:

Definition of Done

  • Enumerate DAGs that use TGW (these need to be carefully migrated)
  • Create Airflow resources in APC
    • Managed Node Groups
      • standard (this might be OK in the general node group)
      • high-memory
    • Namespace
    • Access entry
  • Test scheduling
  • Update IRSA roles
  • Grant APC access to APDP ECR
  • Cut over TGW (🤞 this should be OK because the IP range is the same)
@jacobwoffenden
Member Author

Blocked while Airflow component is being worked on

@jacobwoffenden jacobwoffenden moved this from 🚫 Blocked to 🚀 In Progress in Analytical Platform Jun 6, 2024
@jacobwoffenden
Member Author

Comms sent to ask-data-engineering with a sheet to fill in: https://docs.google.com/spreadsheets/d/1B8DOsSgnxGV1FjRv8dLv0wqDMo2RiiMqedFogLBpQEQ

@jacobwoffenden
Member Author

Moving back to blocked while IRSA is being worked on

@jacobwoffenden
Member Author

jacobwoffenden commented Jun 13, 2024

I've cut a new release of the cross-account-ecr action and published a new version of template-airflow-python, which uses the new v1 action and correctly adds the APC accounts to the repository policy.

I then updated the example DAG to use the new image version and the APC dev context (https://github.com/moj-analytical-services/airflow/pull/3613). Below is the output from running it: the task fails because it can't use IRSA yet, but the image still pulls.

vscode ➜ /workspaces/modernisation-platform-environments (main) [ aws: analytical-platform-compute-development:modernisation-platform-sandbox@eu-west-2 ] [ context: arn:aws:eks:eu-west-2:381491960855:cluster/analytical-platform-compute-development ] $ kubectl --namespace airflow get events                                     
LAST SEEN   TYPE     REASON      OBJECT                                        MESSAGE
59s         Normal   Scheduled   pod/task-1-cecda48866f94f90a3357d96206822b6   Successfully assigned airflow/task-1-cecda48866f94f90a3357d96206822b6 to ip-10-200-33-237.eu-west-2.compute.internal
58s         Normal   Pulling     pod/task-1-cecda48866f94f90a3357d96206822b6   Pulling image "189157455002.dkr.ecr.eu-west-1.amazonaws.com/template-airflow-python:v0.4"
53s         Normal   Pulled      pod/task-1-cecda48866f94f90a3357d96206822b6   Successfully pulled image "189157455002.dkr.ecr.eu-west-1.amazonaws.com/template-airflow-python:v0.4" in 5.264s (5.264s including waiting). Image size: 76701464 bytes.
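
For reference, a cross-account pull like the one above depends on the ECR repository policy naming the pulling account as a principal. The following is a minimal Python sketch of such a policy, not the policy the cross-account-ecr action actually writes. It assumes the three standard ECR pull actions; the function name and `Sid` are hypothetical, and the account ID is taken from the cluster ARN in the prompt above.

```python
import json

# Standard ECR actions required to pull an image.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def cross_account_pull_policy(account_ids):
    """Build a repository policy that lets the given accounts pull images.

    Hypothetical helper for illustration only.
    """
    principals = [f"arn:aws:iam::{account}:root" for account in sorted(account_ids)]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPull",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": PULL_ACTIONS,
            }
        ],
    }


# Account ID from the analytical-platform-compute-development cluster ARN.
policy = cross_account_pull_policy(["381491960855"])
print(json.dumps(policy, indent=2))
```

Granting access to the account root principal delegates the decision to IAM in the pulling account; a tighter policy could name specific node or task roles instead.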

@jacobwoffenden
Member Author

APC OIDC added to APDP

@jacobwoffenden
Member Author

We've tested @AntFMoJ's toy DAG on APC with cross-account IRSA and it's working 🎉

Unfortunately we are now blocked in discussion with Modernisation Platform about reuse of network ranges.
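
As a rough illustration of the cross-account IRSA setup tested above: once the cluster's OIDC provider is registered in the account that owns the role, the role's trust policy can allow a specific Kubernetes service account to call `sts:AssumeRoleWithWebIdentity`. The Python sketch below builds such a trust policy. Only the `airflow` namespace comes from this thread; the account ID, OIDC issuer and service account name are placeholders, and the helper is hypothetical rather than the roles actually deployed.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Trust policy letting one Kubernetes service account assume a role via OIDC.

    account_id is the account that owns the role and in which the
    cluster's OIDC provider has been registered.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{oidc_issuer}:aud": "sts.amazonaws.com",
                        f"{oidc_issuer}:sub": (
                            f"system:serviceaccount:{namespace}:{service_account}"
                        ),
                    }
                },
            }
        ],
    }


# All values below are placeholders except the "airflow" namespace.
trust = irsa_trust_policy(
    account_id="111122223333",
    oidc_issuer="oidc.eks.eu-west-2.amazonaws.com/id/EXAMPLE",
    namespace="airflow",
    service_account="example-dag",
)
print(json.dumps(trust, indent=2))
```

Pinning the `sub` claim to one namespace and service account keeps each DAG's role from being assumed by other pods on the cluster.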

@jacobwoffenden
Member Author

Updates:

@jacobwoffenden
Member Author

Moving to blocked while we figure out how to proceed with Direct Connect.

@jacobwoffenden
Member Author

Meeting with HMCTS' network architect on 11/07/24 @ 11:30 BST

@darren1988

Escalated to HMCTS head of DTS people and profession on 24th July 2024. Our ask has now been raised with the lead PlatOps in HMCTS. Currently awaiting a response. If there is no movement by the end of the week, we will escalate to Martyn.

@jacobwoffenden
Member Author

Meeting held with DTS PlatOps on 5/8/24, and the request has been escalated. Waiting for a meeting to be arranged with HMCTS stakeholders.

@darren1988

Meeting with HMCTS arranged for 5/9/24 to discuss scope of work

@jacobwoffenden
Member Author

Had a meeting with HMCTS; they are going to put us in touch with CloudGateway

@jacobwoffenden
Member Author

Sent chaser emails on 15/10 and 22/10

@jacobwoffenden
Member Author

Meeting arranged with CloudGateway for 4/11

@jacobwoffenden
Member Author

VPN endpoint data sent to CloudGateway, awaiting response

@jacobwoffenden
Member Author

Updated VPN configuration parameters and sent over. Apparently we are waiting on commercials too.

@jacobwoffenden
Member Author

Pencilled some time in on Thursday 21/11 to bridge with CGW

@jacobwoffenden
Member Author

Nonprod was cut over on 21/11 🎉 Prod is being arranged for 27/11

@jacobwoffenden
Member Author

@jacobwoffenden
Member Author

A rough cut of dev:

Total number of items: 207
Number of items migrated: 34 (16.43%)
Number of items paused: 43 (20.77%)
Number of items exempt: 20 (9.66%)
Number of items run in the last 90 days: 93 (44.93%)
Number of items that are not paused, not migrated, and have run in the last 90 days: 73

@jacobwoffenden
Member Author

A rough cut of prod:

Total number of items: 130
Number of items migrated: 45 (34.62%)
Number of items paused: 25 (19.23%)
Number of items exempt: 31 (23.85%)
Number of items run in the last 90 days: 95 (73.08%)
Number of items that are not paused, not migrated, and have run in the last 90 days: 53
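
The percentages in the dev and prod rough cuts can be reproduced from the raw counts, and the final "still to migrate" figure is a filter over per-DAG records. Below is a minimal Python sketch of both calculations. The `Dag` record and its field names are assumptions made for illustration; the thread does not say how the figures were actually produced.

```python
from dataclasses import dataclass


@dataclass
class Dag:
    """Hypothetical per-DAG record; field names are illustrative."""
    name: str
    migrated: bool
    paused: bool
    ran_last_90_days: bool


def pct(count, total):
    """Percentage of total, rounded to two decimal places."""
    return round(100 * count / total, 2)


def still_to_migrate(dags):
    """DAGs that are not paused, not migrated, and ran in the last 90 days."""
    return [d for d in dags if not d.paused and not d.migrated and d.ran_last_90_days]


# Counts reported in the rough cuts above.
dev = {"total": 207, "migrated": 34, "paused": 43, "exempt": 20, "recent": 93}
prod = {"total": 130, "migrated": 45, "paused": 25, "exempt": 31, "recent": 95}

for env_name, counts in (("dev", dev), ("prod", prod)):
    for key in ("migrated", "paused", "exempt", "recent"):
        print(f"{env_name} {key}: {counts[key]} ({pct(counts[key], counts['total'])}%)")

# Made-up records showing how the final figure would be derived.
sample = [
    Dag("a", migrated=True, paused=False, ran_last_90_days=True),
    Dag("b", migrated=False, paused=True, ran_last_90_days=True),
    Dag("c", migrated=False, paused=False, ran_last_90_days=True),
    Dag("d", migrated=False, paused=False, ran_last_90_days=False),
]
print([d.name for d in still_to_migrate(sample)])
```

The last figure in each rough cut cannot be recovered from the aggregate counts alone, because the categories overlap; it needs the per-DAG flags.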

@jacobwoffenden
Member Author

Internal networking has been cut over to APC

@jacobwoffenden jacobwoffenden moved this from 🚫 Blocked to 🛂 In Review in Analytical Platform Dec 2, 2024
@jacobwoffenden
Member Author

The scope of this ticket is quite big. I suggest we close this one and create smaller tickets for the remaining tasks.

@github-project-automation github-project-automation bot moved this from 🛂 In Review to 🎉 Done in Analytical Platform Dec 2, 2024

2 participants