This is a solution for building a complete, production-ready MLOps pipeline for a typical ML system. The example could be useful for any engineer or organization looking to operationalize ML with native AWS development tools such as CodePipeline, CodeBuild, and ECS.

The main use cases of this solution are:
- Your team wants to deploy the infrastructure for an ML endpoint of an ML system. Let's call this endpoint `A`.
- After that, your team wants to deploy another ML endpoint of the same ML system into the existing infrastructure. Let's call this endpoint `B`.
In the following diagram, you can view the continuous delivery stages of the system.
- Developers push code to trigger the CodePipeline
- The CodePipeline runs a CodeBuild job that applies the CloudFormation templates to create resources (on the first run) or update them (on subsequent runs)
- CodePipeline: defines the stages and actions, and the order in which they run, to go from source code to the production resources.
- CodeBuild: builds the source code from GitHub and runs CloudFormation templates.
- CloudFormation (CF): creates resources from YAML templates.
- Elastic Container Registry (ECR): stores docker images.
- Elastic Container Service (ECS): groups container instances on which we can run task requests.
- Elastic File System (EFS): stores user request data and model weights.
- Application Load Balancer (ALB): distributes incoming application traffic across multiple target groups in ECS across Availability Zones. It monitors the health of its registered targets and routes traffic only to the healthy targets.
- Route 53: connects user requests to infrastructure running in AWS, in our case, the ALB. In this project, we use another domain provider to route the traffic at the domain level.
- AWS Certificate Manager (ACM): provisions, manages, and deploys public and private Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates for use with AWS services.
- Virtual Private Cloud (VPC): controls our virtual networking environment, including resource placement, connectivity, and security.
- CloudWatch: collects monitoring and operational data in the form of logs, metrics, and events.
- Simple Notification Service (SNS): a managed messaging service for both application-to-application (A2A) and application-to-person (A2P) communication. In this project, we don't configure SNS.
## 1. Create the infrastructure for endpoint A

Creating the infrastructure for endpoint A has several steps:

- Create the CloudFormation stack
- Create CodePipeline and CodeBuild projects
- Validate the resources' permissions and states
- Upload the model's weights

This infrastructure is reusable for the other endpoint.
### 1.1. Create the CloudFormation stack

The CloudFormation stack creates the ECR repository. The CodePipeline and CodeBuild projects that we will create in the later steps depend on this ECR repository.
- Set the parameter `DesiredCount` in `cf_templates/create-ep-a.json` to 0 to avoid the error `docker image is not ready` when the CloudFormation stack is created, because at that time the ECR repository doesn't exist yet. Set the other parameters as well.
- Run:

```bash
aws cloudformation create-stack --stack-name=create-ep-a --template-body file://cf_templates/create-ep-a.yaml --parameters file://cf_templates/create-ep-a.json --capabilities CAPABILITY_NAMED_IAM
```
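The CloudFormation CLI expects the `--parameters` file to be a JSON array of key/value pairs. The sketch below writes a hypothetical version of it; `DesiredCount`, `APITag`, and `HostHeaderApi` appear elsewhere in this guide, but the values and any other parameters are assumptions — check `cf_templates/create-ep-a.yaml` for the real parameter list.

```shell
# Sketch of a CloudFormation parameters file (names/values partly assumed).
# DesiredCount starts at 0 because the ECR image does not exist yet.
cat > params-sketch.json <<'EOF'
[
  {"ParameterKey": "DesiredCount", "ParameterValue": "0"},
  {"ParameterKey": "APITag", "ParameterValue": "<git-commit-hash>"},
  {"ParameterKey": "HostHeaderApi", "ParameterValue": "api.example.com"}
]
EOF
echo "wrote params-sketch.json"
```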
### 1.2. Create CodePipeline and CodeBuild projects

This step creates the CodePipeline and CodeBuild projects manually. In the next version of this tutorial, this step should be defined in a CloudFormation template.
- Update the `buildspec/ep-a.yaml` file and push the code
- Go to AWS Console > CodePipeline > Create a new pipeline
- Configure pipeline settings
  - Pipeline name: `ep-a`
  - Select `New service role`
- Add source stage
  - Connect to GitHub, select the repository and branch name
  - Enable `Start the pipeline on source code change`
  - Output artifact format: `Full clone`
- Add build stage
  - Build provider: `AWS CodeBuild`
  - Select your region
  - Create a CodeBuild project with the following settings:
    - Project name: `ep-a`
    - Environment:
      - System: `Ubuntu`
      - Runtime: `Standard`
      - Image: `standard:4.0`
      - Environment type: `Linux`
      - Privileged: `Enabled`
    - Create a new service role
    - Buildspec: use a buildspec file
      - Buildspec name: `./buildspec/ep-a.yaml`
    - Build type: `Single build`
- Skip deploy stage
- Review and create the CodePipeline
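The `buildspec/ep-a.yaml` file itself is not shown in this guide. A hypothetical minimal sketch of what it might contain is below — log in to ECR, build the endpoint image, and push it. The account ID, region, and repository name are placeholders, and the project's real buildspec may differ.

```shell
# Write a hypothetical buildspec sketch; all names/IDs are placeholders.
cat > ep-a-buildspec-sketch.yaml <<'EOF'
version: 0.2
phases:
  pre_build:
    commands:
      - aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
  build:
    commands:
      - docker build -t ep-a:latest .
      - docker tag ep-a:latest <account-id>.dkr.ecr.<region>.amazonaws.com/ep-a:latest
  post_build:
    commands:
      - docker push <account-id>.dkr.ecr.<region>.amazonaws.com/ep-a:latest
EOF
echo "wrote ep-a-buildspec-sketch.yaml"
```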
### 1.3. Validate resources

Follow the section *Add CodeBuild GitClone permissions for connections to Bitbucket, GitHub, or GitHub Enterprise Server* at this link.

Make sure the health check path parameter in `cf_templates/create-ep-a.json` is correct.
Manually add the ECS task's security group of endpoint A to the inbound rules of the security group created for the EFS shared volumes. Without this, the ECS instances cannot access the EFS shared volumes. Check this article for more detail.
This step can be automated by creating an AWS Lambda function to run the validation task.
### 1.4. Update CloudFormation stack

- Set the parameter `DesiredCount` in `cf_templates/create-ep-a.json` to the expected value.
- Set the parameter `APITag` to the latest git commit hash in the `master` branch.
- Update the stack. This might take ~10 minutes.

```bash
aws cloudformation update-stack --stack-name=<stack-name> --template-body file://cf_templates/create-ep-a.yaml --parameters file://cf_templates/create-ep-a.json --capabilities CAPABILITY_NAMED_IAM
```
- Validate the resources
  - ECS: check services, tasks, instances, and instances' logs
  - CloudFormation: check the created resources and stack status
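The same checks can be done from the terminal instead of the console. The snippet below only prints the standard AWS CLI inspection commands (so you can review them before running); the cluster and service names are assumptions — substitute your own.

```shell
# Compose inspection commands for a hypothetical cluster/service.
CLUSTER=ep-a-cluster   # assumption: your ECS cluster name
SERVICE=ep-a-service   # assumption: your ECS service name
CMDS="aws ecs describe-services --cluster $CLUSTER --services $SERVICE
aws ecs list-tasks --cluster $CLUSTER --service-name $SERVICE
aws cloudformation describe-stacks --stack-name create-ep-a
aws cloudformation list-stack-resources --stack-name create-ep-a"
echo "$CMDS"
```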
### 1.5. Upload model's weights

- Create an S3 folder like `s3://<bucket-name>/<folder-name>`
- Upload the model's weights to this folder
- Mount the EFS shared weights folder to an EC2 bastion instance (check the section *Miscellaneous* below for instructions on creating this EC2 instance). Check the CloudFormation stack's resources for the EFS shared volume's ID
- Run `s3 sync`:

```bash
sudo aws s3 sync s3://<bucket-name>/<folder-name> /mnt/<MOUNTED_FOLDER> --delete
```

- Add the command above as a cronjob of the `sudo` user:

```bash
sudo crontab -e
# Add this line to perform the synchronization every 3 minutes
*/3 * * * * aws s3 sync s3://<bucket-name>/<folder-name> /mnt/<MOUNTED_FOLDER> --delete
```
### 1.6. Test endpoint

- Go to the domain service provider (or Route 53) and add the expected record (e.g. the `HostHeaderApi` parameter in `cf_templates/create-ep-a.json`) to point to the DNS name of the ALB created by the CloudFormation stack
- Test the API by sending a request to the expected URL
- Mount the shared assets EFS folder to the EC2 bastion instance to validate that the data is stored correctly
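A quick smoke test can be done with `curl`. The snippet below only composes and prints the commands; `api.example.com` and `/healthcheck` are placeholders for your `HostHeaderApi` value and health check path. The second form hits the ALB directly with the `Host` header the listener rule matches, which is handy before the DNS record has propagated.

```shell
HOST=api.example.com   # assumption: your HostHeaderApi value
# 1) via the DNS record you created
echo "curl -sf https://$HOST/healthcheck"
# 2) directly against the ALB, setting the Host header the listener rule matches
echo "curl -sf -H \"Host: $HOST\" http://<alb-dns-name>/healthcheck"
```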
## 2. Add endpoint B into the existing infrastructure

Adding endpoint B into the existing infrastructure involves steps similar to creating the infrastructure for endpoint A.
### 2.1. Create the CloudFormation stack

- Set the parameter `DesiredCount` in `cf_templates/add-ep-b.json` to 0.
- Set the parameter `ListernerRulePriority` to `latest priority + 1`, where the latest priority is the highest listener rule priority currently in use in the ALB.
- Run:

```bash
aws cloudformation create-stack --stack-name=add-ep-b --template-body file://cf_templates/add-ep-b.yaml --parameters file://cf_templates/add-ep-b.json --capabilities CAPABILITY_NAMED_IAM
```
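The `latest priority + 1` value can be computed from the existing listener rules. The sketch below does the arithmetic on a hard-coded sample list; in practice the priorities would come from `aws elbv2 describe-rules --listener-arn <arn> --query 'Rules[].Priority' --output text`, which also returns the literal string `default` for the default rule, so non-numeric entries are skipped.

```shell
# Sample priorities standing in for describe-rules output (assumption).
PRIORITIES="1 5 default 3"
MAX=0
for p in $PRIORITIES; do
  case $p in
    *[!0-9]*) continue ;;   # skip non-numeric entries such as "default"
  esac
  if [ "$p" -gt "$MAX" ]; then MAX=$p; fi
done
NEXT=$((MAX + 1))
echo "$NEXT"   # → 6
```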
### 2.2. Create CodePipeline and CodeBuild projects

This step is similar to section *1.2. Create CodePipeline and CodeBuild projects*. For the buildspec file, just clone the `buildspec/ep-a.yaml` file for endpoint B.

### 2.3. Validate resources

This step is similar to section *1.3. Validate resources*.

### 2.4. Update CloudFormation stack

This step is similar to section *1.4. Update CloudFormation stack*.

### 2.5. Upload model's weights

This step is similar to section *1.5. Upload model's weights*.

### 2.6. Test endpoint

This step is similar to section *1.6. Test endpoint*.
## Miscellaneous

### Update the CloudFormation template manually

In the future, when you want to update and validate the CloudFormation template without triggering the CodePipeline, either manually or automatically, follow these steps:

- Set the `APITag` parameter in the parameter JSON file to the latest git commit hash in the `master` branch.
- Run the `aws cloudformation update-stack` command.
- After confirming the template is usable, discard the changes to the `APITag` parameter in the parameter JSON file and commit the rest of the changes.
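This workflow can be scripted. The sketch below simulates bumping `APITag` in a parameters file: the file layout (one parameter object per line) and the commit hash are assumptions, and the actual `update-stack` / `git checkout` steps are left as comments since they depend on your stack and repository.

```shell
# Simulated parameters file (layout assumed: one parameter per line).
cat > params.json <<'EOF'
[
  {"ParameterKey": "APITag", "ParameterValue": "old-hash"},
  {"ParameterKey": "DesiredCount", "ParameterValue": "1"}
]
EOF
COMMIT=1a2b3c4   # in practice: COMMIT=$(git rev-parse master)
# Rewrite only the line that mentions APITag
sed -i "/\"APITag\"/s/\"ParameterValue\": \"[^\"]*\"/\"ParameterValue\": \"$COMMIT\"/" params.json
grep APITag params.json
# then: aws cloudformation update-stack ... --parameters file://params.json
# afterwards: git checkout -- params.json   # discard the APITag change only
```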
### Exec into a container in a Fargate/EC2 instance

Sometimes you might want to access the containers in Fargate/EC2 instances for debugging purposes. Follow the instructions below. Read this article for more information.

- Install SSM
- Make sure the ECS Task Role allows these actions:

```
"ssmmessages:CreateControlChannel"
"ssmmessages:CreateDataChannel"
"ssmmessages:OpenControlChannel"
"ssmmessages:OpenDataChannel"
```

- Make sure `EnableExecuteCommand` is `true` in the `AWS::ECS::Service` resource in the CloudFormation template. If you haven't set it, set it to `true` and update the CloudFormation stack.
- Manually stop all the tasks run by your ECS Service
- Wait until the new tasks are deployed
- Run this command to `exec` into the container:

```bash
aws ecs execute-command --cluster <ecs-cluster> --task <TASK ID> --command "/bin/bash" --interactive
```
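The four `ssmmessages` actions typically live in an IAM policy statement attached to the ECS Task Role. Below is a hypothetical sketch of such a policy document; the exact policy in this project's template may differ (e.g. a narrower `Resource`).

```shell
# Hypothetical inline policy document granting ECS Exec its SSM channels.
cat > ecs-exec-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
EOF
echo "actions: $(grep -c ssmmessages ecs-exec-policy.json)"
```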
### Mount the EFS shared volume to an EC2 bastion instance

The purpose of mounting the EFS shared volume is to manipulate the saved data. Follow the instructions below. Read this article for more detail.

- Add your IP to the inbound rules of the security group of the EC2 bastion instance
- Ask your admin for the `.pem` file to SSH into the EC2 bastion instance
- Make sure the NFS client is installed on the EC2 instance
- Create a directory to mount the EFS shared volume to:

```bash
sudo mkdir /mnt/new-folder
```

- Go to AWS Console > EFS to get the correct file system ID of the EFS shared volume
- Run:

```bash
sudo mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport <efs-file-system-id>.efs.<region>.amazonaws.com:/ /mnt/new-folder
```

- Validate the mount point:

```bash
df -aTh
```
## Clean up

- Remove the record that points to the ALB's DNS name in your domain service.
- Delete all related S3 buckets.
- Delete the CodePipeline and CodeBuild projects.
- Delete all the CloudFormation stacks one by one, starting from the most recently created one. Don't delete them all at once.
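Deleting the stacks one by one can be scripted. The snippet below only prints the commands so you can review the order; the stack names are the ones used in this guide, and endpoint B's stack goes first because it was created on top of endpoint A's infrastructure. `aws cloudformation wait stack-delete-complete` blocks until each deletion finishes.

```shell
# Compose delete commands newest-first; run them manually once reviewed.
CMDS=$(for STACK in add-ep-b create-ep-a; do
  echo "aws cloudformation delete-stack --stack-name $STACK"
  echo "aws cloudformation wait stack-delete-complete --stack-name $STACK"
done)
echo "$CMDS"
```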
## License

Distributed under the MIT License. See LICENSE for more information.

## Contact

Tung Dao - LinkedIn

Project Link: https://github.com/dao-duc-tung/ecs-ml-deploy-system