-
-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature request] EC2 Fleet provider #394
Comments
Last time I tried to use EC2 Fleet I found this: cdk-github-runners/src/providers/ec2.ts Line 379 in da6a0c0
If you find a good way to get around those issues, I'd be happy to merge a PR for it. We might be able to do it with a Lambda function. I have been trying to avoid those to keep the Step Function simple. But we have so many Lambda functions around, how bad can one more be? |
@kichik you can define a Launch Template with User Data, Security Groups, Disks, etc and then use that Launch Template in the Fleet Request. Sorry, I don't have time currently to work up a full PR, but something like this: new stepfunctions_tasks.CallAwsService(this, "FleetRequest", {
comment: "FleetRequest",
integrationPattern: IntegrationPattern.WAIT_FOR_TASK_TOKEN,
service: 'ec2',
action: 'fleetRequest',
heartbeatTimeout: stepfunctions.Timeout.duration(Duration.minutes(10)),
iamResources: ['*'],
parameters: {
TargetCapacitySpecification: {
TotalTargetCapacity: 1,
OnDemandTargetCapacity: 0,
SpotTargetCapacity: 1,
DefaultTargetCapacityType: 'spot',
},
LaunchTemplateConfigs: this.subnets.map((subnet, _) => {
return {
LaunchTemplateSpecification: {
LaunchTemplateId: this.ami.launchTemplate.launchTemplateId,
Version: this.ami.launchTemplate.versionNumber.toString(),
},
Overrides: {
InstanceType: this.instanceType.toString(),
IamInstanceProfile: {
Arn: instanceProfile.attrArn,
},
InstanceInitiatedShutdownBehavior: ec2.InstanceInitiatedShutdownBehavior.TERMINATE,
SubnetId: subnet.subnetId,
MetadataOptions: {
HttpTokens: 'required',
},
SecurityGroupIds: this.securityGroups.map(sg => sg.securityGroupId),
BlockDeviceMappings: [{
DeviceName: rootDeviceResource.ref,
Ebs: {
DeleteOnTermination: true,
VolumeSize: this.storageSize.toGibibytes(),
},
}],
UserData: stepfunctions.JsonPath.format(
stepfunctions.JsonPath.stringAt('$.ec2.userdataTemplate'),
...params,
),
}
}
}),
},
}); |
@cynicaljoy where do you see |
@kichik err sorry, napkin code. I'm translating this from a Go utility service I have. How I'm handling that is I first create a new version of the LaunchTemplate w/the updated UserData and then I use that version of the Launch Template in the Fleet request. Then I cleanup the oldest version of the LaunchTemplate to ensure that we don't grow the LaunchTemplate versions beyond the account limit which is 1000 by default. |
I don't love creating a new resource for every runner. Combined with the idle reaper, that may cause launch templates to be left behind and reach that 1000 limit. It can also cause issues with deployments as we will be modifying resources created by CloudFormation. The "cleanest" method (and those quotes are working real hard) I could think of so far was to create a queue for each fleet runner type. The instances could then pull their next runner task from the queue when they boot up. It will take real careful work to avoid having old jobs remain in the queue. What will take them off the queue if EC2 fails to launch, for example? |
what about making the user data more idempotent? rather than the params being populated into the template and added to the launch template, those values are added to param store or secrets manager w/the task token as part of the key and the task token is added to the resource tags. then the idle reaper can clean up. |
The problem with param store/secrets manager is that we need to be able to spin up multiple instances at the same time. They will override each other. |
Am I looking in the right place? (PowerShell below that too of course). I was thinking the workflow would be to create unique secret item (whether Secrets Manager or Parameter Store): e.g. Of course, this would be mean some additional required IAM scope and the secret item path should be configurable. |
Yes.
Any resource being created by the step function (workflow) will pose a problem as it is implemented right now. Idle reaper will abort step functions which means created resources can be left behind. CreateFleet does allow overriding instance resource tags for |
Any update on this issue? We really need multiple instance types and purchase options for our runners to optimise the cost using spot instances. |
@choco-nishant #518 is working IIRC. It's just limited to I believe 1 hour jobs. I haven't had a chance yet to fix that. But if one hour is enough for you, you can already use that branch. |
Thanks. I'll try it out since we already set a 60-minute timeout for the jobs. |
The current EC2 provider is awesome. I'm just wondering we could implement a similar provider but using EC2 Fleet as an actual provisioner. It is neat because it can configure multiple instance types and purchase options with built-in strategies and would be super useful to find the lowest spot price without complicated logic, for example.
If you don't have any plan yet. I could work on it by implementing a new provider like
Ec2FleetProvider
but trying to align with the props ofEc2Provider
as much as possible so that users can easily switch from one to the other.Let me know your opinions!
The text was updated successfully, but these errors were encountered: