-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enable create workload for h150 #133
base: main
Are you sure you want to change the base?
Conversation
xpk.py
Outdated
' kubectl config view && kubectl config set-context --current' | ||
' --namespace=default' | ||
) | ||
if args.device_type == h150_device_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Obliviour Here we need to add --device-type
back to workload deletion, because A3 and A3+ handle zone/region differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get-credentials
command is searching for the cluster. XPK only creates regional clusters so it should only need to search for a cluster using --region arg.
A scenario where you would need --zone is if the cluster was created outside of xpk using --zone. I am guessing that is what happened in the h150 cluster you are testing with. So we shouldn't need a different code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial read, probably need to take a deeper look Monday.
@@ -2123,7 +2102,7 @@ def parse_env_config(args, tensorboard_config): | |||
for key, value in tensorboard_config.items(): | |||
env[key.upper()] = value | |||
|
|||
if device_type == h100_device_type: | |||
if device_type == h100_device_type or device_type == h150_device_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe change this to if accelerator_type = AcceleratorType['gpu']
?
@@ -3120,7 +3099,7 @@ def get_capacity_arguments_from_capacity_type( | |||
capacity_args = '--spot' | |||
case CapacityType.RESERVATION: | |||
capacity_args = ( | |||
f'--reservation-affinity=specific --reservation={args.reservation}' | |||
f'--reservation-affinity=specific --reservation={args.reservation} --placement-policy={args.reservation}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed in TPUs / CPUs? I see the build test failing with TPUs, I assume related to this.
xpk.py
Outdated
' kubectl config view && kubectl config set-context --current' | ||
' --namespace=default' | ||
) | ||
if args.device_type == h150_device_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get-credentials
command is searching for the cluster. XPK only creates regional clusters so it should only need to search for a cluster using --region arg.
A scenario where you would need --zone is if the cluster was created outside of xpk using --zone. I am guessing that is what happened in the h150 cluster you are testing with. So we shouldn't need a different code path.
if args.device_type == h100_device_type: | ||
return "kueue.x-k8s.io/queue-name: multislice-queue" # Name of the LocalQueue | ||
else: | ||
return "" #A3+ doesn't support Kueue yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why A3+ doesn't support kueue? and what is missing there
Fixes / Features
Testing / Documentation
Testing details.