[Newbie] Supporting Yunikorn and Kueue #5915
Signed-off-by: yuteng <a08h0283@gmail.com>
My preference is to use Kueue, or whichever option requires the fewest changes to the codebase and does not need many modifications to the pods.
## 1 Executive Summary

Provide Kubernetes (k8s) resource management, gang scheduling, and preemption for Flyte applications through third-party software, including Apache Yunikorn and Kueue.
Could you please explain what preemption means here compared to what preemption means in the context of spot instances on e.g. AWS or GCP?
Flyte supports multi-tenancy and various k8s plugins.

Kueue and Yunikorn support gang scheduling and preemption.
Gang scheduling guarantees the availability of certain K8s CRD services, such as Spark and Ray, with sufficient resources, and preemption makes sure high-priority tasks execute immediately.
guarantees the availability of certain K8s crd services, such as Spark
I would rather say that gang scheduling guarantees that all worker pods derived from a CRD are scheduled at the same time. Would add that this is important to prevent waste of resources when jobs can partially start without being able to do any meaningful work.
      gangscheduling: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=hard"
    - type: "spark"
      gangscheduling: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=hard"
```
Is this list complete or an example? I.e. will this also work for plugins like kubeflow pytorch, tf, mpi or dask, ...?
This is an example.
Admins can set a default gang scheduling configuration for each CRD in the Flyte k8s plugins.
      gangscheduling: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=hard"
```
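The `gangscheduling` value in the configuration is a space-separated list of `key=value` pairs. A minimal sketch of how such a string could be validated before it is passed on to the scheduler is shown here. The type `GangSchedulingParams` and the function `ParseGangSchedulingParams` are hypothetical names that do not appear in the proposal, and the accepted styles (`hard`, `soft`) are an assumption based on the values used in the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GangSchedulingParams is a hypothetical holder for the values carried in
// the gangscheduling string of the proposed configuration.
type GangSchedulingParams struct {
	PlaceholderTimeoutSeconds int
	Style                     string // assumed to be "hard" or "soft"
}

// ParseGangSchedulingParams splits a space-separated list of key=value pairs
// such as "placeholderTimeoutInSeconds=30 gangSchedulingStyle=hard" and
// rejects unknown keys or malformed values.
func ParseGangSchedulingParams(raw string) (GangSchedulingParams, error) {
	params := GangSchedulingParams{}
	for _, field := range strings.Fields(raw) {
		key, value, ok := strings.Cut(field, "=")
		if !ok {
			return params, fmt.Errorf("malformed parameter %q, want key=value", field)
		}
		switch key {
		case "placeholderTimeoutInSeconds":
			seconds, err := strconv.Atoi(value)
			if err != nil || seconds < 0 {
				return params, fmt.Errorf("invalid timeout %q", value)
			}
			params.PlaceholderTimeoutSeconds = seconds
		case "gangSchedulingStyle":
			style := strings.ToLower(value)
			if style != "hard" && style != "soft" {
				return params, fmt.Errorf("unknown gang scheduling style %q", value)
			}
			params.Style = style
		default:
			return params, fmt.Errorf("unknown parameter %q", key)
		}
	}
	return params, nil
}
```

Rejecting unknown keys early would surface configuration typos at propeller start-up instead of at pod scheduling time, which may or may not be the behavior the proposal wants.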

The configuration mentioned above indicates which queues exist for an org.
Could you please explain what an org is in this context? It's not the same as this org, right?
Hierarchical queues will be structured as follows:
root.org1.ray, root.org1.spark and root.org1.default.
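To make the naming scheme concrete, the following sketch derives a hierarchical queue path from an organization and a job type. The function `QueueName` and the `knownJobTypes` set are hypothetical, and the fallback to the `default` leaf for job types without a dedicated queue is an assumption, not something the proposal states.

```go
package main

import "strings"

// QueueName builds a hierarchical queue path such as "root.org1.ray".
// knownJobTypes is a hypothetical set of job types that have a dedicated
// queue; any other job type falls back to the organization's "default" queue.
func QueueName(org, jobType string, knownJobTypes map[string]bool) string {
	leaf := "default"
	if knownJobTypes[jobType] {
		leaf = jobType
	}
	return strings.Join([]string{"root", org, leaf}, ".")
}
```

With `knownJobTypes` containing `ray` and `spark`, a Ray task from `org1` would map to `root.org1.ray`, while a task of any other type would map to `root.org1.default`.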

ResourceFlavor allocates resources based on labels, which means that category-based resource allocation by organization label is available.
Could you please explain how the resource flavor will be determined? Is there a way to automatically derive this from the task decorator args @task(resources=..., accelerator=...)?
It would be really nice if tasks that need e.g. an A100 GPU were automatically not in the same queue as tasks that need 2 x T4 GPUs. We're using kubeflow pytorch jobs with scheduler plugins' gang scheduling and have observed jobs being starved that the cluster would have had resources for because other jobs which were trying to get different GPU types couldn't be scheduled but which had a higher priority.
No, the Kueue CRDs need to be created first.
A cluster queue defines a resource quota whose properties are defined by resource flavors.
I think creating resource flavors to categorize resources under a cluster queue is a workable solution.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "spot-t4"
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  nodeTaints:
    - effect: NoSchedule
      key: cloud.google.com/gke-accelerator
      value: "true"
  tolerations:
    - key: "spot-taint"
      operator: "Exists"
      effect: "NoSchedule"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "spot-t4"
          resources:
            - name: "cpu"
              nominalQuota: 9
            - name: "memory"
              nominalQuota: 36Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 50
        - name: "spot-a100"
          resources:
            - name: "cpu"
              nominalQuota: 18
            - name: "memory"
              nominalQuota: 72Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 100
On the other hand, Kueue preemption requires a Kueue WorkloadPriorityClass and patching the job with a label.
The plugin receives the preemption label and should then patch it onto the pods belonging to the same job.
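A rough sketch of the label patching described in this comment follows. It assumes the two label keys that Kueue documents for queue assignment and workload priority (`kueue.x-k8s.io/queue-name` and `kueue.x-k8s.io/priority-class`). The helper `WithKueueLabels` is a hypothetical name, and the choice not to overwrite labels that are already present is an assumption rather than part of the proposal.

```go
package main

// Label keys used by Kueue for queue assignment and workload priority.
const (
	kueueQueueNameLabel     = "kueue.x-k8s.io/queue-name"
	kueuePriorityClassLabel = "kueue.x-k8s.io/priority-class"
)

// WithKueueLabels returns a copy of existing with the Kueue labels added.
// Labels that are already set are kept, so values provided by the user win
// over the defaults from the configuration (an assumption of this sketch).
func WithKueueLabels(existing map[string]string, localQueue, priorityClass string) map[string]string {
	out := make(map[string]string, len(existing)+2)
	for k, v := range existing {
		out[k] = v
	}
	if _, ok := out[kueueQueueNameLabel]; !ok && localQueue != "" {
		out[kueueQueueNameLabel] = localQueue
	}
	if _, ok := out[kueuePriorityClassLabel]; !ok && priorityClass != "" {
		out[kueuePriorityClassLabel] = priorityClass
	}
	return out
}
```

The same map could then be applied both to the job object and to every pod template derived from it, so that all pods of one job carry identical queue and priority labels.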
root.org1.ray, root.org1.spark and root.org1.default.

ResourceFlavor allocates resources based on labels, which means that category-based resource allocation by organization label is available.
Thus, a ClusterQueue that includes multiple resources represents the total accessible resources for an organization.
I don't understand this sentence tbh, could you please explain/expand?
A tenant can submit organization-specific tasks to queues such as org.ray, org.spark and org.default to track which job types are submittable.

A SchedulerConfigManager maintains the configuration from the YAML mentioned above.
SchedulerConfigManager would be a Go struct, or are you suggesting a new backend service?
I don't see this in any of the code snippets below.
A SchedulerConfigManager maintains the configuration from the YAML mentioned above.
It patches labels or annotations on k8s resources after they pass the rules specified in the configuration.

```go
Why not have a single interface and two implementations of the same interface for yunikorn and kueue?
I would prefer that func (e *PluginManager) launchResource( does not call Kueue- or Yunikorn-specific code (see snippet below), but just a general interface whose implementation depends on the propeller config.
Agreed, I updated the document.
	// Add the label if the specific label doesn't exist
}
```
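The review comments in this thread ask for one scheduler-agnostic interface with a Yunikorn and a Kueue implementation. One possible shape is sketched here, not the design adopted in the PR. The names `SchedulerPlugin`, `ObjectMeta` and `NewSchedulerPlugin` are hypothetical, `ObjectMeta` is a stand-in for the Kubernetes object metadata so that the example stays self-contained, and the label keys (`queue` for Yunikorn, `kueue.x-k8s.io/queue-name` for Kueue) are assumptions taken from the respective projects' documentation.

```go
package main

import "fmt"

// ObjectMeta is a simplified stand-in for the Kubernetes object metadata.
type ObjectMeta struct {
	Labels      map[string]string
	Annotations map[string]string
}

// SchedulerPlugin is a hypothetical scheduler-agnostic interface. The caller
// only sees Mutate and never references Yunikorn or Kueue directly.
type SchedulerPlugin interface {
	Name() string
	Mutate(meta *ObjectMeta)
}

type yunikornPlugin struct{ queue string }

func (p yunikornPlugin) Name() string { return "yunikorn" }

// Mutate adds the queue label only if it doesn't exist yet.
func (p yunikornPlugin) Mutate(meta *ObjectMeta) {
	setIfAbsent(&meta.Labels, "queue", p.queue)
}

type kueuePlugin struct{ localQueue string }

func (p kueuePlugin) Name() string { return "kueue" }

// Mutate adds the local queue label only if it doesn't exist yet.
func (p kueuePlugin) Mutate(meta *ObjectMeta) {
	setIfAbsent(&meta.Labels, "kueue.x-k8s.io/queue-name", p.localQueue)
}

// setIfAbsent initializes the map when needed and keeps existing values.
func setIfAbsent(m *map[string]string, key, value string) {
	if *m == nil {
		*m = map[string]string{}
	}
	if _, ok := (*m)[key]; !ok {
		(*m)[key] = value
	}
}

// NewSchedulerPlugin picks the implementation from the configured scheduler.
func NewSchedulerPlugin(scheduler, queue string) (SchedulerPlugin, error) {
	switch scheduler {
	case "yunikorn":
		return yunikornPlugin{queue: queue}, nil
	case "kueue":
		return kueuePlugin{localQueue: queue}, nil
	default:
		return nil, fmt.Errorf("unsupported scheduler %q", scheduler)
	}
}
```

With this shape, the selection happens once when the configuration is loaded, and the launch path would only call `Mutate` on whatever implementation was chosen.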
Do any additional k8s resources have to be created for the queues or does a queue exist as soon as a pod has an annotation with a new queue name?
Yes. When adopting Kueue, Kueue CRDs describe the quota of a queue.
On the other hand, when adopting Yunikorn, queues are configured through the [Yunikorn configuration](https://yunikorn.apache.org/docs/user_guide/queue_config).
On the other hand, a Kueue scheduler plugin constructs labels including localQueueName and preemption.

```go
func (e *PluginManager) launchResource(ctx context.Context, tCtx pluginsCore.TaskExecutionContext) (pluginsCore.Transition, error) {
Maybe addObjectMetadata, which is called by launchResource, would be a better place to inject the required metadata. Or do we need to inject something other than labels/annotations?
	if err != nil {
		return pluginsCore.UnknownTransition, err
	}
	if o, err = e.SchedulerPlugin.MutateResourceForKueue(o); err == nil {
We should only mutate the resource if the plugin manager manages a plugin which the user configured a queue for, right? How will this matching be done? Just comparing this type string
queueconfig:
  scheduler: yunikorn
  jobs:
    - type: "ray"
to the name of the plugin?
	o, err := e.plugin.BuildResource(ctx, k8sTaskCtx)
	if err != nil {
		return pluginsCore.UnknownTransition, err
	}
Would be nice to not have Kueue-specific code here but a general interface, see this comment.
}
```
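One of the review comments on this snippet asks how a plugin would be matched against the `jobs` entries of `queueconfig`. A simple lookup by type string could look like the sketch here. `QueueConfig`, `JobConfig` and `JobFor` are hypothetical names that mirror the YAML keys, and the case-insensitive comparison is an assumption. When no entry matches, the caller would skip the mutation entirely.

```go
package main

import "strings"

// JobConfig mirrors one entry of the jobs list in the proposed queueconfig.
type JobConfig struct {
	Type           string
	GangScheduling string
}

// QueueConfig mirrors the proposed queueconfig section.
type QueueConfig struct {
	Scheduler string
	Jobs      []JobConfig
}

// JobFor returns the job entry whose type matches the plugin name. The
// second return value is false when the plugin has no configured queue,
// in which case the resource should be left untouched.
func (c QueueConfig) JobFor(pluginName string) (JobConfig, bool) {
	for _, job := range c.Jobs {
		if strings.EqualFold(job.Type, pluginName) {
			return job, true
		}
	}
	return JobConfig{}, false
}
```

Matching on a plain string keeps the configuration simple, but it ties the YAML values to plugin names, so a rename of a plugin would silently disable scheduling for it.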
When the batch scheduler in Flyte is Yunikorn, examples look like the following.
For example, this approach submits a Ray job owned by user1 in org1 to "root.org1.ray".
Where does flytepropeller know the user from? Or does the user not matter, as the label "root.org1.ray" suggests?
| Spark | v | x |
| Ray | v | v |
| Kubeflow | x | v |
From what I understand, one only needs to add labels/annotations on the worker pods. Can't we do this purely from flyte by modifying the pod template spec of the respective CRD? What do the operators have to do in addition to that?
Yes, the current implementation fetches the pod templates from the CRDs and patches labels onto them.
If operators later implement a mechanism to patch group labels for their CRDs to support gang scheduling, we can remove the code that generates group labels to reduce the maintenance overhead.
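The patching step described in this reply might look roughly like the following for Yunikorn. `PodTemplate` is a simplified stand-in for the pod template specs embedded in a CRD, `AnnotateTaskGroups` is a hypothetical helper, and using the role name (for example `head` or `worker`) as the task group name is an assumption. The annotation key `yunikorn.apache.org/task-group-name` is the one Yunikorn documents for gang scheduling.

```go
package main

// Annotation key documented by Yunikorn for assigning a pod to a task group.
const taskGroupNameAnnotation = "yunikorn.apache.org/task-group-name"

// PodTemplate is a simplified stand-in for a pod template embedded in a CRD.
type PodTemplate struct {
	Annotations map[string]string
}

// AnnotateTaskGroups sets the task group annotation on every pod template,
// keyed by role. Templates that already carry the annotation are left as
// they are, so an operator that sets it natively would take precedence.
func AnnotateTaskGroups(templates map[string]*PodTemplate) {
	for role, tpl := range templates {
		if tpl == nil {
			continue
		}
		if tpl.Annotations == nil {
			tpl.Annotations = map[string]string{}
		}
		if _, ok := tpl.Annotations[taskGroupNameAnnotation]; !ok {
			tpl.Annotations[taskGroupNameAnnotation] = role
		}
	}
}
```

Because existing annotations are preserved, this code would become a no-op for any CRD whose operator starts setting the group annotation itself, which matches the removal path outlined in the reply.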

Yunikorn and Kueue support gang scheduling, allowing all necessary pods to run simultaneously when the required resources are available.
Yunikorn provides preemption, calculating the priority of applications based on their priority class and the priority score of the queue where they are submitted, in order to trigger high-priority or emergency applications immediately.
Yunikorn's hierarchical queues include guaranteed resource settings and ACLs.
Nit: Could you please run the doc through a spelling checker? Thank you 🙇
Yes, I ran make spellcheck in the latest commit :)
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
## master #5915 +/- ##
==========================================
- Coverage 36.71% 33.15% -3.57%
==========================================
Files 1304 1013 -291
Lines 130081 107571 -22510
==========================================
- Hits 47764 35667 -12097
+ Misses 78147 68744 -9403
+ Partials 4170 3160 -1010
Signed-off-by: yuteng <yuteng@gmail.com>
Tracking issue
Related to #5575