Stress Tests MCAD using KWOK #469

Open
wants to merge 20 commits into
base: main
160 changes: 160 additions & 0 deletions test/stress-tests-kwok/gpu-tests.md
@@ -0,0 +1,160 @@
## MCAD GPU Request Performance test with KWOK
This experiment assumes that you have a KWOK controller and an MCAD controller running inside a kind cluster. If not, follow [these installation instructions](https://github.com/vishakha-ramani/multi-cluster-app-dispatcher/blob/main/test/perf-test/simulatingnodesandappwrappers.md).
The MCAD service for GPU requests is a little weird. Here is the experiment I ran and what I observed:

> Review comment (Member): "for gpu requests is little weird": can we change the wording and give the user a little more insight into the use case we ought to test, please?

1. Create two fake nodes with 8 GPUs each by running the script:
```
./nodes.sh
```
2. Check that the requested number of nodes have started:
```
% kubectl get nodes
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 27d v1.27.1
kwok-node-1 Ready agent 7s fake
kwok-node-2 Ready agent 7s fake
```
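
To confirm that each fake node really advertises 8 GPUs, the allocatable resources can be listed per node. This is a minimal sketch, assuming the GPUs are exposed under the resource name `nvidia.com/gpu` (the resource name is not shown in this document, so adjust it to whatever `nodes.sh` sets). `list_node_gpus` is a hypothetical helper name.

```shell
# Hypothetical helper: print each node's name and allocatable GPU count.
# Assumes the GPU resource is named nvidia.com/gpu; the dot inside the
# resource name is escaped so kubectl reads it as a single key.
list_node_gpus() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Running `list_node_gpus` after step 1 should list one row per node with its GPU count next to it.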

3. Submit an AW job that wraps two pods, each requesting 8 GPUs.
```
% ./kwokmcadperf.sh
Checking whether we have a valid cluster login or not...

Nice, looks like you're logged in
Checking MCAD Controller installation status

Nice, MCAD Controller is installed
Checking MCAD Controller installation status

Nice, the KWOK Controller is installed

How many fake KWOK appwrapper jobs do you want? 1
How many pods in a job? 2
How many GPUs do you want to allocate per pod? 8
jobs number is 1
Number of GPUs per pod: 8
Number of pods per AppWrapper: 2
...
...
```

4. We can see that the two pods are scheduled and run to completion.
```
% kubectl get pods
NAME READY STATUS RESTARTS AGE
fake-defaultaw-schd-spec-with-timeout-1-4r4t2 0/1 Completed 0 4s
fake-defaultaw-schd-spec-with-timeout-1-tx9d5 0/1 Completed 0 4s
```
Furthermore, they are scheduled on two different nodes (as they should be).
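
The node placement is not visible in the default `kubectl get pods` output. One way to check it is to read each pod's `.spec.nodeName` and count the distinct values. The sketch below assumes the AW pods are the only pods in the current namespace; `count_pod_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: count how many distinct nodes host pods in the
# current namespace, based on each pod's .spec.nodeName.
count_pod_nodes() {
  kubectl get pods --no-headers -o custom-columns='NODE:.spec.nodeName' \
    | sort -u | wc -l | tr -d ' '
}
```

For the two-pod AW above, a result of `2` would confirm that the pods landed on different nodes. `kubectl get pods -o wide` shows the same information in a `NODE` column.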


5. Delete the previous AW job
```
kubectl delete appwrapper fake-defaultaw-schd-spec-with-timeout-1
```

6. Create a new AW consisting of one pod requesting 16 gpus.
```
% ./kwokmcadperf.sh
Checking whether we have a valid cluster login or not...

Nice, looks like you're logged in
Checking MCAD Controller installation status

Nice, MCAD Controller is installed
Checking MCAD Controller installation status

Nice, the KWOK Controller is installed

How many fake KWOK appwrapper jobs do you want? 1
How many pods in a job? 1
How many GPUs do you want to allocate per pod? 16
jobs number is 1
Number of GPUs per pod: 16
Number of pods per AppWrapper: 1
...
...
```

7. The pod is scheduled to one of the fake nodes (which in theory it should not be, since each node has only 8 GPUs).

> Review comment (Member): Curious, did we ever print a histogram inside MCAD and see how it looks? Also, can we mention which version of MCAD was used for testing?

```
% kubectl get pods
NAME READY STATUS RESTARTS AGE
fake-defaultaw-schd-spec-with-timeout-1-v7qbk 0/1 Completed 0 40s
```

8. Delete the previous AW job
```
kubectl delete appwrapper fake-defaultaw-schd-spec-with-timeout-1
```

9. Create a new AW consisting of one pod requesting 24 GPUs.
```
% ./kwokmcadperf.sh
Checking whether we have a valid cluster login or not...

Nice, looks like you're logged in
Checking MCAD Controller installation status

Nice, MCAD Controller is installed
Checking MCAD Controller installation status

Nice, the KWOK Controller is installed

How many fake KWOK appwrapper jobs do you want? 1
How many pods in a job? 1
How many GPUs do you want to allocate per pod? 24
jobs number is 1
Number of GPUs per pod: 24
Number of pods per AppWrapper: 1
...
...
```
10. The AW job is now in the queue and is pending.
```
% kubectl describe appwrapper fake-defaultaw-schd-spec-with-timeout-1
...
...
Status:
Conditions:
Last Transition Micro Time: 2023-06-21T14:12:40.279735Z
Last Update Micro Time: 2023-06-21T14:12:40.279734Z
Status: True
Type: Init
Last Transition Micro Time: 2023-06-21T14:12:40.280678Z
Last Update Micro Time: 2023-06-21T14:12:40.280677Z
Reason: AwaitingHeadOfLine
Status: True
Type: Queueing
Last Transition Micro Time: 2023-06-21T14:12:40.289959Z
Last Update Micro Time: 2023-06-21T14:12:40.289958Z
Reason: FrontOfQueue.
Status: True
Type: HeadOfLine
Last Transition Micro Time: 2023-06-21T14:12:40.297836Z
Last Update Micro Time: 2023-06-21T14:12:40.297836Z
Message: Insufficient resources to dispatch AppWrapper.
Reason: AppWrapperNotRunnable.
Status: True
Type: Backoff
Controllerfirsttimestamp: 2023-06-21T14:12:40.279730Z
Filterignore: true
Queuejobstate: HeadOfLine
Sender: before ScheduleNext - setHOL
State: Pending
Systempriority: 9
```
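
When scripting this check, the state can be read directly instead of scanning the full `describe` output. This is a sketch under the assumption that the `State` field printed above corresponds to `.status.state` in the AppWrapper object; `aw_state` is a hypothetical helper name.

```shell
# Hypothetical helper: print the dispatch state of one AppWrapper.
# Assumes the "State" field shown by `kubectl describe` maps to
# .status.state in the object's JSON.
aw_state() {
  kubectl get appwrapper "$1" -o jsonpath='{.status.state}'
}
```

For example, `aw_state fake-defaultaw-schd-spec-with-timeout-1` queries the AW created in step 9.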

11. Add a fake node with 8 GPUs to the cluster (at this point, the cluster has 24 GPUs in total, spread evenly across 3 nodes).
```
% kubectl apply -f fake-node.yaml
node/fake-node-1 created
```
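
The contents of `fake-node.yaml` are not shown here. As a rough illustration only, a KWOK-managed node with 8 GPUs could be generated as below. The GPU resource name `nvidia.com/gpu`, the CPU, memory, and pod quantities, the `agent` role label, and the helper name `fake_node_manifest` are all assumptions, not values taken from the actual file.

```shell
# Hypothetical helper: emit a minimal KWOK-style fake Node manifest.
# $1 = node name, $2 = GPU count. All resource values are illustrative.
fake_node_manifest() {
  cat <<EOF
apiVersion: v1
kind: Node
metadata:
  name: $1
  annotations:
    kwok.x-k8s.io/node: fake
  labels:
    type: kwok
    node-role.kubernetes.io/agent: ""
status:
  allocatable:
    cpu: "32"
    memory: 256Gi
    pods: "110"
    nvidia.com/gpu: "$2"
  capacity:
    cpu: "32"
    memory: 256Gi
    pods: "110"
    nvidia.com/gpu: "$2"
  phase: Running
EOF
}
```

The output can be piped straight into the cluster, for example `fake_node_manifest fake-node-1 8 | kubectl apply -f -`.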

12. The job is now dispatched, and runs to completion.
```
% kubectl get pods
NAME READY STATUS RESTARTS AGE
fake-defaultaw-schd-spec-with-timeout-1-fb649 0/1 Completed 0 7s
```

13. This tells us that, at least with KWOK fake nodes, MCAD looks at the aggregate GPU resources across the cluster, rather than the capacity of any single node, before making a dispatch decision.
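
The arithmetic behind this conclusion can be sketched by comparing two admission rules against the node sizes used above. This is an illustration of the observed behaviour, not MCAD's actual code; `fits_aggregate` and `fits_per_node` are hypothetical names.

```shell
# Aggregate rule: the request fits if it is no larger than the sum of
# all node capacities. $1 = requested GPUs, remaining args = per-node GPUs.
fits_aggregate() {
  request=$1; shift
  total=0
  for cap in "$@"; do total=$((total + cap)); done
  [ "$request" -le "$total" ]
}

# Per-node rule: the request fits only if a single node can hold it.
fits_per_node() {
  request=$1; shift
  for cap in "$@"; do
    if [ "$request" -le "$cap" ]; then return 0; fi
  done
  return 1
}

fits_aggregate 16 8 8 && echo "16 GPUs on 2 nodes: passes aggregate check"
fits_per_node 16 8 8 || echo "16 GPUs on 2 nodes: fails per-node check"
fits_aggregate 24 8 8 || echo "24 GPUs on 2 nodes: fails aggregate check"
fits_aggregate 24 8 8 8 && echo "24 GPUs on 3 nodes: passes aggregate check"
```

The aggregate rule matches all three observations (16 GPUs dispatched on two nodes, 24 GPUs pending on two nodes, 24 GPUs dispatched on three nodes), while the per-node rule would have rejected the 16-GPU pod in step 7.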