
[🐛 Bug]: Issue with Scaling Already Scaled Deployments #2464

Closed
Doofus100500 opened this issue Nov 13, 2024 · 34 comments · Fixed by kedacore/keda#6368

Comments

@Doofus100500 (Contributor)

What happened?

Hi, thank you so much for refactoring the grid scaler, it’s really awesome. However, I’m noticing issues with scaling already scaled deployments. Here’s an example:

I request 10 sessions, and 10 pods with browsers are launched. Then I request an additional 20 sessions, but only 10 more pods are added to the existing ones. This results in 10 sessions remaining in the queue until the previous sessions are completed.

I think the scaling logic should work as follows: if there are already 10 pods running, then when 20 more sessions are queued, the total should scale up to 30 pods.
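The expected arithmetic can be sketched as follows (a hypothetical illustration, not the scaler's actual code; `sessions_per_node` stands in for the nodeMaxSessions trigger parameter):

```python
import math

# Hypothetical sketch of the expected scaling arithmetic: the target replica
# count should cover sessions already running plus everything still queued.
def expected_replicas(running_sessions: int, queued_sessions: int,
                      sessions_per_node: int = 1) -> int:
    return math.ceil((running_sessions + queued_sessions) / sessions_per_node)

# 10 sessions running on 10 pods, then 20 more are queued:
print(expected_replicas(10, 20))  # 30 pods in total, not 20
```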

keda version: v2.16.0

(screenshot)

Command used to start Selenium Grid with Docker (or Kubernetes)

helm

Relevant log output

add a screenshot from grafana

Operating System

k8s

Docker Selenium version (image tag)

4.26.0-20241101

Selenium Grid chart version (chart version)

0.37.1


@Doofus100500, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then add the I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96 (Member)

I am not sure if this is a scaler logic issue or KEDA core behavior.
If possible, can you give me the GraphQL response (in JSON) when it gets into that situation? I want some data to test the scaler logic.
Use this query for GraphQL:

"query": "{ grid { sessionCount, maxSession, totalSlots }, nodesInfo { nodes { id, status, sessionCount, maxSession, slotCount, stereotypes, sessions { id, capabilities, slot { id, stereotype } } } }, sessionsInfo { sessionQueueRequests } }"
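For anyone reproducing this, here is a minimal sketch of posting that query to the Grid's GraphQL endpoint (the endpoint URL is an assumption; adjust host and port for your deployment):

```python
import json
import urllib.request

# The GraphQL query from above, wrapped in the JSON payload the Grid expects.
QUERY = ("{ grid { sessionCount, maxSession, totalSlots }, "
         "nodesInfo { nodes { id, status, sessionCount, maxSession, slotCount, "
         "stereotypes, sessions { id, capabilities, slot { id, stereotype } } } }, "
         "sessionsInfo { sessionQueueRequests } }")

def build_request(grid_url: str) -> urllib.request.Request:
    payload = json.dumps({"query": QUERY}).encode("utf-8")
    return urllib.request.Request(
        grid_url, data=payload,
        headers={"Content-Type": "application/json"}, method="POST")

if __name__ == "__main__":
    # Assumed endpoint; typically the hub address plus /graphql.
    req = build_request("http://localhost:4444/graphql")
    with urllib.request.urlopen(req) as resp:
        print(json.dumps(json.load(resp), indent=2))
```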


@miguel-cardoso-mindera

I'm having similar issues; the scaler is not working properly, as seen in the image:
(screenshot)

ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium4-in-cluster-local-selenium-chrome-scaledobject
  namespace: selenium
  labels:
    deploymentName: selenium4-in-cluster-local-selenium-chrome-node
spec:
  maxReplicaCount: 1000
  pollingInterval: 5
  scaleTargetRef:
    name: selenium4-in-cluster-local-selenium-chrome-node
  triggers:
    - type: selenium-grid
      metadata:
        browserName: 'chrome'
      authenticationRef:
        name: selenium4-in-cluster-local-selenium-scaler-trigger-auth

I'd expect that if the queue size is 50, it would add 50 pods, since it checks every 5s. Instead it adds only a few pods at a time, and the queue grows in size.

@VietND96 (Member)

I used the data to test the scaler logic, and the count was returned correctly.
I tried updating the triggers config per the PR changes: updated metricType from AverageValue to Value and enabled cached metrics.
A new chart version will be out soon for your evaluation.
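Roughly, the difference between the two metric types can be sketched as follows (a simplification of the HPA arithmetic that ignores tolerance, stabilization windows, and min/max clamping):

```python
import math

# Simplified HPA arithmetic, for illustration only. With AverageValue the
# metric is divided by a per-pod target; with Value it is compared to the
# target as an absolute number, scaled by the current replica count.
def desired_average_value(total_metric: float, target_per_pod: float) -> int:
    return math.ceil(total_metric / target_per_pod)

def desired_value(current_replicas: int, metric: float, target: float) -> int:
    return math.ceil(current_replicas * (metric / target))

# 30 queued+running sessions, target of 1 session per pod:
print(desired_average_value(30, 1))  # 30
# Value type: 10 replicas, metric 30, target 10:
print(desired_value(10, 30, 10))     # 30
```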


@VietND96 (Member)

So, it is scaling with the formula 10 x 10 ~= 99 instances?
What if we do not specify advanced in the resource spec?

@Doofus100500 (Contributor, Author)

Apparently, yes.
I did it like this:

spec:
  maxReplicaCount: 220
  minReplicaCount: 0
  pollingInterval: 5
  scaleTargetRef:
    kind: Deployment
    name: selenium-grid-selenium-chrome-node-v120
  triggers:
  - authenticationRef:
      name: selenium-grid-selenium-scaler-trigger-auth
    metadata:
      browserName: chrome
      browserVersion: "120"
      nodeMaxSessions: "1"
      platformName: linux
      sessionBrowserName: chrome
      unsafeSsl: "true"
    metricType: Value
    type: selenium-grid
    useCachedMetrics: true

I’m still getting 99, but it scales smoothly =)

@VietND96 (Member)

Yes, but it seems not optimal. Let me look for any configs needed for the ScaledObject.
By the way, can you share your tests as a standalone script that I can use to reproduce the issue and test the config?

@VietND96 (Member)

Can you try to set useCachedMetrics: false (or remove it) to see how it scales?

@Doofus100500 (Contributor, Author)

Unfortunately, I can’t share it directly as it accesses an internal resource, but there’s nothing overly complex in it. Here it is (I removed sensitive information):

[Test]
public void EnterToPortalUserProfilePage()
{
    Execute(
        driver =>
        {
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            driver.Manage().Window.Maximize();
            Thread.Sleep(10.Seconds());
            driver.Navigate().GoToUrl("https://someurl/");
            var loginAndPassTab = wait.Until(driver => driver.FindElement(By.CssSelector("a[data-tid='tab_login']")));
            loginAndPassTab.Click();
            var loginInput = wait.Until(driver => driver.FindElement(By.CssSelector("input[name='login']")));
            loginInput.Clear();
            Thread.Sleep(10.Seconds());
            loginInput.SendKeys("aaa@aaa.com");
            var passwordInput = wait.Until(driver => driver.FindElement(By.CssSelector("input[type='password']")));
            passwordInput.Clear();
            Thread.Sleep(10.Seconds());
            passwordInput.SendKeys("qwe123");
            var signinButton = wait.Until(driver => driver.FindElement(By.CssSelector("div[data-tid='btn-login']")));
            signinButton.Click();
            wait.Until(driver => driver.Title == "LK");
            var fullnameSpan = wait.Until(driver => driver.FindElement(By.CssSelector("span[data-test-id='fullname-label']")));
            fullnameSpan.Text.Should().Be("GridTest");
        });
}

@Doofus100500 (Contributor, Author)

> Can you try to set useCachedMetrics: false (or remove it) to see how it scales?

It scales to 84 pods.

@VietND96 (Member)

I guess if we increase the pollingInterval (e.g., to roughly how long a Node pod takes to come up, get ready, and register with the Hub), the number of replicas will be reduced too.

@VietND96 VietND96 pinned this issue Nov 14, 2024
@Doofus100500 (Contributor, Author)

I changed pollingInterval to the default (https://keda.sh/docs/2.15/reference/scaledobject-spec/#pollinginterval).
Now I'm getting 36 pods.
But I think this might not be the right approach; for some reason, it used to scale more accurately before.

@VietND96 (Member)

Did you also have a chance to try the ScaledJob behavior? Let me think about how to make deployment scaling more accurate.

@Doofus100500 (Contributor, Author)

I haven’t tried the ScaledJob behavior yet. I think it won’t be quick.

@Doofus100500 (Contributor, Author)

Doofus100500 commented Nov 19, 2024

@VietND96 Hi.
Are we sure we shouldn’t reopen this issue?

@VietND96 (Member)

Yes, I reopened it to keep tracking.

@VietND96 VietND96 reopened this Nov 22, 2024
@Doofus100500 (Contributor, Author)

Switched scaling to ScaledJob behavior. So far, everything seems great, I think I’ll stick with it.

@VietND96 (Member)

What do you think about the pros and cons of these 2 kinds of scaling types? I will probably collect some feedback and add it to the README to help people decide which should be used :)

@Doofus100500 (Contributor, Author)

So far, I've noticed that scaling with deployments provided more accurate results in terms of the number of pods, and also eliminated the startup delay, since when a session arrives the pod is already running.

@Doofus100500 (Contributor, Author)

Doofus100500 commented Nov 25, 2024

@VietND96 Hi! While working with jobs, I came up with the following approach that might be useful. Add the following to the video container:

lifecycle:
  preStop:
    exec:
      command:
      - bash
      - -c
      - "while pgrep -f 'java.*selenium' | grep -v $$; do sleep 5; done"

Then the video container will not stop before the browser container.

And we can add this to jobTargetRef:

activeDeadlineSeconds: 300

This will terminate erroneously created jobs.

@VietND96 (Member)

@Doofus100500, setting activeDeadlineSeconds for a ScaledJob is something I haven't tried. For people who have long test executions, I am afraid that config will terminate the Job abruptly (similar to the cooldown period in a ScaledObject).

@VietND96 (Member)

Btw, I raised a possible fix for the scaler via kedacore/keda#6368. In this repo, I will deliver patched KEDA images for fast preview and evaluation.

@JorTurFer

Hello,
I'm Jorge, one of KEDA's maintainers 😄

> What do you think about the cons and pros of these 2 kinds of scaling types?

The main difference is that with ScaledObjects, the pod's lifecycle is managed by the HPA controller, which can produce unexpected disruptions when a pod removed during scale-in is still executing a process.
On the other hand, ScaledJobs spawn multiple jobs to process the items in the queue, which handles scale-in better because KEDA basically doesn't kill the jobs: it waits until they finish, cleans up the finished jobs, and spawns more. This isn't free of charge; spawning a container per item in the queue usually takes more time than having a workload already prepared to process the incoming traffic. When a job finishes, KEDA has to spawn another job to handle more items from the queue, whereas a workload like a deployment is constantly processing items whenever it's free.

To mitigate the issues generated during scale-in with ScaledObjects, the suggestion for Selenium was to register a preStop hook that drains the node, so it finishes current jobs and takes no more. This approach is fine if the grace period is enough to safely drain the node.

@JorTurFer

> I'd expect that if the queue size is 50, it would add 50 pods, as it checks every 5s. Instead it adds a few pods only at the time and queue grows in size

This isn't totally correct. The scaling process has 2 phases: from/to zero, and 1↔N. For the from/to zero phase, pollingInterval is the execution period; for 1↔N scaling, the HPA controller is responsible for checking the value, at an interval configurable via admin parameters on the control plane (managed Kubernetes offerings don't expose this value, and the usual value is 15 seconds).

@miguel-cardoso-mindera

Hey @JorTurFer thanks for your comment.

My point still stands: regardless of the time it takes to check, the queue is never dealt with for hours. I never see it create as many pods as there are jobs in the queue. It slowly adds a couple of pods each time, and as the devs schedule tests over time, the queue keeps growing or stays around the same size.

@JorTurFer

JorTurFer commented Nov 26, 2024

I'm checking (KEDA's) scaler logic, because the value returned to the HPA probably isn't correct. I'm not sure about some of the logic there, but I need to debug it. My suspicion is that this statement isn't correct:
https://github.com/kedacore/keda/blob/29400ed2816fe7388a021ff14f5cb405b7deaa85/pkg/scalers/selenium_grid_scaler.go#L376-L380

It should return the total queue length (and we didn't notice it at the time, sorry for that :/), including the jobs in progress as well as the jobs pending, and I think that it's returning only the number of nodes to create. This can work quite accurately with ScaledJobs using the accurate strategy, but it's not the best. I'm checking.

@VietND96 (Member)

I'm referring to the GitHub runner scaler, which is similar to Selenium Grid in its use case. It also simply counts the queue length (with some extra comparison against available runners) and returns the queue size:
https://github.com/kedacore/keda/blob/29400ed2816fe7388a021ff14f5cb405b7deaa85/pkg/scalers/github_runner_scaler.go#L663C30-L663C52
For ScaledObject, I still suspect the condition in the method GetMetricsAndActivity() is the root cause.
For the ScaledJobs strategy, I recently updated the default value in the chart configs to eager (#2466, not released yet); refer to the example with incoming requests before/after poll (https://keda.sh/docs/2.16/reference/scaledjob-spec/#scalingstrategy).

@Doofus100500 (Contributor, Author)

> set activeDeadlineSeconds for scaled Job, which I haven't tried. For people who have a long test execution, I am afraid that config will terminate the Job suddenly (similar to cooldown period in ScaledObject)

I forgot to mention that in all of this, terminationGracePeriodSeconds = 3600.

@Doofus100500 (Contributor, Author)

> For ScaledJobs strategy, recently in chart configs, I updated the default value to eager

I tried switching the strategy to eager, but it resulted in a higher number of mistakenly created jobs when there was a large queue of sessions (my pollingInterval = 20).

@VietND96 (Member)

VietND96 commented Dec 2, 2024

I have discussed with @JorTurFer to understand what value the scaler should expose to KEDA: it should be the sum of the in-progress sessions plus the pending sessions. The PR kedacore/keda#6368 was updated accordingly.
I will publish the testing results (just small scale) via https://github.com/SeleniumHQ/docker-selenium/tree/trunk/.keda for reference.
For the ScaledJobs strategy: the default is expected to work correctly without mistakenly creating jobs.
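The effect of that change can be illustrated with a hypothetical sketch (function names are illustrative, not the actual Go code in the PR):

```python
# Illustrative comparison only: if the scaler reports just the pending queue,
# the HPA target undercounts the nodes that are already busy.
def metric_pending_only(ongoing: int, pending: int) -> int:
    return pending              # old behavior: queue length only

def metric_total(ongoing: int, pending: int) -> int:
    return ongoing + pending    # fixed: in-progress plus pending sessions

# 10 sessions running, 20 queued, per-node target of 1 session:
print(metric_pending_only(10, 20))  # 20 -> only 20 replicas, 10 short
print(metric_total(10, 20))         # 30 -> covers busy nodes and the queue
```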

@VietND96 (Member)

VietND96 commented Dec 3, 2024

Autoscaling test results have been published at https://github.com/SeleniumHQ/docker-selenium/tree/trunk/.keda

@VietND96 VietND96 closed this as completed Dec 4, 2024
@VietND96 (Member)

KEDA core 2.16.1 is out with all the fixes needed. You can try the latest chart and verify.
Note that in the scaler trigger params there are no longer default values of latest for browserVersion and linux for platformName. By default they are empty, letting the user input the values. The value should match across the session request, the Node stereotype, and the scaler trigger params to get correct scaling behavior. Read more details in PR kedacore/keda#6437.
