
[🐛 Bug]: Issue with Scaling Already Scaled Deployments #2464

Closed
Doofus100500 opened this issue Nov 13, 2024 · 34 comments · Fixed by kedacore/keda#6368

Comments

@Doofus100500 (Contributor)

What happened?

Hi, thank you so much for refactoring the grid scaler, it’s really awesome. However, I’m noticing issues with scaling already scaled deployments. Here’s an example:

I request 10 sessions, and 10 pods with browsers are launched. Then I request an additional 20 sessions, but only 10 more pods are added to the existing ones. This results in 10 sessions remaining in the queue until the previous sessions are completed.

I think the scaling logic should work as follows: if there are already 10 pods running, then when 20 more sessions are queued, the total should scale up to 30 pods.
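The expected arithmetic can be sketched as follows (a hypothetical illustration, not the scaler's actual code; `sessions_per_node` stands in for the nodeMaxSessions trigger parameter):

```python
import math

# Hypothetical sketch of the expected scaling arithmetic: the target replica
# count should cover sessions already running plus everything still queued.
def expected_replicas(running_sessions: int, queued_sessions: int,
                      sessions_per_node: int = 1) -> int:
    return math.ceil((running_sessions + queued_sessions) / sessions_per_node)

# 10 sessions running on 10 pods, then 20 more are queued:
print(expected_replicas(10, 20))  # 30 pods in total, not 20
```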

keda version: v2.16.0

(screenshot)

Command used to start Selenium Grid with Docker (or Kubernetes)

helm

Relevant log output

add a screenshot from grafana

Operating System

k8s

Docker Selenium version (image tag)

4.26.0-20241101

Selenium Grid chart version (chart version)

0.37.1


@Doofus100500, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then add the I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96 (Member)

I am not sure if this is a scaler logic issue or KEDA core behavior.
If possible, can you give me the GraphQL response (in JSON) when it gets into that situation? I want some data to test the scaler logic.
Use this query for GraphQL:

"query": "{ grid { sessionCount, maxSession, totalSlots }, nodesInfo { nodes { id, status, sessionCount, maxSession, slotCount, stereotypes, sessions { id, capabilities, slot { id, stereotype } } } }, sessionsInfo { sessionQueueRequests } }"
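For anyone reproducing this, here is a minimal sketch of posting that query to the Grid's GraphQL endpoint (the endpoint URL is an assumption; adjust host and port for your deployment):

```python
import json
import urllib.request

# The GraphQL query from above, wrapped in the JSON payload the Grid expects.
QUERY = ("{ grid { sessionCount, maxSession, totalSlots }, "
         "nodesInfo { nodes { id, status, sessionCount, maxSession, slotCount, "
         "stereotypes, sessions { id, capabilities, slot { id, stereotype } } } }, "
         "sessionsInfo { sessionQueueRequests } }")

def build_request(grid_url: str) -> urllib.request.Request:
    payload = json.dumps({"query": QUERY}).encode("utf-8")
    return urllib.request.Request(
        grid_url, data=payload,
        headers={"Content-Type": "application/json"}, method="POST")

if __name__ == "__main__":
    # Assumed endpoint; typically the hub address plus /graphql.
    req = build_request("http://localhost:4444/graphql")
    with urllib.request.urlopen(req) as resp:
        print(json.dumps(json.load(resp), indent=2))
```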


@miguel-cardoso-mindera

I'm having similar issues; the scaler is not working properly, as seen in the image:
(screenshot)

ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium4-in-cluster-local-selenium-chrome-scaledobject
  namespace: selenium
  labels:
    deploymentName: selenium4-in-cluster-local-selenium-chrome-node
spec:
  maxReplicaCount: 1000
  pollingInterval: 5
  scaleTargetRef:
    name: selenium4-in-cluster-local-selenium-chrome-node
  triggers:
    - type: selenium-grid
      metadata:
        browserName: 'chrome'
      authenticationRef:
        name: selenium4-in-cluster-local-selenium-scaler-trigger-auth

I'd expect that if the queue size is 50, it would add 50 pods, since it checks every 5s. Instead it adds only a few pods at a time, and the queue grows in size.

@VietND96 (Member)

I used the data to test the scaler logic, and the count was returned correctly.
I tried updating the triggers config per the PR changes: updated metricType from AverageValue to Value and enabled cached metrics.
A new chart version will be out soon for your evaluation.
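Roughly, the difference between the two metric types can be sketched as follows (a simplification of the HPA arithmetic that ignores tolerance, stabilization windows, and min/max clamping):

```python
import math

# Simplified HPA arithmetic, for illustration only. With AverageValue the
# metric is divided by a per-pod target; with Value it is compared to the
# target as an absolute number, scaled by the current replica count.
def desired_average_value(total_metric: float, target_per_pod: float) -> int:
    return math.ceil(total_metric / target_per_pod)

def desired_value(current_replicas: int, metric: float, target: float) -> int:
    return math.ceil(current_replicas * (metric / target))

# 30 queued+running sessions, target of 1 session per pod:
print(desired_average_value(30, 1))  # 30
# Value type: 10 replicas, metric 30, target 10:
print(desired_value(10, 30, 10))     # 30
```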


@VietND96 (Member)

So, it is scaling with the formula 10 x 10 ~= 99 instances?
What if we do not specify advanced in the resource spec?

@Doofus100500 (Contributor, Author)

Apparently, yes.
I did it like this:

spec:
  maxReplicaCount: 220
  minReplicaCount: 0
  pollingInterval: 5
  scaleTargetRef:
    kind: Deployment
    name: selenium-grid-selenium-chrome-node-v120
  triggers:
  - authenticationRef:
      name: selenium-grid-selenium-scaler-trigger-auth
    metadata:
      browserName: chrome
      browserVersion: "120"
      nodeMaxSessions: "1"
      platformName: linux
      sessionBrowserName: chrome
      unsafeSsl: "true"
    metricType: Value
    type: selenium-grid
    useCachedMetrics: true

I’m still getting 99, but it scales smoothly =)

@VietND96 (Member)

Yes, but it seems not optimal. Let me look for any configs needed for the ScaledObject.
By the way, can you share your tests as a standalone script that I can use to reproduce the issue and test the config?

@VietND96 (Member)

Can you try to set useCachedMetrics: false (or remove it) to see how it scales?

@Doofus100500 (Contributor, Author)

Unfortunately, I can’t share it directly as it accesses an internal resource, but there’s nothing overly complex in it. Here it is (I removed sensitive information):

[Test]
public void EnterToPortalUserProfilePage()
{
    Execute(
        driver =>
        {
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            driver.Manage().Window.Maximize();
            Thread.Sleep(10.Seconds());
            driver.Navigate().GoToUrl("https://someurl/");
            var loginAndPassTab = wait.Until(driver => driver.FindElement(By.CssSelector("a[data-tid='tab_login']")));
            loginAndPassTab.Click();
            var loginInput = wait.Until(driver => driver.FindElement(By.CssSelector("input[name='login']")));
            loginInput.Clear();
            Thread.Sleep(10.Seconds());
            loginInput.SendKeys("aaa@aaa.com");
            var passwordInput = wait.Until(driver => driver.FindElement(By.CssSelector("input[type='password']")));
            passwordInput.Clear();
            Thread.Sleep(10.Seconds());
            passwordInput.SendKeys("qwe123");
            var signinButton = wait.Until(driver => driver.FindElement(By.CssSelector("div[data-tid='btn-login']")));
            signinButton.Click();
            wait.Until(driver => driver.Title == "LK");
            var fullnameSpan = wait.Until(driver => driver.FindElement(By.CssSelector("span[data-test-id='fullname-label']")));
            fullnameSpan.Text.Should().Be("GridTest");
        });
}

@Doofus100500 (Contributor, Author)

> Can you try to set useCachedMetrics: false (or remove it) to see how it scales?

It scales to 84 pods.

@VietND96 (Member)

I guess if we increase the pollingInterval (e.g., to roughly how long a Node pod takes to come up, get ready, and register with the Hub), the number of replicas will be reduced too.

@VietND96 VietND96 pinned this issue Nov 14, 2024
@Doofus100500 (Contributor, Author)

I changed pollingInterval to the default (https://keda.sh/docs/2.15/reference/scaledobject-spec/#pollinginterval).
Now I'm getting 36 pods.
But I think this might not be the right approach; for some reason, it used to scale more accurately before.

@VietND96 (Member)

Did you also have a chance to try the ScaledJob behavior? Let me think about how to make deployment scaling more accurate.

@Doofus100500 (Contributor, Author)

I haven’t tried the ScaledJob behavior yet. I think it won’t be quick.

@Doofus100500 (Contributor, Author)

Doofus100500 commented Nov 19, 2024

@VietND96 Hi.
Are we sure we shouldn’t reopen this issue?

@VietND96 (Member)

Yes, I reopened it to keep tracking.

@VietND96 VietND96 reopened this Nov 22, 2024
@Doofus100500 (Contributor, Author)

Switched scaling to ScaledJob behavior. So far, everything seems great, I think I’ll stick with it.

@VietND96 (Member)

What do you think about the pros and cons of these 2 kinds of scaling types? I will probably collect some feedback and add it to the README to help people decide which should be used :)

@Doofus100500 (Contributor, Author)

So far, I've noticed that scaling with deployments provided more accurate results in terms of the number of pods, and also eliminated the startup delay, since when a session arrives the pod is already running.

@Doofus100500 (Contributor, Author)

Doofus100500 commented Nov 25, 2024

@VietND96 Hi! While working with jobs, I came up with the following approach that might be useful. Add the following to the video container:

lifecycle:
  preStop:
    exec:
      command:
      - bash
      - -c
      - "while pgrep -f 'java.*selenium' | grep -v $$; do sleep 5; done"

Then the video container will not stop before the browser container.

And we can add this to jobTargetRef:

activeDeadlineSeconds: 300

This will terminate erroneously created jobs.

@VietND96 (Member)

@Doofus100500, setting activeDeadlineSeconds for a ScaledJob is something I haven't tried. For people who have long test executions, I am afraid that config will terminate the Job abruptly (similar to the cooldown period in a ScaledObject).

@VietND96 (Member)

Btw, I raised a possible fix for the scaler via kedacore/keda#6368. In this repo, I will deliver patched KEDA images for fast preview and evaluation.

@JorTurFer

Hello,
I'm Jorge, one of KEDA's maintainers 😄

> What do you think about the cons and pros of these 2 kinds of scaling types?

The main difference is that with ScaledObjects, the pod's lifecycle is managed by the HPA controller, which can produce unexpected disruptions when a pod removed during scale-in is still executing a process.
On the other hand, ScaledJobs spawn multiple jobs to process the items in the queue, which handles scale-in better because KEDA basically doesn't kill the jobs: it waits until they finish, cleans up the finished jobs, and spawns more. This isn't free of charge; spawning a container per item in the queue usually takes more time than having a workload already prepared to process the incoming traffic. When a job finishes, KEDA has to spawn another job to handle more items from the queue, whereas a workload like a deployment is constantly processing items whenever it's free.

To mitigate the issues generated during scale-in with ScaledObjects, the suggestion for Selenium was to register a preStop hook that drains the node, so it finishes current jobs and takes no more. This approach is fine if the grace period is enough to safely drain the node.

@JorTurFer

> I'd expect that if the queue size is 50, it would add 50 pods, as it checks every 5s. Instead it adds a few pods only at the time and queue grows in size

This isn't totally correct. The scaling process has 2 phases: from/to zero, and 1↔N. For the from/to zero phase, pollingInterval is the execution period; for 1↔N scaling, the HPA controller is responsible for checking the value, at an interval configurable via admin parameters on the control plane (managed Kubernetes offerings don't expose this value, and the usual value is 15 seconds).

@miguel-cardoso-mindera

Hey @JorTurFer thanks for your comment.

My point still stands: regardless of the time it takes to check, the queue is never dealt with for hours. I never see it create as many pods as there are jobs in the queue. It slowly adds a couple of pods each time, and as the devs schedule tests over time, the queue keeps growing or stays around the same size.

@JorTurFer

JorTurFer commented Nov 26, 2024

I'm checking (KEDA's) scaler logic, because the value returned to the HPA probably isn't correct. I'm not sure about some of the logic there, but I need to debug it. My suspicion is that this statement isn't correct:
https://github.com/kedacore/keda/blob/29400ed2816fe7388a021ff14f5cb405b7deaa85/pkg/scalers/selenium_grid_scaler.go#L376-L380

It should return the total queue length (and we didn't notice it at the time, sorry for that :/), including the jobs in progress as well as the jobs pending, and I think that it's returning only the number of nodes to create. This can work quite accurately with ScaledJobs using the accurate strategy, but it's not the best. I'm checking.

@VietND96 (Member)

I'm referring to the GitHub runner scaler, which is similar to Selenium Grid in its use case. It also simply counts the queue length (with some extra comparison against available runners) and returns the queue size:
https://github.com/kedacore/keda/blob/29400ed2816fe7388a021ff14f5cb405b7deaa85/pkg/scalers/github_runner_scaler.go#L663C30-L663C52
For ScaledObject, I still suspect the condition in the method GetMetricsAndActivity() is the root cause.
For the ScaledJobs strategy, I recently updated the default value in the chart configs to eager (#2466, not released yet); refer to the example with incoming requests before/after poll (https://keda.sh/docs/2.16/reference/scaledjob-spec/#scalingstrategy).

@Doofus100500 (Contributor, Author)

> set activeDeadlineSeconds for scaled Job, which I haven't tried. For people who have a long test execution, I am afraid that config will terminate the Job suddenly (similar to cooldown period in ScaledObject)

I forgot to mention that in all of this, terminationGracePeriodSeconds = 3600.

@Doofus100500 (Contributor, Author)

> For ScaledJobs strategy, recently in chart configs, I updated the default value to eager

I tried switching the strategy to eager, but it resulted in a higher number of mistakenly created jobs when there was a large queue of sessions (my pollingInterval = 20).

@VietND96 (Member)

VietND96 commented Dec 2, 2024

I have discussed with @JorTurFer to understand what value the scaler should expose to KEDA: it should be the sum of the in-progress sessions plus the pending sessions. The PR kedacore/keda#6368 was updated accordingly.
I will publish the testing results (just small scale) via https://github.com/SeleniumHQ/docker-selenium/tree/trunk/.keda for reference.
For the ScaledJobs strategy: the default is expected to work correctly without mistakenly creating jobs.
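The effect of that change can be illustrated with a hypothetical sketch (function names are illustrative, not the actual Go code in the PR):

```python
# Illustrative comparison only: if the scaler reports just the pending queue,
# the HPA target undercounts the nodes that are already busy.
def metric_pending_only(ongoing: int, pending: int) -> int:
    return pending              # old behavior: queue length only

def metric_total(ongoing: int, pending: int) -> int:
    return ongoing + pending    # fixed: in-progress plus pending sessions

# 10 sessions running, 20 queued, per-node target of 1 session:
print(metric_pending_only(10, 20))  # 20 -> only 20 replicas, 10 short
print(metric_total(10, 20))         # 30 -> covers busy nodes and the queue
```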

@VietND96 (Member)

VietND96 commented Dec 3, 2024

Autoscaling test results have been published at https://github.com/SeleniumHQ/docker-selenium/tree/trunk/.keda

@VietND96 VietND96 closed this as completed Dec 4, 2024
@VietND96 (Member)

KEDA core 2.16.1 is out with all the fixes needed. You can try the latest chart and verify.
Note that in the scaler trigger params there are no longer default values of latest for browserVersion and linux for platformName. By default they are empty, letting the user input the values. The value should match across the session request, the Node stereotype, and the scaler trigger params to get correct scaling behavior. Read more details in PR kedacore/keda#6437.
