[🐛 Bug]: Issue with Scaling Already Scaled Deployments #2464
I am not sure if this is a scaler logic issue or KEDA core behavior.
I used the data to test scaler logic, and the count was returned correctly.
So, it is scaling with the formula 10 x 10 ~= 99 instances?
Apparently, yes.
I'm still getting 99, but it scales smoothly =)
Yes, but it seems not optimized. Let me find whether any configs are needed for ScaledObject.
Can you try to set
Unfortunately, I can’t share it directly as it accesses an internal resource, but there’s nothing overly complex in it. Here it is (I removed sensitive information):
I guess if we increase the
Change pollingInterval to the default: https://keda.sh/docs/2.15/reference/scaledobject-spec/#pollinginterval
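For reference, a minimal sketch showing where pollingInterval sits in a ScaledObject spec (resource names and the hub URL below are placeholders; when the field is omitted, KEDA falls back to its default of 30 seconds):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-node-chrome            # placeholder name
spec:
  pollingInterval: 30                   # the default value being suggested here
  scaleTargetRef:
    name: selenium-node-chrome          # placeholder Deployment name
  triggers:
    - type: selenium-grid
      metadata:
        url: http://selenium-hub:4444/graphql   # placeholder in-cluster GraphQL URL
        browserName: chrome
```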
Did you also have a chance to try the ScaledJob behavior? Let me think about how to make deployment scaling more accurate.
I haven't tried the ScaledJob behavior yet. I think it won't be quick.
@VietND96 Hi.
Yes, I reopened it to keep tracking.
Switched scaling to ScaledJob behavior. So far, everything seems great; I think I'll stick with it.
What do you think about the pros and cons of these two scaling types? I will probably collect some feedback and add it to the README to convince people which should be used :)
So far, I've noticed that scaling with deployments gave more accurate results in terms of the number of pods, and it eliminated the startup delay: when a session arrives, the pod is already running.
@VietND96 Hi! While working with jobs, I came up with the following approach that might be useful. Add the following to the video container:
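Roughly, a sketch of such a hook could look like this (assuming the node container listens on the default port 5555 and that curl is available in the video image; the exact snippet may differ):

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          # Keep the video container alive while the node container is still serving;
          # localhost works because containers in a pod share the network namespace.
          while curl -sf http://localhost:5555/status > /dev/null; do
            sleep 1
          done
```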
Then the video container will not stop before the browser container does. And we can add this to jobTargetRef:
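One field that produces the described effect is activeDeadlineSeconds on the Job spec; a sketch with a placeholder value (an assumption, not necessarily the exact field used here):

```yaml
jobTargetRef:
  activeDeadlineSeconds: 3600   # hypothetical value: kill the job if it runs longer than this
```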
This will terminate erroneously created jobs.
@Doofus100500, set
Btw, I raised a possible fix for the scaler via kedacore/keda#6368. In this repo, I will deliver patched KEDA images for fast preview and evaluation.
Hello,
The main difference is that with ScaledObjects the pod's lifecycle is managed by the HPA controller, which can produce unexpected disruptions when a pod removed during scale-in is still executing a process. To mitigate the issues generated during scale-in with ScaledObjects, the suggestion with Selenium was to register a preStop hook that drains the node so it finishes its current jobs and does not take new ones. This approach is fine if the grace period is enough to safely drain the node.
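For illustration, a minimal sketch of such a preStop drain hook, assuming the node's default port 5555, the /se/grid/node/drain endpoint, and a registration secret exposed as SE_REGISTRATION_SECRET (adjust to your deployment):

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          # Tell the node to drain: finish the running session and accept no new ones.
          curl -sf -X POST http://localhost:5555/se/grid/node/drain \
            -H "X-REGISTRATION-SECRET: ${SE_REGISTRATION_SECRET}"
          # Then wait until the node actually stops answering, bounded by
          # terminationGracePeriodSeconds on the pod.
          while curl -sf http://localhost:5555/status > /dev/null; do
            sleep 1
          done
```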
This isn't totally correct. The scaling process has 2 phases: from/to zero, and 1-to-N-to-1. For the scaling from/to zero phase,
Hey @JorTurFer, thanks for your comment. My point still stands: regardless of the time it takes to check, the queue is never dealt with for hours. I never see it create as many pods as there are jobs in the queue. It slowly adds a couple of pods each time, and as the devs schedule tests over time, the queue keeps growing or stays around the same size.
I'm checking (KEDA's) scaler logic, because the value returned to the HPA probably isn't correct; I'm not sure about some logic there, but I need to debug it. My suspicion is that this statement isn't correct: it should return the total queue length (and we didn't notice it at that moment, sorry for that :/), including the jobs in progress and also the jobs pending, and I think that it's returning only the number of nodes to create. This can work quite accurately with ScaledJobs using
I'm referring to the GitHub runner scaler, which is similar to Selenium Grid in its use case. It also simply counts the queue length (with some extra comparison against the runners that are available) and returns the queue size.
I forgot to mention that in all of this, terminationGracePeriodSeconds = 3600.
I tried switching the strategy to eager, but it resulted in a higher number of mistakenly created jobs when there was a large queue of sessions (my pollingInterval = 20).
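For context, a sketch of the kind of ScaledJob being described, assuming a plain KEDA ScaledJob manifest; only pollingInterval, the strategy, and the image tag come from this thread, everything else is a placeholder:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: selenium-node-chrome                  # placeholder name
spec:
  pollingInterval: 20                         # value mentioned above
  scalingStrategy:
    strategy: eager                           # the strategy being compared against the default
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: selenium-node-chrome
            image: selenium/node-chrome:4.26.0-20241101   # image tag from this issue
  triggers:
    - type: selenium-grid
      metadata:
        url: http://selenium-hub:4444/graphql             # placeholder in-cluster URL
        browserName: chrome
```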
I have discussed with @JorTurFer what value the scaler should expose to KEDA: it should be the sum of the in-progress sessions plus the pending sessions. The PR kedacore/keda#6368 was updated accordingly.
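For the scenario reported in this issue, that gives 10 in-progress + 20 pending = 30, so the HPA would be asked for 30 replicas instead of leaving 10 sessions waiting in the queue.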
Autoscaling test results are updated to https://github.com/SeleniumHQ/docker-selenium/tree/trunk/.keda |
KEDA core 2.16.1 is out with all the fixes needed. You can try the latest chart and verify. |
What happened?
Hi, thank you so much for refactoring the grid scaler; it's really awesome. However, I'm noticing issues with scaling already scaled deployments. Here's an example:
I request 10 sessions, and 10 pods with browsers are launched. Then I request an additional 20 sessions, but only 10 more pods are added to the existing ones. This results in 10 sessions remaining in the queue until the previous sessions are completed.
I think the scaling logic should work as follows: if there are already 10 pods running, then when 20 more sessions are queued, the total should scale up to 30 pods.
keda version: v2.16.0
Command used to start Selenium Grid with Docker (or Kubernetes)
Relevant log output
Operating System
k8s
Docker Selenium version (image tag)
4.26.0-20241101
Selenium Grid chart version (chart version)
0.37.1