Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scaling Orchestration #146

Open
meowveen opened this issue Dec 16, 2019 · 6 comments
Open

Scaling Orchestration #146

meowveen opened this issue Dec 16, 2019 · 6 comments

Comments

@meowveen
Copy link

Hi,

I have a question regarding scaling the parallelism value.

For now, we are doing the manual way, whereby we edit the FlinkApplication YAML to change the parallelism value, apply the YAML, and let the operator transition from the old job cluster to the newly created job cluster.

Is there any recommended way to automatically orchestrate the above, similar to how HPA works in K8S, whereby HPA will kick in once the metrics exceed a certain value?

@anandswaminathan
Copy link
Contributor

I am not aware of something that is available (or straightforward) to achieve this. We have wanted to build some functionality like this in the Operator, we are yet to design it.

@jirislav
Copy link

jirislav commented Feb 24, 2021

Hi, we would also be interested into an auto-scaling feature of the operator, since we often need to solve the problem of too low resources, resulting in high back-pressure and a need for a manual action.

Do you have some design yet? Or would you prefer to collaborate on this design?


One idea is to compare Kafka message timestamp (or custom message payload timestamp) with current time, which has some obvious drawbacks, of course, but as an MVP I think it's excellent, as it would be optional.

  • the flink application would have to publish a metric, which computes the time difference
  • the k8s operator would just fetch this metric and would up-scale the deployment if needed
  • another issue is that it would have no way of knowing when to down-scale, but as an MVP, it's good (note that in our environment, the amount of processed messages consistently increases, so in our case, up-scale only auto-scale would be much appreciated)

@JRemitz
Copy link

JRemitz commented Aug 24, 2021

It seems this thread is out of date. Since May, 2021, the Flink team has introduced Reactive Mode with Flink v1.13. This supports horizontal pod autoscaling as long as it is supported within your Kubernetes cluster (metrics-server deployed as well).

If I'm thinking through this correctly, we already have access to the Flink configuration updates to enable reactive mode:

scheduler-mode: reactive

and by deploying a container with Flink v1.13+. The only additional need would be to include HPA with the Operator when it creates the necessary resources, and any rbac associated with creating that.

I can test appending HPA on after the fact, I haven't yet. But this should be enough in theory to get this working unless there are other Operator considerations that need to be made for managing the deployment with HPA.

More info on Reactive Mode.

@anandswaminathan
Copy link
Contributor

@JRemitz Sounds good. I guess there is one scenario, the deployment object created by the operator will need min/max. no ?

@JRemitz
Copy link

JRemitz commented Aug 25, 2021

Yes, really the entire HPA definition for CPU/memory rules. I'm not overly familiar with the CRD structure. Can you reference existing API objects as complex structures for validation?

@anandswaminathan
Copy link
Contributor

Check this PR for reference: https://github.com/lyft/flinkk8soperator/pull/233/files

https://github.com/lyft/flinkk8soperator/blob/master/deploy/crd.yaml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants