-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added min_tickets to the bulkheads #248
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main thought is on the need for min_tickets
, especially if one is expecting a rather homogeneous setup (in terms of number of workers on a machine). In that case, couldn't one ensure that quota
is set such that you wouldn't need a min_tickets
?
For example, suppose we have 10 workers, quota
set to 0.1, and min_tickets
set to 2. Why would we use min_tickets
over setting quota
to 0.2?
[EDIT]
On the first read, I missed "small worker counts". Do we expect heterogenous setups (in terms of worker counts)? If so, similar question: why not ensure at least one ticket if the worker count is that low?
|
||
for id in $IPCS_Q; do | ||
ipcrm -q $id; | ||
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also tack on an xargs
for all of the above:
ipcs -s | egrep "0x[0-9a-f]+ [0-9]+" | grep $ME | cut -f2 -d" " | xargs -n1 ipcrm -s
ipcs -m | egrep "0x[0-9a-f]+ [0-9]+" | grep $ME | cut -f2 -d" " | xargs -n1 ipcrm -m
ipcs -q | egrep "0x[0-9a-f]+ [0-9]+" | grep $ME | cut -f2 -d" " | xargs -n1 ipcrm -q
I don't think we can assume anything, but for Shopify Core, we expect pretty homogeneous setups at steady state (except a brief setup period).
Even in heterogenous setups, we still guarantee at least one ticket, because of the @thegedge: Did I understand your question correctly? |
Yep, so I guess my question is clarification on why we need |
This was driven by the comment referenced in the description, which was authored by @dalehamel in https://github.com/Shopify/shopify/commit/35045a4ab3833c43d36f77b1aac4aadbba152646. Maybe he can chime in? Since it's 2+ years old, it predates our move to the cloud, so it might be irrelevant now. |
@michaelkipper a little backstory The quota functionality was actually added because of our move to kubernetes (and thus the cloud). In our data center, we could reliably and predictably determine how many workers would be allocated per host, so tickets being configured statically made sense. With kubernetes making the orchestration decisions, we wouldn't be able to effectively use static tickets anymore. Eg, if we set 10 tickets for a resource per host before, but in the new model end up with only one worker, it will never actually trip a bulkhead. Likewise, if we ended up with only 10 tickets and 60 workers, there might be too much contention. The move to a quota of 0.5 is something i've pondered a lot. I figured that it must get out of date at some point, as we continue to scale, our bulkheads become less and less effective. I'm not convinced that the quota strategy is a sustainable one. You're correct that we always need at least one ticket, in which case the bulkhead is essentially ineffective as there is only one worker, it should always be able to get a ticket. As to the actual math to tune the quota ratio, I imagine we'd want to do it as some ratio of the number of workers to the resource globally across the whole fleet, and with any luck the gods of statistics will favor us, and the workloads will be distributed in an effective way across the fleet. I suspect that a quota that is near 0.5 will probably be ineffective, we likely want something much lower in many cases now, as our scale has increased by some number of factors since it was originally set. At the time, the 0.5 quota wasn't too dissimilar from many of the static ticket counts we had set. |
694b187
to
09d5c03
Compare
What
This PR adds a
min_tickets
argument to the Semian bulkheads.Why
At Shopify, our Semian configuration for MySQL sets
quota
to0.51
and has this comment:In a multiple-outage scenario, this means that we can lose 100% of capacity for the duration of a single client timeout if those resources become unhealthy at roughly the same time (which was the case we saw in a recent service disruption).
We'd like to reduce that 51% to something else, without losing causing the bulkheads to become overly sensitive at small worker counts. At Shopify, our core application runs with 16 workers, so this is likely not an issue except at startup, but still needs a proper solution.
How
We compute the number of available tickets as the minimum of the number of workers, or the desired ticket count (which itself is the maximum of the computed ticket count and the
min_tickets
argument).TODO: Documentation.
cc: @Shopify/servcomm