[occm] Multi region openstack cluster #2595
Hi @sergelogvinov. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@mdbooth, could you take a look at this PR? Thanks.
I haven't taken a deep look into this, but I much prefer this in principle: it only tells cloud-provider things we know to be true. This makes me much more confident that this will continue to work correctly as cloud-provider evolves.
Would you still run multiple CCMs, or switch to a single active CCM?
pkg/openstack/instancesv2.go
Outdated
	for _, region := range os.regions {
		opt := os.epOpts
		opt.Region = region

		compute[region], err = client.NewComputeV2(os.provider, opt)
		if err != nil {
			klog.Errorf("unable to access compute v2 API : %v", err)
			return nil, false
		}

		network[region], err = client.NewNetworkV2(os.provider, opt)
		if err != nil {
			klog.Errorf("unable to access network v2 API : %v", err)
			return nil, false
		}
@pierreprinetti how much work is performed when initialising a new service client? Is it local-only, or do we have to go back to keystone?
I might be inclined to initialise this lazily anyway, tbh.
Similar thought: maybe defer initialising them until they are actually used?
I ran into an issue with lazy initialization in Proxmox: a configured region may not exist, yet OCCM starts without errors during rollout, so the Kubernetes administrator will think the configuration is correct.
So we could check all regions here and crash if needed. What do you think?
Apologies for the late response. Building a ProviderClient requires a Keystone roundtrip; building ServiceClients is cheap.
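A minimal sketch of the eager, fail-fast option discussed in this thread: build one service client per configured region at start-up and stop on the first region that cannot be resolved. The serviceClient type, newClientFunc and fakeCatalog are hypothetical stand-ins for this illustration (in the PR the real calls are client.NewComputeV2 and client.NewNetworkV2); the sketch assumes, as stated above, that building service clients is cheap once the provider client exists.

```go
package main

import (
	"errors"
	"fmt"
)

// serviceClient stands in for a per-region compute or network client.
// Hypothetical type for this sketch, not a gophercloud type.
type serviceClient struct {
	Region string
}

// newClientFunc is a hypothetical constructor signature; the PR itself
// calls client.NewComputeV2 and client.NewNetworkV2 instead.
type newClientFunc func(region string) (*serviceClient, error)

// buildRegionClients constructs one client per region up front and returns
// the first error it hits, so a misconfigured region stops start-up.
func buildRegionClients(regions []string, newClient newClientFunc) (map[string]*serviceClient, error) {
	clients := make(map[string]*serviceClient, len(regions))
	for _, region := range regions {
		c, err := newClient(region)
		if err != nil {
			return nil, fmt.Errorf("region %q: %w", region, err)
		}
		clients[region] = c
	}
	return clients, nil
}

// fakeCatalog simulates a service catalog that only knows the given regions.
func fakeCatalog(known ...string) newClientFunc {
	return func(region string) (*serviceClient, error) {
		for _, k := range known {
			if k == region {
				return &serviceClient{Region: region}, nil
			}
		}
		return nil, errors.New("no endpoint found in service catalog")
	}
}

func main() {
	// REGION3 is configured but missing from the simulated catalog.
	_, err := buildRegionClients([]string{"REGION1", "REGION3"}, fakeCatalog("REGION1", "REGION2"))
	fmt.Println(err)
}
```

With lazy initialisation the same error would only surface on the first request for a node in REGION3, which is the rollout problem described above.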
pkg/openstack/instancesv2.go
Outdated
 	if node.Spec.ProviderID == "" {
 		return i.getInstanceByName(node)
 	}

 	instanceID, instanceRegion, err := instanceIDFromProviderID(node.Spec.ProviderID)
 	if err != nil {
-		return nil, err
+		return nil, "", err
 	}

 	if instanceRegion == "" {
 		return i.getInstanceByID(instanceID, node.Labels[v1.LabelTopologyRegion])
I'd probably be a bit more explicit with where we're looking for stuff here. IIUC there are 2 possible places we can get a specific region from:
- providerID
- LabelTopologyRegion
Both may be unset because the node has not yet been adopted by the node-controller.
providerID may not contain a region either because it was set before we became multi-region, or because it was set by kubelet without a region and it's immutable.
But the end result is that either we know the region or we don't. If we know the region we should look only in that region. If we don't know the region we should look everywhere.
How about logic something like:
	instanceID, instanceRegion, err := instanceIDFromProviderID(node.Spec.ProviderID)
	..err omitted...
	if instanceRegion == "" {
		instanceRegion = node.Labels[v1.LabelTopologyRegion]
	}
	var searchRegions []string
	if instanceRegion != "" {
		if !slices.Contains(i.regions, instanceRegion) {
			return ...bad region error...
		}
		searchRegions = []string{instanceRegion}
	} else {
		searchRegions = ..all the regions, preferred first...
	}
	for _, region := range searchRegions {
		mc := ...
		if instanceID != "" {
			getInstanceByID()
		} else {
			getInstanceByName()
		}
		mc.ObserveRequest()
	}
Thank you, this is a very good idea.
I've changed the implementation. One thought, though: I cannot trust LabelTopologyRegion. Something can change it, and node-lifecycle will then remove the node (for instance on a reboot/upgrade event)...
So I can use LabelTopologyRegion only as a preferred region, and check that region first.
Thanks.
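The selection logic discussed in this thread can be sketched as follows. This is an illustration only, not the code in the PR: the function name searchRegions and its parameters are hypothetical, and it follows the variant from the reply above, where a region taken from the providerID is authoritative and the topology label is only a hint about which region to try first.

```go
package main

import (
	"fmt"
	"slices"
)

// searchRegions returns the regions to query for an instance, in order.
// known is the region taken from the providerID ("" if absent), preferred
// is a hint such as the topology region label, and all is the configured
// region list. All three names are assumptions made for this sketch.
func searchRegions(known, preferred string, all []string) ([]string, error) {
	if known != "" {
		// The region is known: look there and nowhere else.
		if !slices.Contains(all, known) {
			return nil, fmt.Errorf("region %q is not configured", known)
		}
		return []string{known}, nil
	}

	// The region is unknown: look everywhere, preferred region first.
	ordered := make([]string, 0, len(all))
	if preferred != "" && slices.Contains(all, preferred) {
		ordered = append(ordered, preferred)
	}
	for _, r := range all {
		if r != preferred {
			ordered = append(ordered, r)
		}
	}
	return ordered, nil
}

func main() {
	all := []string{"REGION1", "REGION2", "REGION3"}

	regions, err := searchRegions("", "REGION2", all)
	fmt.Println(regions, err)

	regions, err = searchRegions("REGION3", "REGION2", all)
	fmt.Println(regions, err)
}
```

Because the label is treated as a hint, a wrong or stale label only changes the search order; it cannot cause a lookup to miss an instance that exists in another configured region.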
Hi @sergelogvinov |
Thank you for this PR, it is very interesting. Can we have a call/chat in slack #provider-openstack (Serge Logvinov)? |
/ok-to-test |
/ok-to-test |
Is there anything else we can do here? @jichenjc @mdbooth @kayrus
We had a conversation about how we need to initialize the OpenStack clients:
	for _, region := range os.regions {
		opt := os.epOpts
		opt.Region = region
		compute[region], err = client.NewComputeV2(os.provider, opt)
		if err != nil {
			klog.Errorf("unable to access compute v2 API : %v", err)
			return nil, false
		}
		network[region], err = client.NewNetworkV2(os.provider, opt)
		if err != nil {
			klog.Errorf("unable to access network v2 API : %v", err)
			return nil, false
		}
It seems to be a similar process to the one we followed in
	[Global]
	auth-url="https://auth.cloud.openstackcluster.region-default.local/v3"
	username="region-default-username"
	password="region-default-password"
	region="default"
	tenant-id="region-default-tenant-id"
	tenant-name="region-default-tenant-name"
	domain-name="Default"
	[Global "region-one"]
	auth-url="https://auth.cloud.openstackcluster.region-one.local/v3"
	username="region-one-username"
	password="region-one-password"
	region="one"
	tenant-id="region-one-tenant-id"
	tenant-name="region-one-tenant-name"
	domain-name="Default"
Thanks.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
/test openstack-cloud-controller-manager-e2e-test |
Thank you @kayrus for refactoring of instance.go (instance-v2). Whenever you have a moment, could you please take a look at this? |
@sergelogvinov hm, I have doubts about this PR so far. Some raw thoughts:
Sorry, I don't have a multicloud/multiregion setup, and the use case is not really clear for me. |
Thank you @kayrus for reviewing my PR.
I completely agree with your point about hybrid/multi-cluster setups (bare metal + OpenStack + AWS + GCP, etc.). This should be implemented in https://github.com/kubernetes/cloud-provider first. I hope it will be added someday.
Our case is a bit different. Imagine an OpenStack setup with only one Keystone endpoint but multiple regions for services like Nova, Neutron, Cinder, and Glance. Each region has one availability zone called "nova" (the default installation). I think this type of setup is easier to manage and upgrade, and I've seen similar setups used by many cloud providers.
So, in this case, these are not fully separate OpenStack clusters. It looks more like one region with many availability zones, similar to how well-known cloud providers organize their systems. That's why this PR supports only one Keystone endpoint. Using this endpoint, we can get the Nova/Neutron endpoints for each region.
This feature will be introduced as an alpha feature and can be enabled using a CLI flag. Other components like the load balancer, cinder-csi and manila-csi will also need updates. However, we need to focus on supporting the cloud-node and cloud-node-lifecycle controllers first.
PS. I think I forgot to check backward compatibility with non-regional OpenStack setups. I will check this soon.
Currently, it supports only a single auth section. Set the regions in the config as:
	[Global]
	region=REGION1
	regions=REGION1
	regions=REGION2
	regions=REGION3
I’ve updated the documentation and verified the setup. It works well both with OS_CCM_REGIONAL=true enabled and without it. Could you please let me know if the documentation is clear or if there’s anything that needs improvement? Thank you! @kayrus |
Once everything else is clarified, it may be nice to have a release note |
Hi, @MatthieuFin Could you please take a look at this, since you know hybrid cloud? |
Hi,
To handle node lifecycle with OCCM, I deploy one OCCM DaemonSet per OpenStack cluster (with env variable
I'll try to allow some time to test your implementation and see how I could adapt it to handle different Keystones, but since this is not a blocking point for me, I don't promise anything.
My next needed improvement will probably be the ability to select which OpenStack cloud should be used to provide a LoadBalancer Service, probably based on annotations or on a class name. I haven't looked into this yet (currently all my LoadBalancer Services are provided by a single OpenStack cluster, with the LB feature enabled on the corresponding OCCM and disabled on the others).
What this PR does / why we need it:
Multi-region support for the OpenStack CCM, for deployments that have a single identity provider.
The OpenStack cluster includes a single Keystone service and multiple Nova, Cinder, and Neutron services grouped by region.
Which issue this PR fixes(if applicable):
fixes #1924
Special notes for reviewers:
CCM config changes:
The region parameter is required (as before) and is used as the default region for the cluster. The regions parameter can be set multiple times; its values are merged with the region parameter, so the value of region may or may not appear in the list of defined regions.
	[Global]
	auth-url=https://auth.openstack.example.com/v3/
	region=REGION1
	# new param 'regions' can be specified multiple times
	regions=REGION1
	regions=REGION2
	regions=REGION3
The regions parameter is optional in cloud.conf, and only one auth service (Keystone) is supported.
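One way the merge of region and regions could work is sketched below. The function name mergeRegions is hypothetical, and the ordering (default region first) and the de-duplication are assumptions of this sketch, not behaviour confirmed by the PR.

```go
package main

import "fmt"

// mergeRegions returns the default region followed by any additional
// regions, skipping empty entries and duplicates.
// Assumption: the default region is placed first in the merged list.
func mergeRegions(defaultRegion string, extra []string) []string {
	seen := make(map[string]bool, len(extra)+1)
	merged := make([]string, 0, len(extra)+1)
	for _, r := range append([]string{defaultRegion}, extra...) {
		if r == "" || seen[r] {
			continue
		}
		seen[r] = true
		merged = append(merged, r)
	}
	return merged
}

func main() {
	// region=REGION1 plus three regions= entries, as in the config above.
	fmt.Println(mergeRegions("REGION1", []string{"REGION1", "REGION2", "REGION3"}))
	// prints: [REGION1 REGION2 REGION3]
}
```

With this approach a config that sets only region yields a one-element list, which would keep single-region deployments unchanged.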
During the initialization process, OCCM checks for the existence of providerID. If providerID does not exist, it defaults to using node.name, as it did previously. Additionally, if the node has the label topology.kubernetes.io/region, OCCM will prioritize using this region as the first one to check. This approach ensures that in the event of a region outage, OCCM can continue to function.
In addition, we can assist CCM in locating the node by providing kubelet parameters:
- --provider-id=openstack:///$InstanceID - InstanceID exists in metadata
- --provider-id=openstack://$REGION/$InstanceID - if you can define the region (by default the metadata server does not have this information)
- --node-labels=topology.kubernetes.io/region=$REGION - set the preferred REGION in a label; OCCM will then prioritize searching for the node in this region
The OCCM sets ProviderID:
OCCM with multiple regions can work with or without env.OS_CCM_REGIONAL=true
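The two providerID forms listed above (openstack:///$InstanceID and openstack://$REGION/$InstanceID) could be split as in the sketch below. parseProviderID is a hypothetical helper written for this illustration; it is not the instanceIDFromProviderID function from the PR, and it handles only those two forms.

```go
package main

import (
	"fmt"
	"strings"
)

// providerPrefix is the scheme used by both providerID forms.
const providerPrefix = "openstack://"

// parseProviderID splits a providerID into instance ID and region.
// The region is "" for the openstack:///INSTANCE_ID form.
func parseProviderID(providerID string) (instanceID, region string, err error) {
	rest, ok := strings.CutPrefix(providerID, providerPrefix)
	if !ok {
		return "", "", fmt.Errorf("providerID %q does not start with %q", providerID, providerPrefix)
	}
	// rest is "/INSTANCE_ID" or "REGION/INSTANCE_ID".
	region, instanceID, ok = strings.Cut(rest, "/")
	if !ok || instanceID == "" || strings.Contains(instanceID, "/") {
		return "", "", fmt.Errorf("providerID %q is not in the form openstack://[REGION]/INSTANCE_ID", providerID)
	}
	return instanceID, region, nil
}

func main() {
	for _, id := range []string{
		"openstack:///11111111-2222-3333-4444-555555555555",
		"openstack://REGION1/11111111-2222-3333-4444-555555555555",
	} {
		instanceID, region, err := parseProviderID(id)
		fmt.Printf("instance=%s region=%q err=%v\n", instanceID, region, err)
	}
}
```

An empty region from the first form is the case where the lookup has to fall back to the preferred-region label or to searching every configured region.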
Release note: