Skip to content
This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

Keep retry registering node controller with RegistryServer #184

Merged
merged 2 commits into from
Mar 11, 2019

Conversation

avalluri
Copy link
Contributor

@avalluri avalluri commented Mar 4, 2019

Instead of exiting with error, driver has to wait and retry 'node controller'
registration till the registry server up and registration get succeed.

FIXES: #182

Signed-off-by: Amarnath Valluri amarnath.valluri@intel.com

@okartau
Copy link
Contributor

okartau commented Mar 4, 2019

good improvement!
retrytimeout was just not used and was set again to 10 sec it seems.. good fix as well.

@okartau
Copy link
Contributor

okartau commented Mar 4, 2019

Confirmed, works fine in my "make test" cluster

@avalluri
Copy link
Contributor Author

avalluri commented Mar 5, 2019

@pohly is it good enough to merge?

pkg/pmem-csi-driver/pmem-csi-driver.go Show resolved Hide resolved
if err := pmemd.registerNodeController(); err != nil {
return err
}
go pmemd.registerNodeController()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows the driver to go into a "ready" state (= accepting CSI calls) before it has successfully registered. Is that what we want? We don't have readiness probes at the moment, but if we had something based on the driver responding to CSI requests, then this would incorrectly report "ready" when in reality the driver is still starting up.

It's just a gut feeling, but I think I prefer calling pmemd.registerNodeController() in blocking mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what happens if registration fails? Does the driver now keep running despite the failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what happens if registration fails? Does the driver now keep running despite the failure?

Yes, i would keep retry. In reality, the only valid path is missing arguments, both are validated at client side.

But, i agree, it should fail on real error case. Now i fixed this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows the driver to go into a "ready" state (= accepting CSI calls) before it has successfully registered. Is that what we want? We don't have readiness probes at the moment, but if we had something based on the driver responding to CSI requests, then this would incorrectly report "ready" when in reality the driver is still starting up.

It's just a gut feeling, but I think I prefer calling pmemd.registerNodeController() in blocking mode.

Moved this to blocking call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a closer look at how pmemd.registerNodeController is called. Making it a blocking call is not enough, it also needs to be moved above the s.Start calls, otherwise the gRPC server is already accepting incoming requests while the registration is still pending.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if my earlier comment sounded like "node registration failure can be ignored". That was not my intention. It was in the context of your suggestion to move pmemd.registerNodeController() before starting gRPC {identity, controller, node}servers. If we start controller after registration, Any requests to this node controller made by Master from its registration handler might fail. So, we should start ControllerServer by the time we call registry.RegisterController()

Starting identity and node servers prior to registration, which allows to register the driver at node with kubelet, prior to registration. Does this leads to issues? I couldn't see any.

And after reading your comments, i felt like we should not keep retry to connect forever, fail after may be n-tries?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we start controller after registration, Any requests to this node controller made by Master from its registration handler might fail.

No, they don't. The listening socket is open, so connections from master get accepted and queued. The response will be generated as soon as the driver continues from "registering myself" to "serving requests".

Starting identity and node servers prior to registration, which allows to register the driver at node with kubelet, prior to registration. Does this leads to issues? I couldn't see any.

My idea was to not create the Unix domain csi.sock and thus indicating to sidecars that the driver is not ready yet. However, that alone doesn't help an admin, because the sidecars will be running. What we need is a "readiness probe" sidecar. I'll ask in the CSI standup meeting why they implemented a "liveness probe" sidecar, but not a "readiness probe". IMHO "readiness" is more important.

And after reading your comments, i felt like we should not keep retry to connect forever, fail after may be n-tries?

I still think trying to connect forever is the right approach, in particular if an admin can tell that the driver is still initializing (see "readiness" above).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they don't. The listening socket is open,

Listening Endpoint socket is only open in Start() call, without calling that no listening socket and if Master controller tries to connect to "Endpoint" we asked to register from its registration handler might fail?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forget what I said then. I was thinking of "Start" as "start gRPC server" because that is how it is done elsewhere (if I remember correctly), but here it is indeed different.

We probably need to revisit this entire "bring up pmem-csi" topic sometime on the future when considering what it means for pmem-csi to be "ready" and what different failure scenarios could be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i did what the best we can do with current code:

  • Start ControllerServer
  • Initiate Register, exit if fail.
  • Start identity and node servers.

This is useful for the caller to handle the error cases.

Signed-off-by: Amarnath Valluri <amarnath.valluri@intel.com>
@avalluri avalluri force-pushed the retry-registration branch from dd5500a to a686d9a Compare March 6, 2019 13:17
pkg/pmem-csi-driver/pmem-csi-driver.go Outdated Show resolved Hide resolved
pkg/pmem-csi-driver/pmem-csi-driver.go Outdated Show resolved Hide resolved
@avalluri avalluri force-pushed the retry-registration branch from a686d9a to 15ccdc6 Compare March 7, 2019 08:31

if _, err = client.RegisterController(ctx, &req); err != nil {
return fmt.Errorf("Fail to register with server at '%s' : %s", pmemd.cfg.RegistryEndpoint, err.Error())
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still problematic. Consider what happens after 10 minutes: all gRPC calls will immediately time out and the for loop will keep running forever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, i agree. What about using no timeout context - WithCancel().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed usingWithCancel()context

Instead of exiting with error, driver has to wait and retry 'node controller'
registration till the registry server up and registration get succeed.

FIXES: intel#182

Signed-off-by: Amarnath Valluri <amarnath.valluri@intel.com>
@avalluri avalluri force-pushed the retry-registration branch from 15ccdc6 to ea3a44f Compare March 8, 2019 14:07
@avalluri avalluri requested a review from pohly March 8, 2019 14:09
@avalluri avalluri merged commit 586ae28 into intel:devel Mar 11, 2019
@avalluri avalluri deleted the retry-registration branch March 15, 2019 11:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants