Keep retrying to register node controller with RegistryServer #184
Conversation
good improvement!
Confirmed, works fine in my "make test" cluster
@pohly is it good enough to merge?
if err := pmemd.registerNodeController(); err != nil {
	return err
}
go pmemd.registerNodeController()
This allows the driver to go into a "ready" state (= accepting CSI calls) before it has successfully registered. Is that what we want? We don't have readiness probes at the moment, but if we had something based on the driver responding to CSI requests, then this would incorrectly report "ready" when in reality the driver is still starting up.
It's just a gut feeling, but I think I prefer calling pmemd.registerNodeController() in blocking mode.
Also, what happens if registration fails? Does the driver now keep running despite the failure?
Also, what happens if registration fails? Does the driver now keep running despite the failure?
Yes, I would keep retrying. In reality, the only failure path is missing arguments, and both are validated on the client side.
But I agree, it should fail on a real error case. I have fixed this now.
This allows the driver to go into a "ready" state (= accepting CSI calls) before it has successfully registered. Is that what we want? We don't have readiness probes at the moment, but if we had something based on the driver responding to CSI requests, then this would incorrectly report "ready" when in reality the driver is still starting up.
It's just a gut feeling, but I think I prefer calling pmemd.registerNodeController() in blocking mode.
Moved this to a blocking call.
I had a closer look at how pmemd.registerNodeController is called. Making it a blocking call is not enough; it also needs to be moved above the s.Start calls, otherwise the gRPC server is already accepting incoming requests while the registration is still pending.
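To make the suggestion concrete, a minimal sketch of that ordering; the s.Start signature, the endpoint variable, and the error handling here are illustrative, not the driver's actual code:

```go
// Register first, blocking, so that a registration failure surfaces
// before the driver serves anything.
if err := pmemd.registerNodeController(); err != nil {
	return err
}

// Only then start the gRPC server, so that incoming CSI requests are
// not accepted while registration is still pending.
if err := s.Start(endpoint); err != nil {
	return err
}
```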
Sorry if my earlier comment sounded like "node registration failure can be ignored". That was not my intention. It was in the context of your suggestion to move pmemd.registerNodeController() before starting the gRPC {identity, controller, node} servers. If we start the controller after registration, any requests to this node controller made by the master from its registration handler might fail. So we should have the ControllerServer running by the time we call registry.RegisterController().
Starting the identity and node servers prior to registration allows the driver to be registered with kubelet on the node before it is registered with the registry. Does this lead to any issues? I couldn't see any.
And after reading your comments, I feel we should not keep retrying to connect forever; fail after maybe n tries?
If we start the controller after registration, any requests to this node controller made by the master from its registration handler might fail.
No, they don't. The listening socket is open, so connections from the master get accepted and queued. The response will be generated as soon as the driver continues from "registering myself" to "serving requests".
Starting the identity and node servers prior to registration allows the driver to be registered with kubelet on the node before it is registered with the registry. Does this lead to any issues? I couldn't see any.
My idea was to not create the Unix domain csi.sock and thus indicate to the sidecars that the driver is not ready yet. However, that alone doesn't help an admin, because the sidecars will be running. What we need is a "readiness probe" sidecar. I'll ask in the CSI standup meeting why they implemented a "liveness probe" sidecar, but not a "readiness probe". IMHO "readiness" is more important.
And after reading your comments, I feel we should not keep retrying to connect forever; fail after maybe n tries?
I still think trying to connect forever is the right approach, in particular if an admin can tell that the driver is still initializing (see "readiness" above).
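For reference, a minimal sketch of the pattern described above, where the listener already exists before registration so that early connections sit in the kernel backlog until Serve() runs; the function name, endpoint, and registration callback are illustrative, not this driver's code:

```go
import (
	"net"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

func startControllerServer(endpoint string, cs csi.ControllerServer, register func() error) error {
	// Create the listening socket first: from this point on, clients can
	// connect, and their connections are queued by the kernel backlog.
	lis, err := net.Listen("unix", endpoint)
	if err != nil {
		return err
	}
	srv := grpc.NewServer()
	csi.RegisterControllerServer(srv, cs)

	// Register with the registry while connections queue up; queued
	// requests are only answered once Serve() below starts handling them.
	if err := register(); err != nil {
		return err
	}
	return srv.Serve(lis)
}
```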
No, they don't. The listening socket is open,
The listening Endpoint socket is only opened by the Start() call; without calling it there is no listening socket, so if the master controller tries to connect from its registration handler to the "Endpoint" we asked it to register, that might fail?
Forget what I said then. I was thinking of "Start" as "start gRPC server" because that is how it is done elsewhere (if I remember correctly), but here it is indeed different.
We probably need to revisit this entire "bring up pmem-csi" topic sometime in the future, when considering what it means for pmem-csi to be "ready" and what the different failure scenarios could be.
OK, I did the best we can do with the current code:
- Start the ControllerServer.
- Initiate registration; exit on failure.
- Start the identity and node servers.
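A minimal sketch of that bring-up order; the endpoint names and the Start helpers are illustrative placeholders for the driver's actual server wrappers:

```go
// 1. Start the ControllerServer so that the master's registration handler
//    can reach this node as soon as registration completes.
if err := controllerServer.Start(pmemd.cfg.ControllerEndpoint); err != nil {
	return err
}

// 2. Register with the registry server, blocking; exit on failure.
if err := pmemd.registerNodeController(); err != nil {
	return err
}

// 3. Only now start the identity and node servers on csi.sock, which makes
//    the driver visible to kubelet and the sidecars.
return csiServer.Start(pmemd.cfg.Endpoint, identityServer, nodeServer)
```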
This is useful for the caller to handle the error cases.
Signed-off-by: Amarnath Valluri <amarnath.valluri@intel.com>
force-pushed from dd5500a to a686d9a
force-pushed from a686d9a to 15ccdc6
if _, err = client.RegisterController(ctx, &req); err != nil {
	return fmt.Errorf("Fail to register with server at '%s' : %s", pmemd.cfg.RegistryEndpoint, err.Error())
for {
Still problematic. Consider what happens after 10 minutes: all gRPC calls will immediately time out and the for loop will keep running forever.
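For illustration, a minimal sketch of the problematic pattern (the ten-minute figure comes from the comment above; client, req, and the retry delay are assumptions taken from the surrounding diff):

```go
// One context with a deadline, shared by every retry attempt.
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
defer cancel()

for {
	if _, err := client.RegisterController(ctx, &req); err != nil {
		// Once the deadline has expired, this call (and every later one)
		// fails immediately with DeadlineExceeded, so the loop keeps
		// running forever without any chance of succeeding.
		time.Sleep(10 * time.Second)
		continue
	}
	break
}
```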
Yes, I agree. What about using a context without a timeout, i.e. WithCancel()?
Fixed using a WithCancel() context.
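A minimal sketch of the WithCancel() variant, assumed to run inside a function that returns an error (retry delay and logging are illustrative, not the exact code in this PR):

```go
// No deadline on the context, so registration can be retried indefinitely;
// cancel() still lets the caller abort the loop on shutdown.
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

for {
	_, err := client.RegisterController(ctx, &req)
	if err == nil {
		break // registered successfully
	}
	if ctx.Err() != nil {
		return ctx.Err() // the caller cancelled, stop retrying
	}
	fmt.Printf("registration with %s failed, retrying: %v\n",
		pmemd.cfg.RegistryEndpoint, err)
	time.Sleep(10 * time.Second)
}
```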
force-pushed from 15ccdc6 to ea3a44f
Instead of exiting with an error, the driver has to wait and retry 'node controller'
registration until the registry server is up and the registration succeeds.
FIXES: #182
Signed-off-by: Amarnath Valluri <amarnath.valluri@intel.com>