Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrading from 1.7.2 to 1.10.0 causes connections to hang #1899

Closed
stevvooe opened this issue Mar 6, 2018 · 20 comments · Fixed by #1943
Closed

Upgrading from 1.7.2 to 1.10.0 causes connections to hang #1899

stevvooe opened this issue Mar 6, 2018 · 20 comments · Fixed by #1943

Comments

@stevvooe
Copy link
Contributor

stevvooe commented Mar 6, 2018

Please answer these questions before submitting your issue.

What version of gRPC are you using?

Was using 1.7.2 and now using 1.10.0

What version of Go are you using (go version)?

1.10

What operating system (Linux, Windows, …) and version?

Confirmed on Ubuntu 16.04, 17.10 and Amazon Linux.

What did you do?

We upgraded from 1.7.2 to 1.10 grpc package (see containerd/containerd#2186). The code connects with the following https://github.com/containerd/containerd/blob/master/client.go#L71. This is typically over a unix socket.

Details are here: containerd/containerd#2185.

What did you expect to see?

It should not hang.

What did you see instead?

It hangs until there is a timeout. Here is the user experience with a containerd client:

~/g/s/g/c/containerd ❯❯❯ sudo ctr images pull docker.io/library/wordpress:latest                                                                                                                                                                                                                                   master ✭ ◼
ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded
@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 6, 2018

@MakMukhi Let me know if there is further troubleshooting required to help you isolate the issue.

@dfawley
Copy link
Member

dfawley commented Mar 6, 2018

My guess is this is fixed by #1889, since you are dialing "unix://..." (via dialer.DialAddress). Can you test with that PR included, and we can cherry-pick it into a 1.10.1 release if needed?

@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 7, 2018

@dfawley I gave that fix a try and am still seeing the client hang and timeout. Is it possible other changes are needed for this to be a full fix, as well?

@dfawley
Copy link
Member

dfawley commented Mar 7, 2018

@stevvooe can you try turning on info logging and see if anything useful comes out? Thanks!

GRPC_GO_LOG_SEVERITY_LEVEL=INFO (unless you have a custom logger installed).

@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 7, 2018

@dfawley It looks to be blocking on the balancer picker:

~/g/s/g/c/containerd ❯❯❯ sudo GRPC_GO_LOG_SEVERITY_LEVEL=INFO  ctr --debug images pull docker.io/library/wordpress:latest                                                                                                                                                                               update-grpc-110 ✭ ✱ ◼
INFO[0000] connecting to containerd                     
INFO[0000] dialing containerd                            address="/run/containerd/containerd.sock"
2018/03/07 13:38:54 dialing to target with scheme: "unix"
2018/03/07 13:38:54 ccResolverWrapper: sending new addresses to cc: [{run/containerd/containerd.sock 0  <nil>}]
2018/03/07 13:38:54 ClientConn switching balancer to "pick_first"
2018/03/07 13:38:54 pickfirstBalancer: HandleSubConnStateChange: 0xc420077f40, CONNECTING

The first two log lines I added to ctr. "dialing containerd" is at the point that grpc.Dial is called. I get the same problem with or without the patch applied.

@dfawley
Copy link
Member

dfawley commented Mar 7, 2018

Is this with #1889?

What is the actual address that is being passed to grpc.Dial()? Is it "unix:///run/containerd/containerd.sock"?

In this case, you are confusing our target parsing logic. We now follow the naming scheme defined in the grpc spec here: https://github.com/grpc/grpc/blob/master/doc/naming.md#name-syntax. The only exception is our default scheme is "passthrough" instead of "dns", for historical-compatibility reasons.

Because you actually want your dialer to see "unix://</whatever>", you need to manually specify the passthrough resolver like so:

"passthrough:///unix://</whatever>"

(When #1741 is fixed, you won't even need a custom dialer for unix sockets, so this whole mess would not be a problem.)

@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 8, 2018

In this case, you are confusing our target parsing logic.

Will all due respect, I am confusing nothing. We updated the package and now the dialer code doesn't work. Whatever changes were put in have created an incompatibility that is breaking our code.

I get the same output, whether or not #1889 is applied. It doesn't seem to have any effect. The above output is with the patch applied.

I modified the above example to show the address passed to dial:

~/g/s/g/c/containerd ❯❯❯ sudo ctr plugins -d type~=snapshotter                                                                                                                                                                                                                                          update-grpc-110 ✭ ✱ ◼
INFO[0000] connecting to containerd                     
INFO[0000] dialing containerd                            address="/run/containerd/containerd.sock"
INFO[0000] dialing containerd                            address_passed_to_dial="unix:///run/containerd/containerd.sock"
2018/03/08 12:13:02 dialing to target with scheme: "unix"
2018/03/08 12:13:02 ccResolverWrapper: sending new addresses to cc: [{run/containerd/containerd.sock 0  <nil>}]
2018/03/08 12:13:02 ClientConn switching balancer to "pick_first"
2018/03/08 12:13:02 pickfirstBalancer: HandleSubConnStateChange: 0xc4200b3d70, CONNECTING
INFO[0000] address in Dialer wrapper                     address="run/containerd/containerd.sock"
INFO[0000] address in dialer                             address="run/containerd/containerd.sock"
INFO[0000] address trimmed for net.DialTimeout           address="run/containerd/containerd.sock"
ERRO[0000] dial error                                    error="dial unix run/containerd/containerd.sock: connect: no such file or directory"

As you can see, we are passing in the unix:///run/containerd/containerd.sock and when that gets passed to the custom dialer, the address has been stripped to run/containerd/containerd.sock.

Here is the patch to containerd used to debug this:

diff --git a/client.go b/client.go
index 2ac256dd..b9360247 100644
--- a/client.go
+++ b/client.go
@@ -40,6 +40,7 @@ import (
 	"github.com/containerd/containerd/dialer"
 	"github.com/containerd/containerd/errdefs"
 	"github.com/containerd/containerd/images"
+	"github.com/containerd/containerd/log"
 	"github.com/containerd/containerd/namespaces"
 	"github.com/containerd/containerd/platforms"
 	"github.com/containerd/containerd/plugin"
@@ -94,16 +95,23 @@ func New(address string, opts ...ClientOpt) (*Client, error) {
 		)
 	}
 	connector := func() (*grpc.ClientConn, error) {
-		conn, err := grpc.Dial(dialer.DialAddress(address), gopts...)
+
+		log.L.WithField("address", address).Infoln("dialing containerd")
+		addressPassedToDial := dialer.DialAddress(address)
+		log.L.WithField("address_passed_to_dial", addressPassedToDial).Infoln("dialing containerd")
+		conn, err := grpc.Dial(addressPassedToDial, gopts...)
 		if err != nil {
 			return nil, errors.Wrapf(err, "failed to dial %q", address)
 		}
 		return conn, nil
 	}
+	log.L.Infoln("connecting to containerd")
 	conn, err := connector()
 	if err != nil {
 		return nil, err
 	}
+
+	log.L.Infoln("connected to containerd")
 	return &Client{
 		conn:      conn,
 		connector: connector,
diff --git a/cmd/ctr/main.go b/cmd/ctr/main.go
index ec41c59a..a72c8fba 100644
--- a/cmd/ctr/main.go
+++ b/cmd/ctr/main.go
@@ -18,7 +18,6 @@ package main
 
 import (
 	"fmt"
-	"io/ioutil"
 	"log"
 	"os"
 
@@ -45,7 +44,7 @@ var extraCmds = []cli.Command{}
 
 func init() {
 	// Discard grpc logs so that they don't mess with our stdio
-	grpclog.SetLogger(log.New(ioutil.Discard, "", log.LstdFlags))
+	grpclog.SetLogger(log.New(os.Stderr, "", log.LstdFlags))
 
 	cli.VersionPrinter = func(c *cli.Context) {
 		fmt.Println(c.App.Name, version.Package, c.App.Version)
diff --git a/dialer/dialer.go b/dialer/dialer.go
index 766d3449..35dec9ec 100644
--- a/dialer/dialer.go
+++ b/dialer/dialer.go
@@ -20,6 +20,7 @@ import (
 	"net"
 	"time"
 
+	"github.com/containerd/containerd/log"
 	"github.com/pkg/errors"
 )
 
@@ -30,6 +31,7 @@ type dialResult struct {
 
 // Dialer returns a GRPC net.Conn connected to the provided address
 func Dialer(address string, timeout time.Duration) (net.Conn, error) {
+	log.L.WithField("address", address).Infof("address in Dialer wrapper")
 	var (
 		stopC = make(chan struct{})
 		synC  = make(chan *dialResult)
@@ -43,6 +45,7 @@ func Dialer(address string, timeout time.Duration) (net.Conn, error) {
 			default:
 				c, err := dialer(address, timeout)
 				if isNoent(err) {
+					log.L.WithError(err).Error("dial error")
 					<-time.After(10 * time.Millisecond)
 					continue
 				}
diff --git a/dialer/dialer_unix.go b/dialer/dialer_unix.go
index e7d19583..8a3fddc5 100644
--- a/dialer/dialer_unix.go
+++ b/dialer/dialer_unix.go
@@ -25,6 +25,8 @@ import (
 	"strings"
 	"syscall"
 	"time"
+
+	"github.com/containerd/containerd/log"
 )
 
 // DialAddress returns the address with unix:// prepended to the
@@ -47,6 +49,8 @@ func isNoent(err error) bool {
 }
 
 func dialer(address string, timeout time.Duration) (net.Conn, error) {
+	log.L.WithField("address", address).Infof("address in dialer")
 	address = strings.TrimPrefix(address, "unix://")
+	log.L.WithField("address", address).Infof("address trimmed for net.DialTimeout")
 	return net.DialTimeout("unix", address, timeout)
 }

@dfawley
Copy link
Member

dfawley commented Mar 8, 2018

Will all due respect, I am confusing nothing. We updated the package and now the dialer code
doesn't work. Whatever changes were put in have created an incompatibility that is breaking
our code.

The behavior change brings us in line with the gRPC spec. I'm sorry for the breakage, but your input happens to exactly follow the format the spec defines, so it unfortunately doesn't end up following the fallback behavior intended to maintain backward compatibility for most users.

You should be able to fix this by prefixing your target with "passthrough:///". Then the full text after that will be handed directly to your custom dialer, as it was before. Let me know if that doesn't work and I'll take another look.

@dfawley dfawley closed this as completed Mar 8, 2018
@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 8, 2018

@dfawley How is this possibly correct behavior? Is the unix dialing scheme even supported? We still have to inject our own dialer for this to work at all. If the existing dialing stack doesn't understand the scheme, why is it parsing and modifying it? This seems completely broken.

For example, if you parse off the scheme for unix:///foo, the correct answer for the path is /foo, not foo. That doesn't seemed to be clear in the spec, but that is the correct way to handle URIs in most other systems.

@dfawley
Copy link
Member

dfawley commented Mar 8, 2018

Is the unix dialing scheme even supported?

unix scheme support is covered by #1741. Until it is done, you still need to inject your own dialer, and either:

  1. Don't use "unix://" at the start of your target, or
  2. Use "passthrough:///" at the start.

Once unix support is done, you will not need a custom dialer.

For example, if you parse off the scheme for unix:///foo, the correct answer for the path is /foo, not foo. That doesn't seemed to be clear in the spec, but that is the correct way to handle URIs in most other systems.

In the spec, it says: scheme://authority/endpoint_name

It's true that http URLs grab the separating slash along with the path, but in this case, pulling the slash along with it would not be usable; the endpoint name is typically a DNS name like "google.com". Parsing it out as "/google.com" would be a problem.

FWIW, there was a gRFC published for these changes before they were implemented, since they were fairly significant. You can find them at the proposal repo here: https://github.com/grpc/proposal/blob/master/L9-go-resolver-balancer-API.md

@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 9, 2018

@dfawley With this scheme, how would one continue using a custom dialer if the package incorrectly returns the path for the filesystem?

I appreciate the RFC, but it seems like important details still need to be figured out.

@dfawley
Copy link
Member

dfawley commented Mar 9, 2018

@stevvooe When the unix scheme is supported, you'd use it like so:

unix:////a/b/c (i.e. unix:/// specifies the unix scheme with no authority and /a/b/c specifies the path)

You can use your existing custom dialer similarly:

passthrough:///unix:///a/b/c will pass unix:///a/b/c to your dialer, which is free to interpret that as it wishes (e.g. remove unix:// and treat the path as /a/b/c).

@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 9, 2018

unix:////a/b/c

No system in the world implements it this way.

@dfawley
Copy link
Member

dfawley commented Mar 9, 2018

cc @markdroth

@AkihiroSuda
Copy link

I expect the unix:// scheme to be consistent with the file:// scheme, which is defined in RFC 8089, although we might not need the equivalent of the file-auth part.

i.e. +1 for unix:///a/b/c, -1 for unix:////a/b/c 😖 .
Even unix:/a/b/c would be correct as well.

@dfawley
Copy link
Member

dfawley commented Mar 9, 2018

Apparently C implements it this way, despite what is in the gRPC naming spec (which is what we used for our implementation). We'll look into this: #1911.

@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 9, 2018

@dfawley Can we reopen this issue for tracking?

Please us know if you need help with building the use case.

@dfawley
Copy link
Member

dfawley commented Mar 9, 2018

@stevvooe I filed #1911 to track the differences between Go and C.

In the meantime, please use passthrough:///<whatever> to make sure your dialer gets "<whatever>".

@stevvooe
Copy link
Contributor Author

stevvooe commented Mar 9, 2018

In the meantime, please use passthrough:/// to make sure your dialer gets "".

We probably will avoid upgrading, for now. I hope a solution can be found that doesn't break everyone.

I'll also try to attend the community meeting to get a better understanding of the proposal. I know I reviewed earlier versions of this but the current proposal doesn't look very familiar.

@menghanl
Copy link
Contributor

@stevvooe There could be a solution for this issue:

For unix:///foo, the parsed target will be unix, which is not registered.
In this case, if we keep the unparsed target string unix:///foo and pass it to the default passthrough resolver, the dialer will see unix:///foo eventually, and should be able to dial successfully.

Filed #1943 for this.

@menghanl menghanl reopened this Mar 26, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
5 participants