Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock when calling Session.Close() while control connection is reconnecting #1687

Closed
joao-r-reis opened this issue Apr 6, 2023 · 6 comments · Fixed by #1688
Closed

Deadlock when calling Session.Close() while control connection is reconnecting #1687

joao-r-reis opened this issue Apr 6, 2023 · 6 comments · Fixed by #1688

Comments

@joao-r-reis
Copy link
Contributor

joao-r-reis commented Apr 6, 2023

I ran into this while testing #1680

Calling Session.Close() while the control connection is reconnecting can lead to a deadlock.

This can be reproduced with the following integration test:

func TestControlConn_CloseDeadlock(t *testing.T) {
	sleep := atomic.Value{}
	sleep.Store(false)
	hostFilter := HostFilterFunc(func(host *HostInfo) bool {
		if sleep.Load().(bool) {
			// this function will be called for every host so to get a 10s sleep we can divide the 10s by number of hosts
			time.Sleep(time.Duration(10/len(getClusterHosts())) * time.Second)
		}
		return true
	})
	session := createSession(t, func(config *ClusterConfig) {
		config.HostFilter = hostFilter
	})
	sleep.Store(true)
	session.control.getConn().conn.Close()
	time.Sleep(7 * time.Second)
	session.Close()
}

After a few seconds if you dump goroutines you will see that one of them will be stuck here:

runtime.gopark(proc.go:364)
runtime.chansend(chan.go:259)
runtime.chansend1(chan.go:145)
github.com/gocql/gocql.(*controlConn).close(control.go:484)
github.com/gocql/gocql.(*Session).Close(session.go:477)
github.com/gocql/gocql.TestControlConn_CloseDeadlock(integration_test.go:269)
testing.tRunner(testing.go:1446)
testing.(*T).Run.func1(testing.go:1493)
runtime.goexit(asm_amd64.s:1594)
testing.(*T).Run(testing.go:1493)

And another one here:

runtime.gopark(proc.go:364)
runtime.goparkunlock(proc.go:369)
runtime.semacquire1(sema.go:150)
sync.runtime_SemacquireMutex(sema.go:77)
sync.(*RWMutex).RLock(rwmutex.go:71)
github.com/gocql/gocql.(*Session).initialized(session.go:501)
github.com/gocql/gocql.(*controlConn).setupConn(control.go:291)
github.com/gocql/gocql.(*controlConn).reconnect(control.go:371)
github.com/gocql/gocql.(*controlConn).heartBeat(control.go:101)
github.com/gocql/gocql.(*controlConn).connect.func1(control.go:260)
runtime.goexit(asm_amd64.s:1594)
github.com/gocql/gocql.(*controlConn).connect(control.go:260)

controlConn.close() gets stuck writing to the c.quit channel and this happens while the goroutine is holding a Lock for s.sessionStateMu.

The c.quit channel is supposed to be read from controlConn.heartBeat() but if this goroutine is in the process of calling controlConn.reconnect(refreshring bool) then it will eventually call c.session.initialized() (as part of controlConn.setupConn(conn *Conn)) which tries to obtain a RLock for s.sessionStateMu and a deadlock happens.

@martin-sucha
Copy link
Contributor

Thanks!

This could be the same issue as described in #1677.

martin-sucha added a commit to kiwicom/gocql that referenced this issue Apr 6, 2023
We switched from separate mutex for closing to sessionStateMu in
312a614.
This change introduced a deadlock.

We don't need to hold the mutex for the whole duration of Close(),
we only need to update the status atomically.
Previously IsClosed() returned true only after all closing is done
(because of the deferred unlock).
We can emulate that by setting isClosed at the end of Close().
We need a new variable to ensure that Close() is only executed once.

Fixes apache#1687
@joao-r-reis
Copy link
Contributor Author

joao-r-reis commented Apr 6, 2023

Oh sorry, I glanced at that one but I didn't find anything related to batches when reproducing this so I kind of ignored it but you're right it looks like the same issue especially when looking at the stack traces.

@martin-sucha
Copy link
Contributor

I opened a pull request with a possible fix, but I haven't verified it yet. Please let me know if it fixes the issue for you. I will be mostly unavailable next ~10 days so I will look into it in ~2 weeks.

@kevinkyyro
Copy link

I have a deadlock that is pretty similar in description to this in v1.6.0

498297 goroutine 1324045437 [chan send, 113 minutes]:
498298 github.com/gocql/gocql.(*refreshDebouncer).stop(0xc0b826a7c0)
498299         /go/pkg/mod/github.com/gocql/gocql@v1.6.0/host_source.go:848 +0x8c
498300 github.com/gocql/gocql.(*Session).Close(0xc03efb0c00)
498301         /go/pkg/mod/github.com/gocql/gocql@v1.6.0/session.go:494 +0x105
498302 github.com/gocql/gocql.NewSession({{0xc24be58930, 0x3, 0x3}, {0x2ef55cf, 0x5}, 0x4, 0x12a05f200, 0x12a05f200, 0x0, 0x755a, ...})
498303         /go/pkg/mod/github.com/gocql/gocql@v1.6.0/session.go:180 +0x98d
498304 github.com/gocql/gocql.(*ClusterConfig).CreateSession(...)
498305         /go/pkg/mod/github.com/gocql/gocql@v1.6.0/cluster.go:289

It looks like a race condition between (*refreshDebouncer).stop() and (*refreshDebouncer).flusher()

  1. stop() acquires d.mu and sets d.stopped to true
  2. flusher() exits the select at the top of the loop and blocks on acquiring d.mu
  3. stop() releases d.mu and tries to write to d.quit
  4. flusher() acquires d.mu and returns because d.stopped is true
  5. stop() is deadlocked because d.quit is unbuffered and the reader has stopped

@joao-r-reis
Copy link
Contributor Author

I have a deadlock that is pretty similar in description to this in v1.6.0

498297 goroutine 1324045437 [chan send, 113 minutes]:
498298 github.com/gocql/gocql.(*refreshDebouncer).stop(0xc0b826a7c0)
498299         /go/pkg/mod/github.com/gocql/gocql@v1.6.0/host_source.go:848 +0x8c
498300 github.com/gocql/gocql.(*Session).Close(0xc03efb0c00)
498301         /go/pkg/mod/github.com/gocql/gocql@v1.6.0/session.go:494 +0x105
498302 github.com/gocql/gocql.NewSession({{0xc24be58930, 0x3, 0x3}, {0x2ef55cf, 0x5}, 0x4, 0x12a05f200, 0x12a05f200, 0x0, 0x755a, ...})
498303         /go/pkg/mod/github.com/gocql/gocql@v1.6.0/session.go:180 +0x98d
498304 github.com/gocql/gocql.(*ClusterConfig).CreateSession(...)
498305         /go/pkg/mod/github.com/gocql/gocql@v1.6.0/cluster.go:289

It looks like a race condition between (*refreshDebouncer).stop() and (*refreshDebouncer).flusher()

1. `stop()` acquires `d.mu` and sets `d.stopped` to true

2. `flusher()` exits the `select` at the top of the loop and blocks on acquiring `d.mu`

3. `stop()` releases `d.mu` and tries to write to `d.quit`

4. `flusher()` acquires `d.mu` and returns because `d.stopped` is true

5. `stop()` is deadlocked because `d.quit` is unbuffered and the reader has stopped

Nice catch, I suggest you add a new issue with those details, the fix should be pretty simple.

@kevinkyyro
Copy link

Got it, thanks #1752

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants