INSERTs can block backends on startup (v17-only) #10281
Comments
@MMeent mentioned that traces look like everything is just slow because of backpressure (compute is doing a lot of writes), so it's not fully stuck, just slow. It's still interesting why this reproduces only with PG 17 and why we can slow everything down with such a simple load. Moving to backlog for now. |
Which backpressure do you mean, compute or PS side? Compute-side backpressure blocks only writers, while PS-side backpressure blocks only readers. Unfortunately, I was not able to reproduce the problem locally; in my case the connect delay never exceeds one second. |
That's interesting. I could reproduce this with:
results in:
By the way, after the script exits, I can see a backend stuck on startup for much longer than 20 seconds:
|
Maybe backpressure throttling happens in a context where some buffer is locked, so starting backends are blocked on this buffer. If you are able to reproduce it (I still cannot), can you please capture stack traces of the involved backends once again? |
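For reference, a minimal way to capture such backtraces (assuming gdb is available and the pids are known; not necessarily the exact commands used in this thread):

```bash
# One-shot backtrace of every thread in a given backend.
# $PID is the pid of the stuck backend, e.g. taken from pg_stat_activity
# or from `ps aux | grep postgres`.
gdb --batch -ex 'thread apply all bt' -p "$PID"

# Or sweep all running postgres backends at once:
for pid in $(pgrep -f postgres); do
  echo "=== backtrace of $pid ==="
  gdb --batch -ex bt -p "$pid"
done
```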
Yes, of course. I've restarted the script, so the pids are different:
This time we see no "backpressure" in the backend's process title, but it changes over time.
So the backtrace of that backend is:
|
Thank you. My hypothesis is that the PS is waiting for an LSN which has not been delivered yet, because the backend that initiated this transaction was blocked by backpressure. But to prove it, it is necessary to inspect the SK WAL and the LSN that 2240648 is waiting for. |
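A sketch of the compute-side half of that inspection (stock PostgreSQL functions only; examining the safekeeper's WAL itself needs Neon tooling not shown here):

```bash
# Compare the insert/flush LSNs on the compute with what the backends
# are waiting on. These are standard PostgreSQL functions, so this only
# covers the compute side of the hypothesis.
psql -c "SELECT pg_current_wal_insert_lsn(), pg_current_wal_flush_lsn();"

# Wait events of the involved backends (a starting backend shows up
# once it has a pg_stat_activity row):
psql -c "SELECT pid, state, wait_event_type, wait_event
         FROM pg_stat_activity;"
```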
One more run:
The backtrace of the first "starting" backend:
I'll try to make a Dockerfile for reproducing this... |
And are you able to reproduce it using a release build? |
Yes, I've reproduced it with the release build too:
Sorry for the delay -- it took me a while to understand that "cargo neon {command}" doesn't work with the release build. |
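For context, a guess at how the release build ends up being run directly (the path and subcommand are assumptions, since `cargo neon` wraps the debug binary by default):

```bash
# Assumed invocation: run the release-built neon_local binary directly
# instead of going through the `cargo neon` alias. The path and the
# subcommand are illustrative and depend on the checkout.
./target/release/neon_local start
```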
And one more time:
with the backtrace of the backend waiting (inside check_hba()):
|
Yes, those are waiting for Pageserver to respond. If you can figure out why Pageserver doesn't respond fast enough then that'd be nice, but based on the info provided here this doesn't seem like it is a Compute issue. |
Yeah, with debug logging added, log_min_messages = DEBUG3, and:
I can see the following for a slow-starting backend:
So each block read by this backend takes about 4 seconds...
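As a rough illustration of extracting that per-block timing from the log (the log format, pid, and file name are assumptions, not taken from this thread):

```bash
# Assumed log_line_prefix of the form '2025-01-08 10:00:04.123 GMT [pid] ...'.
# Print the time delta between consecutive log lines of one backend,
# to confirm the ~4 s per block read.
grep ' \[150080\] ' compute.log | awk '{
  split($2, t, /[:.]/)                       # "10:00:04.123" -> h, m, s, ms
  now = t[1]*3600 + t[2]*60 + t[3] + t[4]/1000
  if (prev) printf "%.3f s  %s\n", now - prev, $0
  prev = now
}'
```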
So it looks like the problem is on the PS side. Unfortunately, I am still not able to reproduce it locally. Can you please send me the pageserver log? Maybe there is some useful information there explaining the PS lag... My current hypothesis is the following:
4 MB of WAL per second seems too low. Also, it is strange to me that this time is so stable. Are you sure that you are using release binaries of Neon? I never started Neon locally using
This is the CPU usage on my system:
As you can see, the backend consumes 100% CPU and the pageserver just 30% (but it may be more bounded by disk I/O). |
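For what it's worth, a simple way to check whether the pageserver is disk-bound rather than CPU-bound (pidstat comes with the sysstat package; the pids are illustrative):

```bash
# Per-process CPU and disk I/O, sampled every second.
# Replace the pids with the actual backend and pageserver pids.
pidstat -u -d -p 150016,150080 1
```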
The following (endless, or just long-running) script, when run against v17:
occasionally blocks other backends on startup, effectively hanging them.
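The full script is attached as repro-hang.txt and is not reproduced verbatim here; a minimal sketch of the same idea (an endless INSERT-heavy load, with assumed table and connection details) could look like:

```bash
#!/bin/bash
# Hypothetical sketch of an endless INSERT load; the table name, row
# shape, and connection defaults are assumptions, not the original script.
psql -c "CREATE TABLE IF NOT EXISTS t(i int, s text);"
while true; do
  psql -c "INSERT INTO t
           SELECT g, repeat('x', 100) FROM generate_series(1, 10000) g;"
done
```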
With this additional script:
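(The script itself is not shown here; a plausible stand-in, assumed rather than original, that measures how long each new connection takes:)

```bash
# Hypothetical monitoring loop: time a brand-new backend's startup once
# per second. `/usr/bin/time -f '%e'` is GNU time's elapsed seconds.
while true; do
  /usr/bin/time -f 'connect + SELECT 1: %e s' psql -c 'SELECT 1' >/dev/null
  sleep 1
done
```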
I can see:
(Note the CPU utilization column.)
where 150080 is waiting inside:
and the backtrace of 150016:
The complete reproducer: repro-hang.txt.