Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INSERTs can block backends on startup (v17-only) #10281

Open
alexanderlaw opened this issue Jan 4, 2025 · 14 comments
Open

INSERTs can block backends on startup (v17-only) #10281

alexanderlaw opened this issue Jan 4, 2025 · 14 comments
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug

Comments

@alexanderlaw
Copy link

The following (endless, or just running for long) script, when running against v17:

CREATE TABLE tbl(i int, t text) WITH (fillfactor=10);
DO $$
BEGIN   
  LOOP  
    INSERT INTO tbl VALUES (1, repeat('x', 100));
  END LOOP; 
END;

occasionally effectively blocks other backends on startup.

With this additional script:

for i in {1..5}; do delay=$((10 + $RANDOM % 5));
  echo "$i: sleep $delay";
  sleep $delay; date;
  timeout 20 psql -c "SELECT 1" || { echo TIMEOUT; date; ps -Af | grep postgres; exit 1; };
done

I can see:

DROP TABLE
CREATE TABLE
Sat Jan  4 07:12:54 AM EET 2025
 ?column? 
----------
        1
(1 row)

2: sleep 13
Sat Jan  4 07:13:07 AM EET 2025
TIMEOUT
Sat Jan  4 07:13:27 AM EET 2025
a         149878       1  0 07:12 ?        00:00:00 /src/neon/pg_install/v16/bin/postgres -D /src/neon/.neon/storage_controller_db
a         149879  149878  0 07:12 ?        00:00:00 postgres: checkpointer 
a         149880  149878  0 07:12 ?        00:00:00 postgres: background writer 
a         149882  149878  0 07:12 ?        00:00:00 postgres: walwriter 
a         149883  149878  0 07:12 ?        00:00:00 postgres: autovacuum launcher 
a         149884  149878  0 07:12 ?        00:00:00 postgres: logical replication launcher 
a         149890       1  1 07:12 pts/6    00:00:00 /src/neon/target/debug/storage_controller -l 127.0.0.1:1234 --dev --database-url postgresql://localhost:1235/storage_controller --max-offline-interval 10s --max-warming-up-interval 30s --heartbeat-interval 1s --address-for-peers http://127.0.0.1:1234/ --neon-local-repo-dir=/src/neon/.neon
a         149972       1  0 07:12 pts/6    00:00:00 /src/neon/target/debug/compute_ctl --http-port 55433 --pgdata /src/neon/.neon/endpoints/main/pgdata --connstr postgresql://cloud_admin@127.0.0.1:54321/postgres --spec-path /src/neon/.neon/endpoints/main/spec.json --pgbin /src/neon/pg_install/v17/bin/postgres
a         149996  149972  0 07:12 pts/6    00:00:00 /src/neon/pg_install/v17/bin/postgres -D /src/neon/.neon/endpoints/main/pgdata
a         149999  149996  0 07:12 ?        00:00:00 postgres: checkpointer 
a         150000  149996  0 07:12 ?        00:00:00 postgres: background writer 
a         150002  149996  0 07:12 ?        00:00:00 postgres: walwriter 
a         150003  149996  0 07:12 ?        00:00:00 postgres: autovacuum launcher 
a         150004  149996  0 07:12 ?        00:00:00 postgres: Logical replication monitor 
a         150005  149996  0 07:12 ?        00:00:00 postgres: WAL proposer streaming 0/B964F58
a         150006  149996  0 07:12 ?        00:00:00 postgres: logical replication launcher 
a         150009  149996 43 07:12 ?        00:00:19 postgres: cloud_admin postgres 127.0.0.1(41606) idle
a         150011  149996  0 07:12 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(41632) idle
a         150016  149996 29 07:12 ?        00:00:13 postgres: cloud_admin postgres 127.0.0.1(41644) DO
a         150021  149806  0 07:12 pts/6    00:00:00 /src/neon/pg_install/v17/bin/postgres --wal-redo --tenant-shard-id 370ca9cdd59d630227437d0badf85084
a         150022  149996 99 07:12 ?        00:00:43 postgres: cloud_admin template1 127.0.0.1(41658) startup
a         150080  149996 99 07:13 ?        00:00:20 postgres: cloud_admin postgres 127.0.0.1(50152) startup
a         150081  149878  0 07:13 ?        00:00:00 postgres: a storage_controller 127.0.0.1(60100) idle

(Note the cpu utilization column.)

where 150080 is waiting inside:

#0  0x000078497292a007 in epoll_wait (epfd=8, events=0x5f34594a0f38, maxevents=1, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005f34359b9590 in WaitEventSetWaitBlock (set=0x5f34594a0eb8, cur_timeout=0, occurred_events=0x7ffcf6c5a7f0, nevents=1)
    at /src/neon//vendor/postgres-v17/src/backend/storage/ipc/latch.c:1570
#2  0x00005f34359b947a in WaitEventSetWait (set=0x5f34594a0eb8, timeout=0, occurred_events=0x7ffcf6c5a7f0, nevents=1, wait_event_info=117440520)
    at /src/neon//vendor/postgres-v17/src/backend/storage/ipc/latch.c:1516
#3  0x000078497339bc47 in call_PQgetCopyData (shard_no=0, buffer=0x7ffcf6c5a870) at /src/neon//pgxn/neon/libpagestore.c:700
#4  0x000078497339c45f in pageserver_receive (shard_no=0) at /src/neon//pgxn/neon/libpagestore.c:896
#5  0x00007849733a45bf in prefetch_read (slot=0x5f34594b3e00) at /src/neon//pgxn/neon/pagestore_smgr.c:644
#6  0x00007849733a43b5 in prefetch_wait_for (ring_index=13) at /src/neon//pgxn/neon/pagestore_smgr.c:596
#7  0x00007849733a9c7c in neon_read_at_lsnv (rinfo=..., forkNum=MAIN_FORKNUM, base_blockno=8, request_lsns=0x7ffcf6c5ab50, buffers=0x7ffcf6c5af80, nblocks=1, 
    mask=0x7ffcf6c5ae54 "\377\377\377\377") at /src/neon//pgxn/neon/pagestore_smgr.c:2937
#8  0x00007849733aa37c in neon_readv (reln=0x5f345951b770, forknum=MAIN_FORKNUM, blocknum=8, buffers=0x7ffcf6c5af80, nblocks=1)
    at /src/neon//pgxn/neon/pagestore_smgr.c:3167
#9  0x00005f34359efa23 in smgrreadv (reln=0x5f345951b770, forknum=MAIN_FORKNUM, blocknum=8, buffers=0x7ffcf6c5af80, nblocks=1)
    at /src/neon//vendor/postgres-v17/src/backend/storage/smgr/smgr.c:584
#10 0x00005f343599c9d4 in WaitReadBuffers (operation=0x7ffcf6c5b0f0) at /src/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:1517
#11 0x00005f343599c31f in ReadBuffer_common (rel=0x78497238a248, smgr=0x5f345951b770, smgr_persistence=0 '\000', forkNum=MAIN_FORKNUM, blockNum=8, 
    mode=RBM_NORMAL, strategy=0x0) at /src/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:1265
#12 0x00005f343599b6fc in ReadBufferExtended (reln=0x78497238a248, forkNum=MAIN_FORKNUM, blockNum=8, mode=RBM_NORMAL, strategy=0x0)
    at /src/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:816
#13 0x00005f343599b632 in ReadBuffer (reln=0x78497238a248, blockNum=8) at /src/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:752
#14 0x00005f343599e7a4 in ReleaseAndReadBuffer (buffer=0, relation=0x78497238a248, blockNum=8)
    at /src/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:2628
#15 0x00005f34354968a8 in heapam_index_fetch_tuple (scan=0x5f34594fcc50, tid=0x5f34594fc8e0, snapshot=0x5f34594fc990, slot=0x5f34594fd110, 
    call_again=0x5f34594fc8e6, all_dead=0x7ffcf6c5b296) at /src/neon//vendor/postgres-v17/src/backend/access/heap/heapam_handler.c:131
#16 0x00005f34354b0401 in table_index_fetch_tuple (scan=0x5f34594fcc50, tid=0x5f34594fc8e0, snapshot=0x5f34594fc990, slot=0x5f34594fd110, 
    call_again=0x5f34594fc8e6, all_dead=0x7ffcf6c5b296) at /src/neon//vendor/postgres-v17/src/include/access/tableam.h:1255
#17 0x00005f34354b1990 in index_fetch_heap (scan=0x5f34594fc880, slot=0x5f34594fd110) at /src/neon//vendor/postgres-v17/src/backend/access/index/indexam.c:637
#18 0x00005f34354b1b09 in index_getnext_slot (scan=0x5f34594fc880, direction=ForwardScanDirection, slot=0x5f34594fd110)
    at /src/neon//vendor/postgres-v17/src/backend/access/index/indexam.c:697
#19 0x00005f34354af746 in systable_getnext (sysscan=0x5f34594fc830) at /src/neon//vendor/postgres-v17/src/backend/access/index/genam.c:511
#20 0x00005f3435bd98c8 in ScanPgRelation (targetRelId=2964, indexOK=true, force_non_historic=false)
    at /src/neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:387
#21 0x00005f3435bdad90 in RelationBuildDesc (targetRelId=2964, insertIt=true) at /src/neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:1094
#22 0x00005f3435bdcc21 in RelationIdGetRelation (relationId=2964) at /src/neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:2116
#23 0x00005f3435432c64 in relation_open (relationId=2964, lockmode=1) at /src/neon//vendor/postgres-v17/src/backend/access/common/relation.c:58
#24 0x00005f34354f8a0b in table_open (relationId=2964, lockmode=1) at /src/neon//vendor/postgres-v17/src/backend/access/table/table.c:44
#25 0x00005f3435c0bc5a in process_settings (databaseid=5, roleid=10) at /src/neon//vendor/postgres-v17/src/backend/utils/init/postinit.c:1351
#26 0x00005f3435c0b9e9 in InitPostgres (in_dbname=0x5f345949dce8 "postgres", dboid=5, username=0x5f345949dcc8 "cloud_admin", useroid=0, flags=1, 
    out_dbname=0x0) at /src/neon//vendor/postgres-v17/src/backend/utils/init/postinit.c:1234
#27 0x00005f34359f9dc8 in PostgresMain (dbname=0x5f345949dce8 "postgres", username=0x5f345949dcc8 "cloud_admin")
    at /src/neon//vendor/postgres-v17/src/backend/tcop/postgres.c:4334
#28 0x00005f34359f0e2b in BackendMain (startup_data=0x7ffcf6c5b7e4 "", startup_data_len=4)
    at /src/neon//vendor/postgres-v17/src/backend/tcop/backend_startup.c:105
#29 0x00005f3435904f3f in postmaster_child_launch (child_type=B_BACKEND, startup_data=0x7ffcf6c5b7e4 "", startup_data_len=4, client_sock=0x7ffcf6c5b830)
    at /src/neon//vendor/postgres-v17/src/backend/postmaster/launch_backend.c:277
#30 0x00005f343590b490 in BackendStartup (client_sock=0x7ffcf6c5b830) at /src/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:3594
#31 0x00005f34359084aa in ServerLoop () at /src/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1674
#32 0x00005f3435907d6f in PostmasterMain (argc=3, argv=0x5f34593f0950) at /src/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1372

and the backtrace of 150016:

#0  palloc (size=112) at /src/neon//vendor/postgres-v17/src/backend/utils/mmgr/mcxt.c:1323
#1  0x00005f34359fb490 in CreateQueryDesc (plannedstmt=0x5f345955bba0, sourceText=0x5f34594944d8 "INSERT INTO tbl VALUES (1, repeat('x', 100))", 
    snapshot=0x5f3459499de0, crosscheck_snapshot=0x0, dest=0x5f3436010a80 <spi_printtupDR>, params=0x0, queryEnv=0x0, instrument_options=0)
    at /src/neon//vendor/postgres-v17/src/backend/tcop/pquery.c:76
#2  0x00005f3435791b98 in _SPI_execute_plan (plan=0x5f34595576d0, options=0x7ffcf6c5a7f0, snapshot=0x0, crosscheck_snapshot=0x0, fire_triggers=true)
    at /src/neon//vendor/postgres-v17/src/backend/executor/spi.c:2692
#3  0x00005f343578e5f4 in SPI_execute_plan_with_paramlist (plan=0x5f34595576d0, params=0x0, read_only=false, tcount=0)
    at /src/neon//vendor/postgres-v17/src/backend/executor/spi.c:751
#4  0x000078496e238a67 in exec_stmt_execsql (estate=0x7ffcf6c5abe0, stmt=0x5f345955f5f8) at /src/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:4295
#5  0x000078496e233b0a in exec_stmts (estate=0x7ffcf6c5abe0, stmts=0x5f345955f648) at /src/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:2092
#6  0x000078496e234e46 in exec_stmt_loop (estate=0x7ffcf6c5abe0, stmt=0x5f345955f698) at /src/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:2637
#7  0x000078496e2339c6 in exec_stmts (estate=0x7ffcf6c5abe0, stmts=0x5f345955f6c8) at /src/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:2044
#8  0x000078496e2336bb in exec_stmt_block (estate=0x7ffcf6c5abe0, block=0x5f345955f718) at /src/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:1943
#9  0x000078496e232dbf in exec_toplevel_block (estate=0x7ffcf6c5abe0, block=0x5f345955f718) at /src/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:1634
#10 0x000078496e230a4d in plpgsql_exec_function (func=0x5f34595530a8, fcinfo=0x7ffcf6c5adc0, simple_eval_estate=0x5f345954a620, 
    simple_eval_resowner=0x5f34594a2410, procedure_resowner=0x5f34594a2410, atomic=false) at /src/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:623
#11 0x000078496e24dc87 in plpgsql_inline_handler (fcinfo=0x7ffcf6c5aef0) at /src/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_handler.c:368
#12 0x00005f3435bfd552 in FunctionCall1Coll (flinfo=0x7ffcf6c5af50, collation=0, arg1=104678441653040)
    at /src/neon//vendor/postgres-v17/src/backend/utils/fmgr/fmgr.c:1139
#13 0x00005f3435bfe4e6 in OidFunctionCall1Coll (functionId=13651, collation=0, arg1=104678441653040)
    at /src/neon//vendor/postgres-v17/src/backend/utils/fmgr/fmgr.c:1417
#14 0x00005f3435680f1a in ExecuteDoStmt (pstate=0x5f3459549220, stmt=0x5f3459437620, atomic=false)
    at /src/neon//vendor/postgres-v17/src/backend/commands/functioncmds.c:2156
#15 0x00005f34359fedb0 in standard_ProcessUtility (pstmt=0x5f34594376c0, 
    queryString=0x5f3459436a90 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl VALUES (1, repeat('x', 100));\n  END LOOP; \nEND;\n$$;", readOnlyTree=false, 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5f3459437a80, qc=0x7ffcf6c5b560)
    at /src/neon//vendor/postgres-v17/src/backend/tcop/utility.c:707
#16 0x00007849733acbb0 in CheckUnstableExtension (pstmt=0x5f34594376c0, 
    queryString=0x5f3459436a90 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl VALUES (1, repeat('x', 100));\n  END LOOP; \nEND;\n$$;", readOnlyTree=false, 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5f3459437a80, qc=0x7ffcf6c5b560) at /src/neon//pgxn/neon/unstable_extensions.c:92
#17 0x00007849733b83a2 in NeonProcessUtility (pstmt=0x5f34594376c0, 
    queryString=0x5f3459436a90 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl VALUES (1, repeat('x', 100));\n  END LOOP; \nEND;\n$$;", readOnlyTree=false, 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5f3459437a80, qc=0x7ffcf6c5b560) at /src/neon//pgxn/neon/control_plane_connector.c:838
#18 0x00005f34359fe766 in ProcessUtility (pstmt=0x5f34594376c0, 
    queryString=0x5f3459436a90 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl VALUES (1, repeat('x', 100));\n  END LOOP; \nEND;\n$$;", readOnlyTree=false, 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5f3459437a80, qc=0x7ffcf6c5b560)
    at /src/neon//vendor/postgres-v17/src/backend/tcop/utility.c:519

The complete reproducer: repro-hang.txt.

@alexanderlaw alexanderlaw added the t/bug Issue Type: Bug label Jan 4, 2025
@bayandin bayandin added the c/compute Component: compute, excluding postgres itself label Jan 7, 2025
@ololobus
Copy link
Member

ololobus commented Jan 7, 2025

@MMeent mentioned that traces look like everything is just slow because of backpressure (compute is doing a lot of writes), so it's not fully stuck, just slow

It's still interesting why it reproduces only with PG 17 and that we can slow everything down with such a simple load. Moving to backlog for now

@knizhnik
Copy link
Contributor

knizhnik commented Jan 8, 2025

Which backpressure do you mean - compute or PS side? Compute side backpressure blocks only writers, while PS - only readers.
From the attached stack traces of two backends it is clear that one backend (insert) is active (not blocked by compute backpressure) while another one is waiting response from PS. It can be caused server backpressure. But it is strange that it can delay query for so long time.

Unfortunately I was not able to reproduce the problem locally - in my case connect delay never exceeds one second.

@alexanderlaw
Copy link
Author

Unfortunately I was not able to reproduce the problem locally - in my case connect delay never exceeds one second.

That's interesting. I could reproduce this with:

# git log
commit 5c76e2a983295f3123631b0178309a942f584596 (HEAD -> main, origin/main, origin/HEAD)

make -s distclean
CARGO_BUILD_FLAGS="--features=testing" BUILD_TYPE=debug make -j`nproc` -s

wget https://github.com/user-attachments/files/18305706/repro-hang.txt -O repro-hang.sh # the demo script attached above

chmod a+x repro-hang.sh

./repro-hang.sh

results in:

...
Starting existing endpoint main...
Starting postgres node at 'postgresql://cloud_admin@127.0.0.1:54321/postgres'
Wed Jan  8 09:35:25 EET 2025
1: sleep 13
NOTICE:  table "tbl" does not exist, skipping
DROP TABLE
CREATE TABLE
Wed Jan  8 09:35:38 EET 2025
 ?column? 
----------
        1
(1 row)

2: sleep 12
Wed Jan  8 09:35:50 EET 2025
 ?column? 
----------
        1
(1 row)

3: sleep 12
Wed Jan  8 09:36:02 EET 2025
TIMEOUT
Wed Jan  8 09:36:22 EET 2025
a        1948336    3382  0 09:35 ?        00:00:00 /t/neon/pg_install/v16/bin/postgres -D /t/neon/.neon/storage_controller_db
a        1948337 1948336  0 09:35 ?        00:00:00 postgres: checkpointer 
a        1948338 1948336  0 09:35 ?        00:00:00 postgres: background writer 
a        1948340 1948336  0 09:35 ?        00:00:00 postgres: walwriter 
a        1948341 1948336  0 09:35 ?        00:00:00 postgres: autovacuum launcher 
a        1948342 1948336  0 09:35 ?        00:00:00 postgres: logical replication launcher 
a        1948348    3382  1 09:35 pts/1    00:00:00 /t/neon/target/debug/storage_controller -l 127.0.0.1:1234 ...
a        1948434    3382  0 09:35 pts/1    00:00:00 /t/neon/target/debug/compute_ctl --http-port 55433 ...
a        1948457 1948434  0 09:35 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres -D /t/neon/.neon/endpoints/main/pgdata
a        1948460 1948457  0 09:35 ?        00:00:00 postgres: checkpointer 
a        1948461 1948457  0 09:35 ?        00:00:00 postgres: background writer 
a        1948463 1948457  0 09:35 ?        00:00:00 postgres: walwriter 
a        1948464 1948457  0 09:35 ?        00:00:00 postgres: autovacuum launcher 
a        1948465 1948457  0 09:35 ?        00:00:00 postgres: Logical replication monitor 
a        1948466 1948457  1 09:35 ?        00:00:00 postgres: WAL proposer streaming 0/EB48480
a        1948467 1948457  0 09:35 ?        00:00:00 postgres: logical replication launcher 
a        1948470 1948457  1 09:35 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(45782) idle
a        1948472 1948457  0 09:35 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(45798) idle
a        1948477 1948457 26 09:35 ?        00:00:15 postgres: cloud_admin postgres 127.0.0.1(45804) DObackpressure throttling: lag 33920
a        1948480 1948251  0 09:35 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres --wal-redo --tenant-shard-id f3eda4b0adfcb88fad744c0f571572ac
a        1948481 1948457  0 09:35 ?        00:00:00 postgres: cloud_admin template1 127.0.0.1(45822) startup
a        1948578 1948457  0 09:36 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(51802) startup
a        1948592 1948336  0 09:36 ?        00:00:00 postgres: a storage_controller 127.0.0.1(38022) idle
a        1948595 1946729  0 09:36 pts/1    00:00:00 grep postgres

By the way, after the script exits, I can see a backend stuck on startup for much longer than 20 seconds:

ps -Af |grep post; date
a        1948481 1948457  0 09:35 ?        00:00:00 postgres: cloud_admin template1 127.0.0.1(45822) startup
...
Wed Jan  8 09:40:03 EET 2025

@knizhnik
Copy link
Contributor

knizhnik commented Jan 8, 2025

May be backpressure throtling happens in context where some buffer is locked and so started backends are blocked on this buffer. If you are able to reproduce it (I still not), can you please once again capture stack traces of involved backends.
First of all one blocked in backpressure (in process list above - 1948477

@alexanderlaw
Copy link
Author

Yes, of course, though. I've restarted the script, so pids are different:

3: sleep 14
Wed Jan  8 13:28:17 EET 2025
TIMEOUT
Wed Jan  8 13:28:37 EET 2025
a        2240445    3382  0 13:27 ?        00:00:00 /t/neon/pg_install/v16/bin/postgres -D /t/neon/.neon/storage_controller_db
a        2240446 2240445  0 13:27 ?        00:00:00 postgres: checkpointer 
a        2240447 2240445  0 13:27 ?        00:00:00 postgres: background writer 
a        2240449 2240445  0 13:27 ?        00:00:00 postgres: walwriter 
a        2240450 2240445  0 13:27 ?        00:00:00 postgres: autovacuum launcher 
a        2240451 2240445  0 13:27 ?        00:00:00 postgres: logical replication launcher 
a        2240457    3382  1 13:27 pts/1    00:00:00 /t/neon/target/debug/storage_controller -l 127.0.0.1:1234 ...
a        2240537    3382  0 13:27 pts/1    00:00:00 /t/neon/target/debug/compute_ctl ...
a        2240562 2240537  0 13:27 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres -D /t/neon/.neon/endpoints/main/pgdata
a        2240565 2240562  0 13:27 ?        00:00:00 postgres: checkpointer 
a        2240566 2240562  0 13:27 ?        00:00:00 postgres: background writer 
a        2240568 2240562  0 13:27 ?        00:00:00 postgres: walwriter 
a        2240569 2240562  0 13:27 ?        00:00:00 postgres: autovacuum launcher 
a        2240570 2240562  0 13:27 ?        00:00:00 postgres: Logical replication monitor 
a        2240571 2240562  1 13:27 ?        00:00:00 postgres: WAL proposer streaming 0/FD7E3C8
a        2240572 2240562  0 13:27 ?        00:00:00 postgres: logical replication launcher 
a        2240576 2240562  1 13:27 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(34026) SELECT
a        2240577 2240562  0 13:27 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(34040) idle
a        2240583 2240562 27 13:27 ?        00:00:15 postgres: cloud_admin postgres 127.0.0.1(34054) DO
a        2240584 2240562  0 13:27 ?        00:00:00 postgres: cloud_admin template1 127.0.0.1(34058) startup
a        2240586 2240358  0 13:27 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres --wal-redo --tenant-shard-id c96d361c9e4828a93db9d3fc6cff9a58
a        2240648 2240562  0 13:28 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(44848) startup
a        2240663 2240445  0 13:28 ?        00:00:00 postgres: a storage_controller 127.0.0.1(54602) idle
a        2240672 2240334  0 13:28 pts/1    00:00:00 grep postgres

This time we see no "backpressure" in the backend title, but it changes over time.

$ ps -Af | grep 2240583
a        2240583 2240562 26 13:27 ?        00:00:50 postgres: cloud_admin postgres 127.0.0.1(34054) DO
a        2240839 1804001  0 13:30 pts/1    00:00:00 grep --color=auto 2240583
$ ps -Af | grep 2240583
a        2240583 2240562 26 13:27 ?        00:00:50 postgres: cloud_admin postgres 127.0.0.1(34054) DObackpressure throttling: lag 11560
a        2240841 1804001  0 13:30 pts/1    00:00:00 grep --color=auto 2240583
$ ps -Af | grep 2240583
a        2240583 2240562 26 13:27 ?        00:00:50 postgres: cloud_admin postgres 127.0.0.1(34054) DO
a        2240847 1804001  0 13:30 pts/1    00:00:00 grep --color=auto 2240583

So the backtrace of that backend is:

Attaching to process 2240583
...
(gdb) bt
#0  0x000077911feeca7a in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7fffdef5c460, rem=0x0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x000077911fef9a27 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x0000634cc2023862 in pg_usleep (microsec=10000) at /t/neon//vendor/postgres-v17/src/port/pgsleep.c:50
#3  0x0000779120ad0651 in backpressure_throttling_impl () at /t/neon//pgxn/neon/walproposer_pg.c:461
#4  0x0000634cc1da3a04 in ProcessInterrupts () at /t/neon//vendor/postgres-v17/src/backend/tcop/postgres.c:3507
#5  0x0000634cc1b2502e in ExecResult (pstate=0x634cc790f830) at /t/neon//vendor/postgres-v17/src/backend/executor/nodeResult.c:74
#6  0x0000634cc1b1a655 in ExecProcNode (node=0x634cc790f830) at /t/neon//vendor/postgres-v17/src/include/executor/executor.h:274
#7  0x0000634cc1b20c9a in ExecModifyTable (pstate=0x634cc790f380)
    at /t/neon//vendor/postgres-v17/src/backend/executor/nodeModifyTable.c:3912
#8  0x0000634cc1add529 in ExecProcNodeFirst (node=0x634cc790f380)
    at /t/neon//vendor/postgres-v17/src/backend/executor/execProcnode.c:464
#9  0x0000634cc1ad0600 in ExecProcNode (node=0x634cc790f380) at /t/neon//vendor/postgres-v17/src/include/executor/executor.h:274
#10 0x0000634cc1ad3463 in ExecutePlan (estate=0x634cc790f100, planstate=0x634cc790f380, use_parallel_mode=false, 
    operation=CMD_INSERT, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x634cc23bba80 <spi_printtupDR>, 
    execute_once=true) at /t/neon//vendor/postgres-v17/src/backend/executor/execMain.c:1654
#11 0x0000634cc1ad0d4e in standard_ExecutorRun (queryDesc=0x634cc79030b0, direction=ForwardScanDirection, count=0, 
    execute_once=true) at /t/neon//vendor/postgres-v17/src/backend/executor/execMain.c:365
#12 0x0000634cc1ad0b24 in ExecutorRun (queryDesc=0x634cc79030b0, direction=ForwardScanDirection, count=0, execute_once=true)
    at /t/neon//vendor/postgres-v17/src/backend/executor/execMain.c:306
#13 0x0000634cc1b3d1c0 in _SPI_pquery (queryDesc=0x634cc79030b0, fire_triggers=true, tcount=0)
    at /t/neon//vendor/postgres-v17/src/backend/executor/spi.c:2932
#14 0x0000634cc1b3cbd1 in _SPI_execute_plan (plan=0x634cc785ac00, options=0x7fffdef5c960, snapshot=0x0, crosscheck_snapshot=0x0, 
    fire_triggers=true) at /t/neon//vendor/postgres-v17/src/backend/executor/spi.c:2699
#15 0x0000634cc1b395f4 in SPI_execute_plan_with_paramlist (plan=0x634cc785ac00, params=0x0, read_only=false, tcount=0)
    at /t/neon//vendor/postgres-v17/src/backend/executor/spi.c:751
#16 0x000077911db97a67 in exec_stmt_execsql (estate=0x7fffdef5cd50, stmt=0x634cc7905ba8)
    at /t/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:4295
#17 0x000077911db92b0a in exec_stmts (estate=0x7fffdef5cd50, stmts=0x634cc7905bf8)
    at /t/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:2092
#18 0x000077911db93e46 in exec_stmt_loop (estate=0x7fffdef5cd50, stmt=0x634cc7905c48)
    at /t/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:2637
#19 0x000077911db929c6 in exec_stmts (estate=0x7fffdef5cd50, stmts=0x634cc7905c78)
    at /t/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:2044
#20 0x000077911db926bb in exec_stmt_block (estate=0x7fffdef5cd50, block=0x634cc7905cc8)
    at /t/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:1943
#21 0x000077911db91dbf in exec_toplevel_block (estate=0x7fffdef5cd50, block=0x634cc7905cc8)
    at /t/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:1634
#22 0x000077911db8fa4d in plpgsql_exec_function (func=0x634cc78f31b8, fcinfo=0x7fffdef5cf30, simple_eval_estate=0x634cc78e6320, 
    simple_eval_resowner=0x634cc783e3d0, procedure_resowner=0x634cc783e3d0, atomic=false)
    at /t/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_exec.c:623
#23 0x000077911dbacc87 in plpgsql_inline_handler (fcinfo=0x7fffdef5d060)
    at /t/neon//vendor/postgres-v17/src/pl/plpgsql/src/pl_handler.c:368
#24 0x0000634cc1fa8552 in FunctionCall1Coll (flinfo=0x7fffdef5d0c0, collation=0, arg1=109181416626224)
    at /t/neon//vendor/postgres-v17/src/backend/utils/fmgr/fmgr.c:1139
#25 0x0000634cc1fa94e6 in OidFunctionCall1Coll (functionId=13651, collation=0, arg1=109181416626224)
    at /t/neon//vendor/postgres-v17/src/backend/utils/fmgr/fmgr.c:1417
#26 0x0000634cc1a2bf1a in ExecuteDoStmt (pstate=0x634cc78dd720, stmt=0x634cc77d3520, atomic=false)
    at /t/neon//vendor/postgres-v17/src/backend/commands/functioncmds.c:2156
#27 0x0000634cc1da9db0 in standard_ProcessUtility (pstmt=0x634cc77d35c0, 
    queryString=0x634cc77d2990 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl ...
    at /t/neon//vendor/postgres-v17/src/backend/tcop/utility.c:707
#28 0x0000779120ac9d6a in CheckUnstableExtension (pstmt=0x634cc77d35c0, 
    queryString=0x634cc77d2990 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl ...
    at /t/neon//pgxn/neon/unstable_extensions.c:92
#29 0x0000779120ad555c in NeonProcessUtility (pstmt=0x634cc77d35c0, 
    queryString=0x634cc77d2990 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl ...
    at /t/neon//pgxn/neon/control_plane_connector.c:838
#30 0x0000634cc1da9766 in ProcessUtility (pstmt=0x634cc77d35c0, 
    queryString=0x634cc77d2990 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl ...
    at /t/neon//vendor/postgres-v17/src/backend/tcop/utility.c:519
#31 0x0000634cc1da8010 in PortalRunUtility (portal=0x634cc7885eb0, pstmt=0x634cc77d35c0, isTopLevel=true, setHoldSnapshot=false, 
    dest=0x634cc77d3980, qc=0x7fffdef5d6d0) at /t/neon//vendor/postgres-v17/src/backend/tcop/pquery.c:1158
#32 0x0000634cc1da8287 in PortalRunMulti (portal=0x634cc7885eb0, isTopLevel=true, setHoldSnapshot=false, dest=0x634cc77d3980, 
    altdest=0x634cc77d3980, qc=0x7fffdef5d6d0) at /t/neon//vendor/postgres-v17/src/backend/tcop/pquery.c:1315
#33 0x0000634cc1da76d1 in PortalRun (portal=0x634cc7885eb0, count=9223372036854775807, isTopLevel=true, run_once=true, 
    dest=0x634cc77d3980, altdest=0x634cc77d3980, qc=0x7fffdef5d6d0) at /t/neon//vendor/postgres-v17/src/backend/tcop/pquery.c:791
#34 0x0000634cc1d9ffd0 in exec_simple_query (
    query_string=0x634cc77d2990 "DO $$\nBEGIN   \n  LOOP  \n    INSERT INTO tbl ...)
 at /t/neon//vendor/postgres-v17/src/backend/tcop/postgres.c:1280
#35 0x0000634cc1da543f in PostgresMain (dbname=0x634cc7839cf8 "postgres", username=0x634cc7839cd8 "cloud_admin")
    at /t/neon//vendor/postgres-v17/src/backend/tcop/postgres.c:4777
#36 0x0000634cc1d9be2b in BackendMain (startup_data=0x7fffdef5d954 "", startup_data_len=4)
    at /t/neon//vendor/postgres-v17/src/backend/tcop/backend_startup.c:105
#37 0x0000634cc1caff3f in postmaster_child_launch (child_type=B_BACKEND, startup_data=0x7fffdef5d954 "", startup_data_len=4, 
    client_sock=0x7fffdef5d9a0) at /t/neon//vendor/postgres-v17/src/backend/postmaster/launch_backend.c:277
#38 0x0000634cc1cb6490 in BackendStartup (client_sock=0x7fffdef5d9a0)
    at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:3594
#39 0x0000634cc1cb34aa in ServerLoop () at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1674
#40 0x0000634cc1cb2d6f in PostmasterMain (argc=3, argv=0x634cc778d2f0)
    at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1372
#41 0x0000634cc1b60f63 in main (argc=3, argv=0x634cc778d2f0) at /t/neon//vendor/postgres-v17/src/backend/main/main.c:237

@knizhnik
Copy link
Contributor

knizhnik commented Jan 8, 2025

Thank you.
This state sees to be ok - unlikely that this backend is holding any buffers.
And what is the status of 2240648?

My hypothesis is that PS is witting for LSN which was not delivered yet, because backend initiated this transaction was blocked by backpressure. But to prove it, it is necessary to inspect SK WAL and LSN 2240648 is waiting for.

@alexanderlaw
Copy link
Author

One more run:

2: sleep 14
Wed Jan  8 15:32:46 EET 2025
TIMEOUT
Wed Jan  8 15:33:06 EET 2025
a        2261746    3382  0 15:32 ?        00:00:00 /t/neon/pg_install/v16/bin/postgres -D /t/neon/.neon/storage_controller_db
a        2261747 2261746  0 15:32 ?        00:00:00 postgres: checkpointer 
a        2261748 2261746  0 15:32 ?        00:00:00 postgres: background writer 
a        2261750 2261746  0 15:32 ?        00:00:00 postgres: walwriter 
a        2261751 2261746  0 15:32 ?        00:00:00 postgres: autovacuum launcher 
a        2261752 2261746  0 15:32 ?        00:00:00 postgres: logical replication launcher 
a        2261758    3382  1 15:32 pts/1    00:00:00 /t/neon/target/debug/storage_controller -l 127.0.0.1:1234 --dev --database-url postgresql://localhost:1235/storage_controller --max-offline-interval 10s --max-warming-up-interval 30s --heartbeat-interval 1s --address-for-peers http://127.0.0.1:1234/ --neon-local-repo-dir=/t/neon/.neon
a        2261840    3382  0 15:32 pts/1    00:00:00 /t/neon/target/debug/compute_ctl --http-port 55433 --pgdata /t/neon/.neon/endpoints/main/pgdata --connstr postgresql://cloud_admin@127.0.0.1:54321/postgres --spec-path /t/neon/.neon/endpoints/main/spec.json --pgbin /t/neon/pg_install/v17/bin/postgres
a        2261863 2261840  0 15:32 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres -D /t/neon/.neon/endpoints/main/pgdata
a        2261866 2261863  0 15:32 ?        00:00:00 postgres: checkpointer 
a        2261867 2261863  0 15:32 ?        00:00:00 postgres: background writer 
a        2261869 2261863  0 15:32 ?        00:00:00 postgres: walwriter 
a        2261870 2261863  0 15:32 ?        00:00:00 postgres: autovacuum launcher 
a        2261871 2261863  0 15:32 ?        00:00:00 postgres: Logical replication monitor 
a        2261872 2261863  1 15:32 ?        00:00:00 postgres: WAL proposer streaming 0/E92C140
a        2261873 2261863  0 15:32 ?        00:00:00 postgres: logical replication launcher 
a        2261876 2261863  1 15:32 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(43508) idle
a        2261877 2261863  0 15:32 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(43524) idle
a        2261883 2261863 28 15:32 ?        00:00:13 postgres: cloud_admin postgres 127.0.0.1(43548) DObackpressure throttling: lag 49472
a        2261886 2261673  0 15:32 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres --wal-redo --tenant-shard-id 165a721a1366d3897fb823e2882f05f6
a        2261887 2261863  0 15:32 ?        00:00:00 postgres: cloud_admin template1 127.0.0.1(43562) startup
a        2261917 2261746  0 15:32 ?        00:00:00 postgres: a storage_controller 127.0.0.1(34110) idle
a        2261927 2261863  0 15:32 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(36724) startup

The backtrace of the first "starting" backend:

Attaching to process 2261887

(gdb) bt
#0  0x00007a8312f2a007 in epoll_wait (epfd=8, events=0x5e3dd4a1db68, maxevents=1, timeout=10000)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005e3dae484590 in WaitEventSetWaitBlock (set=0x5e3dd4a1dae8, cur_timeout=10000, occurred_events=0x7ffe8840ec50, nevents=1)
    at /t/neon//vendor/postgres-v17/src/backend/storage/ipc/latch.c:1570
#2  0x00005e3dae48447a in WaitEventSetWait (set=0x5e3dd4a1dae8, timeout=10000, occurred_events=0x7ffe8840ec50, nevents=1, 
    wait_event_info=117440520) at /t/neon//vendor/postgres-v17/src/backend/storage/ipc/latch.c:1516
#3  0x00007a83139a1c66 in call_PQgetCopyData (shard_no=0, buffer=0x7ffe8840ecd0) at /t/neon//pgxn/neon/libpagestore.c:700
#4  0x00007a83139a2474 in pageserver_receive (shard_no=0) at /t/neon//pgxn/neon/libpagestore.c:896
#5  0x00007a83139aa5d4 in prefetch_read (slot=0x5e3dd4a31b40) at /t/neon//pgxn/neon/pagestore_smgr.c:644
#6  0x00007a83139aa3ca in prefetch_wait_for (ring_index=131) at /t/neon//pgxn/neon/pagestore_smgr.c:596
#7  0x00007a83139afe36 in neon_read_at_lsnv (rinfo=..., forkNum=MAIN_FORKNUM, base_blockno=19, request_lsns=0x7ffe8840efb0, 
    buffers=0x7ffe8840f3e0, nblocks=1, mask=0x7ffe8840f2b4 "\377\377\377\377") at /t/neon//pgxn/neon/pagestore_smgr.c:2944
#8  0x00007a83139b0536 in neon_readv (reln=0x5e3dd4a997b8, forknum=MAIN_FORKNUM, blocknum=19, buffers=0x7ffe8840f3e0, nblocks=1)
    at /t/neon//pgxn/neon/pagestore_smgr.c:3174
#9  0x00005e3dae4baa23 in smgrreadv (reln=0x5e3dd4a997b8, forknum=MAIN_FORKNUM, blocknum=19, buffers=0x7ffe8840f3e0, nblocks=1)
    at /t/neon//vendor/postgres-v17/src/backend/storage/smgr/smgr.c:584
#10 0x00005e3dae4679d4 in WaitReadBuffers (operation=0x5e3dd4a7b900)
    at /t/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:1517
#11 0x00005e3dae464832 in read_stream_next_buffer (stream=0x5e3dd4a7b818, per_buffer_data=0x0)
    at /t/neon//vendor/postgres-v17/src/backend/storage/aio/read_stream.c:666
#12 0x00005e3dadf4ec51 in heap_fetch_next_buffer (scan=0x5e3dd4a7ac30, dir=ForwardScanDirection)
    at /t/neon//vendor/postgres-v17/src/backend/access/heap/heapam.c:649
#13 0x00005e3dadf4f409 in heapgettup_pagemode (scan=0x5e3dd4a7ac30, dir=ForwardScanDirection, nkeys=2, key=0x5e3dd4a7bb38)
    at /t/neon//vendor/postgres-v17/src/backend/access/heap/heapam.c:1011
#14 0x00005e3dadf4fcfd in heap_getnextslot (sscan=0x5e3dd4a7ac30, direction=ForwardScanDirection, slot=0x5e3dd4a7b0d0)
    at /t/neon//vendor/postgres-v17/src/backend/access/heap/heapam.c:1340
#15 0x00005e3dadf79cdb in table_scan_getnextslot (sscan=0x5e3dd4a7ac30, direction=ForwardScanDirection, slot=0x5e3dd4a7b0d0)
    at /t/neon//vendor/postgres-v17/src/include/access/tableam.h:1071
#16 0x00005e3dadf7a80c in systable_getnext (sysscan=0x5e3dd4a7a7f0)
    at /t/neon//vendor/postgres-v17/src/backend/access/index/genam.c:532
#17 0x00005e3dae6a4fd1 in RelationBuildTupleDesc (relation=0x7a83129f2b98)
    at /t/neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:573
#18 0x00005e3dae6a5fe2 in RelationBuildDesc (targetRelId=2659, insertIt=true)
    at /t/neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:1185
#19 0x00005e3dae6ab5af in load_critical_index (indexoid=2659, heapoid=1249)
    at /t/neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:4400
#20 0x00005e3dae6ab09f in RelationCacheInitializePhase3 () at /t/neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:4177
#21 0x00005e3dae6d697f in InitPostgres (in_dbname=0x5e3dd4a1bd18 "template1", dboid=1, username=0x5e3dd4a1bcf8 "cloud_admin", 
    useroid=0, flags=1, out_dbname=0x0) at /t/neon//vendor/postgres-v17/src/backend/utils/init/postinit.c:1210
#22 0x00005e3dae4c4dc8 in PostgresMain (dbname=0x5e3dd4a1bd18 "template1", username=0x5e3dd4a1bcf8 "cloud_admin")
--Type <RET> for more, q to quit, c to continue without paging--
    at /t/neon//vendor/postgres-v17/src/backend/tcop/postgres.c:4334
#23 0x00005e3dae4bbe2b in BackendMain (startup_data=0x7ffe8840fba4 "", startup_data_len=4)
    at /t/neon//vendor/postgres-v17/src/backend/tcop/backend_startup.c:105
#24 0x00005e3dae3cff3f in postmaster_child_launch (child_type=B_BACKEND, startup_data=0x7ffe8840fba4 "", startup_data_len=4, 
    client_sock=0x7ffe8840fbf0) at /t/neon//vendor/postgres-v17/src/backend/postmaster/launch_backend.c:277
#25 0x00005e3dae3d6490 in BackendStartup (client_sock=0x7ffe8840fbf0)
    at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:3594
#26 0x00005e3dae3d34aa in ServerLoop () at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1674
#27 0x00005e3dae3d2d6f in PostmasterMain (argc=3, argv=0x5e3dd496f2f0)
    at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1372
#28 0x00005e3dae280f63 in main (argc=3, argv=0x5e3dd496f2f0) at /t/neon//vendor/postgres-v17/src/backend/main/main.c:237

(gdb) f 7
#7  0x00007a83139afe36 in neon_read_at_lsnv (rinfo=..., forkNum=MAIN_FORKNUM, base_blockno=19, request_lsns=0x7ffe8840efb0, 
    buffers=0x7ffe8840f3e0, nblocks=1, mask=0x7ffe8840f2b4 "\377\377\377\377") at /t/neon//pgxn/neon/pagestore_smgr.c:2944
2944                    } while (!prefetch_wait_for(ring_index));

(gdb) p rinfo
$1 = {spcOid = 1663, dbOid = 1, relNumber = 1249}
(gdb) p *request_lsns 
$2 = {request_lsn = 18446744073709551615, not_modified_since = 900516352, effective_request_lsn = 900623288}

I'll try to make a docker file for reproducing this...

@knizhnik
Copy link
Contributor

knizhnik commented Jan 8, 2025

And are you able to reproduce it using release build?

@alexanderlaw
Copy link
Author

Yes, I've reproduced it with release too:

3: sleep 10
Wed Jan  8 17:09:34 EET 2025
TIMEOUT
Wed Jan  8 17:09:54 EET 2025
a        2540589    3382  0 17:07 ?        00:00:00 /t/neon/pg_install/v16/bin/postgres -D /t/neon/.neon/storage_controller_db
a        2540590 2540589  0 17:07 ?        00:00:00 postgres: checkpointer 
a        2540591 2540589  0 17:07 ?        00:00:00 postgres: background writer 
a        2540593 2540589  0 17:07 ?        00:00:00 postgres: walwriter 
a        2540594 2540589  0 17:07 ?        00:00:00 postgres: autovacuum launcher 
a        2540595 2540589  0 17:07 ?        00:00:00 postgres: logical replication launcher 
a        2540601    3382  0 17:07 pts/1    00:00:00 /t/neon/target/release/storage_controller -l 127.0.0.1:1234 ...
a        2540733 2540513  0 17:07 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres --wal-redo --tenant-shard-id 1e5543033d16ba5550a00e279cf4c1a5
a        2540958    3382  0 17:08 pts/1    00:00:00 /t/neon/target/release/compute_ctl --http-port 55433 ...
a        2540978 2540958  0 17:08 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres -D /t/neon/.neon/endpoints/main/pgdata
a        2540981 2540978  0 17:08 ?        00:00:00 postgres: checkpointer 
a        2540982 2540978  0 17:08 ?        00:00:00 postgres: background writer 
a        2540984 2540978  0 17:08 ?        00:00:00 postgres: walwriter 
a        2540985 2540978  0 17:08 ?        00:00:00 postgres: autovacuum launcher 
a        2540986 2540978  0 17:08 ?        00:00:00 postgres: Logical replication monitor 
a        2540987 2540978  4 17:08 ?        00:00:02 postgres: WAL proposer streaming 2/21EA6900
a        2540988 2540978  0 17:08 ?        00:00:00 postgres: logical replication launcher 
a        2540993 2540978  0 17:08 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(60058) BIND
a        2541000 2540978 53 17:08 ?        00:00:32 postgres: cloud_admin postgres 127.0.0.1(60074) DObackpressure throttling: lag 92416
a        2541051 2540589  0 17:09 ?        00:00:00 postgres: a storage_controller 127.0.0.1(49328) idle
a        2541097 2540978  0 17:09 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(59252) authentication
a        2541117 2540488  0 17:09 pts/1    00:00:00 grep postgres

Sorry for the delay -- it took me a while to understand that "cargo neon {command}" doesn't work with the release build.

@alexanderlaw
Copy link
Author

And one more time:

a        2543400 2543380  0 17:28 pts/1    00:00:00 /t/neon/pg_install/v17/bin/postgres -D /t/neon/.neon/endpoints/main/pgdata
a        2543403 2543400  0 17:28 ?        00:00:00 postgres: checkpointer 
a        2543404 2543400  0 17:28 ?        00:00:00 postgres: background writer 
a        2543406 2543400  0 17:28 ?        00:00:00 postgres: walwriter 
a        2543407 2543400  0 17:28 ?        00:00:00 postgres: autovacuum launcher 
a        2543408 2543400  0 17:28 ?        00:00:00 postgres: Logical replication monitor 
a        2543409 2543400  5 17:28 ?        00:00:02 postgres: WAL proposer streaming 2/2EA7F078
a        2543410 2543400  0 17:28 ?        00:00:00 postgres: logical replication launcher 
a        2543415 2543400  0 17:28 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(41172) BIND
a        2543420 2543400 53 17:28 ?        00:00:31 postgres: cloud_admin postgres 127.0.0.1(41176) DObackpressure throttling: lag 53368
a        2543527 2543400  0 17:29 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(52574) authentication
a        2543536 2542989  0 17:29 ?        00:00:00 postgres: a storage_controller 127.0.0.1(59410) idle
a        2543549 2542989  0 17:29 ?        00:00:00 postgres: a storage_controller 127.0.0.1(53650) idle

with the backtrace of the backend waiting (inside check_hba()):

Attaching to process 2543527

(gdb) bt
#0  0x00007edc6e52a007 in epoll_wait (epfd=8, events=0x5d4696360060, maxevents=1, timeout=timeout@entry=9999)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005d46793577d9 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffc91fb10c0, cur_timeout=9999, set=0x5d469635ffe0)
    at /t/neon//vendor/postgres-v17/src/backend/storage/ipc/latch.c:1570
#2  WaitEventSetWait (set=0x5d469635ffe0, timeout=<optimized out>, occurred_events=occurred_events@entry=0x7ffc91fb1240, 
    nevents=nevents@entry=1, wait_event_info=<optimized out>) at /t/neon//vendor/postgres-v17/src/backend/storage/ipc/latch.c:1516
#3  0x00007edc6efaff79 in call_PQgetCopyData (buffer=0x7ffc91fb1220, shard_no=0) at /t/neon//pgxn/neon/libpagestore.c:700
#4  pageserver_receive (shard_no=0) at /t/neon//pgxn/neon/libpagestore.c:896
#5  0x00007edc6efb564a in prefetch_read (slot=0x5d4696373f88) at /t/neon//pgxn/neon/pagestore_smgr.c:644
#6  prefetch_wait_for (ring_index=ring_index@entry=0) at /t/neon//pgxn/neon/pagestore_smgr.c:596
#7  0x00007edc6efb93de in neon_read_at_lsnv (rinfo=..., forkNum=forkNum@entry=MAIN_FORKNUM, base_blockno=base_blockno@entry=0, 
    request_lsns=request_lsns@entry=0x7ffc91fb1510, buffers=buffers@entry=0x7ffc91fb1920, nblocks=<optimized out>, 
    mask=<optimized out>) at /t/neon//pgxn/neon/pagestore_smgr.c:2944
#8  0x00007edc6efb96d1 in neon_readv (reln=0x5d46963db408, forknum=MAIN_FORKNUM, blocknum=0, buffers=0x7ffc91fb1920, nblocks=1)
    at /t/neon//pgxn/neon/pagestore_smgr.c:3174
#9  0x00005d46793453ad in WaitReadBuffers (operation=operation@entry=0x7ffc91fb1ad0)
    at /t/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:1517
#10 0x00005d4679349d90 in ReadBuffer_common (strategy=0x0, mode=RBM_NORMAL, blockNum=0, forkNum=MAIN_FORKNUM, 
    smgr_persistence=0 '\000', smgr=<optimized out>, rel=<optimized out>)
    at /t/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:1265
#11 ReadBufferExtended (reln=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=0, mode=mode@entry=RBM_NORMAL, 
    strategy=strategy@entry=0x0) at /t/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:816
#12 0x00005d467934a1a2 in ReadBuffer (reln=<optimized out>, blockNum=<optimized out>)
    at /t/neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:752
#13 0x00005d467908b78a in _bt_getbuf (access=1, blkno=0, rel=0x7edc6e2a0870)
    at /t/neon//vendor/postgres-v17/src/backend/access/nbtree/nbtpage.c:852
#14 _bt_metaversion (rel=rel@entry=0x7edc6e2a0870, heapkeyspace=heapkeyspace@entry=0x7ffc91fb25d0, 
    allequalimage=allequalimage@entry=0x7ffc91fb25d1) at /t/neon//vendor/postgres-v17/src/backend/access/nbtree/nbtpage.c:747
#15 0x00005d46790922af in _bt_first (scan=scan@entry=0x5d46963bc690, dir=dir@entry=ForwardScanDirection)
    at /t/neon//vendor/postgres-v17/src/backend/access/nbtree/nbtsearch.c:1420
#16 0x00005d467908e455 in btgettuple (scan=0x5d46963bc690, dir=ForwardScanDirection)
    at /t/neon//vendor/postgres-v17/src/backend/access/nbtree/nbtree.c:224
#17 0x00005d4679086185 in index_getnext_tid (scan=scan@entry=0x5d46963bc690, direction=direction@entry=ForwardScanDirection)
    at /t/neon//vendor/postgres-v17/src/backend/access/index/indexam.c:590
#18 0x00005d46790862f3 in index_getnext_slot (scan=0x5d46963bc690, direction=direction@entry=ForwardScanDirection, 
    slot=0x5d46963bc500) at /t/neon//vendor/postgres-v17/src/backend/access/index/indexam.c:682
#19 0x00005d4679085596 in systable_getnext (sysscan=sysscan@entry=0x5d46963bc4b8)
    at /t/neon//vendor/postgres-v17/src/backend/access/index/genam.c:511
#20 0x00005d467949efff in SearchCatCacheMiss (cache=cache@entry=0x5d469638e300, nkeys=nkeys@entry=1, 
--Type <RET> for more, q to quit, c to continue without paging--
    hashValue=hashValue@entry=3028463561, hashIndex=hashIndex@entry=1, v1=v1@entry=102557749207840, v2=v2@entry=0, v3=0, v4=0)
    at /t/neon//vendor/postgres-v17/src/backend/utils/cache/catcache.c:1494
#21 0x00005d467949f2be in SearchCatCacheInternal (cache=0x5d469638e300, nkeys=1, v1=102557749207840, v1@entry=1, v2=v2@entry=0, 
    v3=v3@entry=0, v4=v4@entry=0) at /t/neon//vendor/postgres-v17/src/backend/utils/cache/catcache.c:1414
#22 0x00005d467949fb58 in SearchCatCache (cache=<optimized out>, v1=v1@entry=1, v2=v2@entry=0, v3=v3@entry=0, v4=v4@entry=0)
    at /t/neon//vendor/postgres-v17/src/backend/utils/cache/catcache.c:1268
#23 0x00005d46794b12c0 in SearchSysCache (key4=0, key3=0, key2=0, key1=1, cacheId=<optimized out>)
    at /t/neon//vendor/postgres-v17/src/backend/utils/cache/syscache.c:217
#24 GetSysCacheOid (cacheId=cacheId@entry=10, oidcol=oidcol@entry=1, key1=key1@entry=102557749207840, key2=key2@entry=0, 
    key3=key3@entry=0, key4=key4@entry=0) at /t/neon//vendor/postgres-v17/src/backend/utils/cache/syscache.c:455
#25 0x00005d46793a0409 in get_role_oid (rolname=0x5d469635e320 "cloud_admin", missing_ok=missing_ok@entry=true)
    at /t/neon//vendor/postgres-v17/src/backend/utils/adt/acl.c:5424
#26 0x00005d46792239e9 in check_hba (port=<optimized out>) at /t/neon//vendor/postgres-v17/src/backend/libpq/hba.c:2476
#27 hba_getauthmethod (port=port@entry=0x5d46962b2ef0) at /t/neon//vendor/postgres-v17/src/backend/libpq/hba.c:3050
#28 0x00005d467921a78a in ClientAuthentication (port=port@entry=0x5d46962b2ef0)
    at /t/neon//vendor/postgres-v17/src/backend/libpq/auth.c:393
#29 0x00005d46794c7262 in PerformAuthentication (port=<optimized out>)
    at /t/neon//vendor/postgres-v17/src/backend/utils/init/postinit.c:247
#30 InitPostgres (in_dbname=in_dbname@entry=0x5d469635e338 "postgres", dboid=dboid@entry=0, 
    username=username@entry=0x5d469635e320 "cloud_admin", useroid=useroid@entry=0, flags=1, out_dbname=out_dbname@entry=0x0)
    at /t/neon//vendor/postgres-v17/src/backend/utils/init/postinit.c:941
#31 0x00005d467937d66c in PostgresMain (dbname=0x5d469635e338 "postgres", username=0x5d469635e320 "cloud_admin")
    at /t/neon//vendor/postgres-v17/src/backend/tcop/postgres.c:4334
#32 0x00005d4679379765 in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>)
    at /t/neon//vendor/postgres-v17/src/backend/tcop/backend_startup.c:105
#33 0x00005d46792e9c1f in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, 
    startup_data=startup_data@entry=0x7ffc91fb4fc0 "", startup_data_len=startup_data_len@entry=4, 
    client_sock=client_sock@entry=0x7ffc91fb4fe0) at /t/neon//vendor/postgres-v17/src/backend/postmaster/launch_backend.c:277
#34 0x00005d46792ed709 in BackendStartup (client_sock=0x7ffc91fb4fe0)
    at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:3594
#35 ServerLoop () at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1674
#36 0x00005d46792ef4b1 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x5d46962b1cc0)
    at /t/neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1372
#37 0x00005d467902b869 in main (argc=3, argv=0x5d46962b1cc0) at /t/neon//vendor/postgres-v17/src/backend/main/main.c:237

@MMeent
Copy link
Contributor

MMeent commented Jan 8, 2025

Yes, those are waiting for Pageserver to respond. If you can figure out why Pageserver doesn't respond fast enough then that'd be nice, but based on the info provided here this doesn't seem like it is a Compute issue.

@alexanderlaw
Copy link
Author

Yeah, with debug logging added, log_min_messages = DEBUG3, and:

-#define LOG_INTERVAL_MS                INT64CONST(10 * 1000)
+#define LOG_INTERVAL_MS                INT64CONST(1 * 1000)

I can see the following for a slow-starting backend:

PG:2025-01-08 20:17:13.478 GMT|[unknown]|[unknown]|677edd49.2bfd9a|LOG:  connection received: host=127.0.0.1 port=50250
PG:2025-01-08 20:17:13.479 GMT|cloud_admin|postgres|677edd49.2bfd9a|DEBUG:  InitPostgres
PG:2025-01-08 20:17:13.479 GMT|cloud_admin|postgres|677edd49.2bfd9a|DEBUG:  [NEON_SMGR] neon_get_request_lsns GetLastWrittenLSN lsn 0/A419218
PG:2025-01-08 20:17:13.480 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:64000' with protocol version 2
PG:2025-01-08 20:17:13.480 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!neon_read_at_lsnv| neon_prefetch_response_usable, ring_index: 0
PG:2025-01-08 20:17:13.480 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!neon_read_at_lsnv| before prefetch_wait_for(0)
PG:2025-01-08 20:17:13.480 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:14.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 1.001 s, still waiting (sent 1 requests, received 0 responses)
PG:2025-01-08 20:17:14.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:14.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
...
~270 such messages
...
PG:2025-01-08 20:17:15.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:15.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 2.001 s, still waiting (sent 1 requests, received 0 responses)
...
~another 270
...
PG:2025-01-08 20:17:16.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:16.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 3.001 s, still waiting (sent 1 requests, received 0 responses)
PG:2025-01-08 20:17:16.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
...
PG:2025-01-08 20:17:17.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 4.001 s, still waiting (sent 1 requests, received 0 responses)
PG:2025-01-08 20:17:17.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:17.481 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:17.548 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 8193
PG:2025-01-08 20:17:17.548 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] received response from pageserver after 4.067 s
...
PG:2025-01-08 20:17:17.548 GMT|cloud_admin|postgres|677edd49.2bfd9a|DEBUG:  [NEON_SMGR] neon_get_request_lsns GetLastWrittenLSN lsn 0/B2E5AE8
PG:2025-01-08 20:17:17.548 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:18.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 1.001 s, still waiting (sent 2 requests, received 1 responses)
PG:2025-01-08 20:17:18.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:18.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:19.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 2.001 s, still waiting (sent 2 requests, received 1 responses)
PG:2025-01-08 20:17:19.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:19.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:20.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 3.001 s, still waiting (sent 2 requests, received 1 responses)
PG:2025-01-08 20:17:20.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:20.549 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:21.485 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 8193
PG:2025-01-08 20:17:21.485 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] received response from pageserver after 3.937 s
...
PG:2025-01-08 20:17:21.485 GMT|cloud_admin|postgres|677edd49.2bfd9a|DEBUG:  [NEON_SMGR] neon_get_request_lsns GetLastWrittenLSN lsn 0/867ED08
PG:2025-01-08 20:17:21.485 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:21.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 8193
PG:2025-01-08 20:17:21.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  connection authenticated: user="cloud_admin" method=trust (/t/neon/.neon/endpoints/main/pgdata/pg_hba.conf:91)
PG:2025-01-08 20:17:21.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  connection authorized: user=cloud_admin database=postgres application_name=psql
PG:2025-01-08 20:17:21.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|DEBUG:  [NEON_SMGR] neon_get_request_lsns GetLastWrittenLSN lsn 0/C1AE030
PG:2025-01-08 20:17:21.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:22.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 1.000 s, still waiting (sent 4 requests, received 3 responses)
...
PG:2025-01-08 20:17:23.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:23.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 2.000 s, still waiting (sent 4 requests, received 3 responses)
PG:2025-01-08 20:17:23.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
...
PG:2025-01-08 20:17:25.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:25.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:25.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] no response received from pageserver for 4.000 s, still waiting (sent 4 requests, received 3 responses)
PG:2025-01-08 20:17:25.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:25.487 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 0
PG:2025-01-08 20:17:25.556 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  !!!call_PQgetCopyData| PQgetCopyData: 8193
PG:2025-01-08 20:17:25.556 GMT|cloud_admin|postgres|677edd49.2bfd9a|LOG:  [NEON_SMGR] [shard 0] received response from pageserver after 4.069 s

So reading each block by this backend takes about 4 seconds...

@knizhnik
Copy link
Contributor

knizhnik commented Jan 9, 2025

So looks like the problem is at PS side. Unfortunately I still not able to reproduce it locally. Can you send me please page server log? May be there is some useful information explaining lag of PS...

My current hypothesis is the following:

  • PS is not able to catchup with compute which produce a lot of WAL by inserting larger records in the table without indexes and small fill factor.
  • When lag with PS exceeds 15Mb, throttling is activated.
  • Looks like this ~4 seconds delay is time needed by PS to apply 15Mb of WAL.

4Mb of WAL per second seems to be too small. Also it is strange to me that this time is so stable.

Are you sure that you are using release binaries of Neon? I never started Neon locally using cargo and use NEON_BIN environment variable to specify path to the binaries.

@knizhnik
Copy link
Contributor

knizhnik commented Jan 9, 2025

This is CPU usage at my system:

9235  postgres     99.1  13:56.31 1/1    0    15    16M    0B     0B     29235 29208 running  *0[1]             0.00000 0.00000    501  1536      58
29152  pageserver   32.3  06:00.21 287    1    338   2600M+ 0B     619M-  29131 1     stuck    *0[1]             0.00000 0.00000    501  2648199+  172

As you see, backend consumes 100% CPU and page server - just 30% (but it may be more bounded by disk IO).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

5 participants