Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart scheduler on error #7527

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mkindahl
Copy link
Contributor

@mkindahl mkindahl commented Dec 11, 2024

If the scheduler receives an error, it will never restart again since bgw_restart_time is set to BGW_NEVER_RESTART, which will prevent all jobs from executing.

This commit adds the GUC timescaledb.bgw_scheduler_restart_time that can be set to the restart time for the scheduler. It defaults to 60 seconds, which is the default restart interval for background workers PostgreSQL defines.

It also adds timescaledb.debug_bgw_scheduler_exit_status to be able to shutdown the scheduler with a non-zero exit status, which allows the restart functionality to be tested.

It also ensures that backend_type is explicitly set up rather than copied from application_name and add some more information to application_name. It also updates the tests to use backend_type where applicable.

Disable-check: loader-change

@mkindahl mkindahl force-pushed the fix-scheduler-restart branch 2 times, most recently from 0e54d38 to 33521cd Compare December 11, 2024 13:42
@mkindahl
Copy link
Contributor Author

mkindahl commented Dec 11, 2024

Compared to #6195 this adds code to the start of the launcher that handles the status for any schedulers that existed prior to the restart by killing them. Reproducing it in a test is difficult because of how the schedulers are used in tests, but this short script demonstrates the issue:

create view tsdb_bgw as
       select datname, pid, backend_type, application_name from pg_stat_activity
       where backend_type like '%TimescaleDB%';

alter system set timescaledb.debug_bgw_scheduler_exit_status to 1;
alter system set timescaledb.bgw_scheduler_restart_time to 10;
select pg_reload_conf();

-- Kill the schedulers
select pid, pg_terminate_backend(pid) from pg_stat_activity where backend_type like '%Scheduler%';
select pg_sleep(1);
select datname, backend_type, count(*) from tsdb_bgw group by datname, backend_type;

-- Wait for the schedulers to restart.
select pg_sleep(10);
select datname, backend_type, count(*) from tsdb_bgw group by datname, backend_type;

-- Kill the launcher, it should restart immediately.
select pid, pg_terminate_backend(pid) from pg_stat_activity where backend_type like '%Launcher%';
select pg_sleep(1);

-- This should only show one scheduler for each database.
select datname, backend_type, count(*) from tsdb_bgw group by datname, backend_type;

@mkindahl mkindahl force-pushed the fix-scheduler-restart branch 3 times, most recently from e8e94bd to cbd7583 Compare December 12, 2024 12:38
Copy link

codecov bot commented Dec 12, 2024

Codecov Report

Attention: Patch coverage is 72.72727% with 9 lines in your changes missing coverage. Please review.

Project coverage is 82.30%. Comparing base (59f50f2) to head (48cdcb9).
Report is 687 commits behind head on main.

Files with missing lines Patch % Lines
src/loader/bgw_launcher.c 67.85% 5 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7527      +/-   ##
==========================================
+ Coverage   80.06%   82.30%   +2.24%     
==========================================
  Files         190      238      +48     
  Lines       37181    43746    +6565     
  Branches     9450    10980    +1530     
==========================================
+ Hits        29770    36007    +6237     
- Misses       2997     3406     +409     
+ Partials     4414     4333      -81     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mkindahl mkindahl force-pushed the fix-scheduler-restart branch 5 times, most recently from 42187ae to c167a6b Compare December 17, 2024 12:21
@mkindahl mkindahl marked this pull request as ready for review December 17, 2024 12:30
@mkindahl mkindahl self-assigned this Dec 17, 2024
@mkindahl mkindahl added the bgw The background worker subsystem, including the scheduler label Dec 17, 2024
@mkindahl mkindahl linked an issue Dec 18, 2024 that may be closed by this pull request
@mkindahl mkindahl force-pushed the fix-scheduler-restart branch 7 times, most recently from 39e827c to bb093c2 Compare December 19, 2024 14:30
src/guc.c Show resolved Hide resolved
src/guc.c Show resolved Hide resolved
src/loader/bgw_launcher.c Show resolved Hide resolved
tsl/test/expected/bgw_scheduler_restart.out Outdated Show resolved Hide resolved
@mkindahl mkindahl force-pushed the fix-scheduler-restart branch 3 times, most recently from 8135f6c to bf26586 Compare December 21, 2024 10:12
@mkindahl mkindahl force-pushed the fix-scheduler-restart branch 3 times, most recently from c4a2650 to 7dadb19 Compare December 21, 2024 14:05
@mkindahl mkindahl force-pushed the fix-scheduler-restart branch 2 times, most recently from 858040d to 990dc77 Compare January 7, 2025 13:51
@mkindahl mkindahl force-pushed the fix-scheduler-restart branch 2 times, most recently from 4e74f91 to e66c69b Compare January 8, 2025 11:26
If the scheduler receives an error, it will never restart again since
`bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent
all jobs from executing.

This commit adds the GUC `timescaledb.bgw_scheduler_restart_time` that
can be set to the restart time for the scheduler. It defaults
to 60 seconds, which is the default restart interval for background
workers defined by PostgreSQL.

It also adds `timescaledb.debug_bgw_scheduler_exit_status` to be able
to shutdown the scheduler with a non-zero exit status, which allows the
restart functionality to be tested.

It also ensures that `backend_type` is explicitly set up rather than
copied from `application_name` and add some more information to
`application_name`. It also updates the tests to use `backend_type`
where applicable.

To avoid exhausting slots when the launcher restarts, it will kill all
existing schedulers and start new ones. Since background worker slots
are easily exhausted in the `bgw_launcher` test, we do not run it
repeatedly in the flakes workflow.
@mkindahl mkindahl force-pushed the fix-scheduler-restart branch from e66c69b to 48cdcb9 Compare January 9, 2025 08:09
@mkindahl mkindahl enabled auto-merge (rebase) January 9, 2025 14:04
@mkindahl mkindahl added this to the TimescaleDB 2.18.0 milestone Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bgw The background worker subsystem, including the scheduler
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Automatically restart background worker scheduler
4 participants