What problem would you like to solve? Please describe:
Currently, the schedule manager polls all schedules every 5 seconds, pulling configuration for every schedule (including any active shifts and overrides) in a single loop. This “all-or-nothing” approach can cause delays or failures for every schedule if one update encounters an issue. As GoAlert deployments grow larger, continuously reading and updating all schedules in one big transaction is increasingly inefficient and prone to blocking issues.
Describe the solution you’d like:
Job Queue Integration: Migrate the schedule manager to a fine-grained job queue (the River Queue).
Event-Driven Updates: When a schedule configuration changes—such as a new rule, an override, or a temporary schedule—the system enqueues a job for that specific schedule.
Future Tasks: Each update job can schedule the next update if there’s a known upcoming change (e.g., a rule scheduled to start/end at a specific time).
Fallback for Missed Updates: Include a mechanism to detect and recover from missed or untracked changes (e.g., DB restores, crashes, older GoAlert versions), ensuring schedules don’t become “stuck” with incorrect on-call data.
Selective & Isolated Transactions: Update only the schedules that need updating, rather than scanning and processing all schedules in one loop. Each schedule update runs in its own transaction, preventing a single failing update from blocking all others.
Scalability & Resilience: Avoid the heavy cost of pulling all schedule data every 5 seconds. This improves performance and allows multiple engine instances to run without duplicating or conflicting work.
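As a rough illustration of the "Future Tasks" idea above, the sketch below picks the time at which a follow-up job for a single schedule could run. It uses only the Go standard library and makes no calls to the River API. The names `nextUpdate` and `at` are hypothetical helpers made up for this example, and the list of change times is assumed to have been collected already from the schedule's rules, overrides, and temporary schedules.

```go
package main

import (
	"fmt"
	"time"
)

// nextUpdate returns the earliest known change time strictly after now.
// ok is false when no future change is known, in which case the current
// job would not need to schedule a follow-up.
// (Hypothetical helper, not part of GoAlert or River.)
func nextUpdate(now time.Time, changes ...time.Time) (next time.Time, ok bool) {
	for _, c := range changes {
		if !c.After(now) {
			continue // already in effect; covered by the current update
		}
		if !ok || c.Before(next) {
			next, ok = c, true
		}
	}
	return next, ok
}

// at builds an example timestamp on an arbitrary day (illustration only).
func at(hour, minute int) time.Time {
	return time.Date(2025, time.January, 6, hour, minute, 0, 0, time.UTC)
}

func main() {
	now := at(9, 0)

	// Assumed change times: one already past, two still upcoming.
	next, ok := nextUpdate(now, at(8, 0), at(17, 0), at(12, 30))
	if ok {
		fmt.Println("enqueue follow-up job for", next.Format(time.RFC3339))
	} else {
		fmt.Println("no upcoming change; no follow-up job needed")
	}
}
```

In a real integration, the returned time would presumably be handed to the job queue as the run time of the next job for that schedule. How that is wired up is left open here.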
Describe alternatives you’ve considered:
Maintain Interval-Based Processing: Tuning intervals or batching schedules still requires scanning all schedules every cycle, which can scale poorly and cause blocking issues.
Partial Batching: Process subsets of schedules each tick. While this reduces transaction size, it still relies on a common loop and can lead to contention or missed updates if any batch fails.
Hybrid Event + Periodic Sweeps: Partially adopt event-driven jobs while keeping a periodic sweep to catch missed updates. This adds complexity and duplicates logic that can be handled consistently by the job queue’s built-in mechanisms.
Additional context:
Schedule Manager Responsibilities:
Determines the current on-call users by applying schedule rules, overrides, and temporary schedules.
Updates schedule_on_call_users by inserting a row (start_time = now()) when a user becomes on-call and setting end_time = now() on that row when they go off-call.
Triggers notifications in two ways:
On-Change: When there is a change in on-call assignment since the last check.
At-Specific-Time: When a rule or override is scheduled to trigger at a specific time.
Relation to Rotations: The schedule manager reuses rotation data (rotation_state) to determine who is on-call. Rotations are being migrated to the new job queue first, and schedules will follow.
Multiple Engine Instances:
Currently, each instance loops and processes all schedules independently, leading to redundant work and potential contention.
The job queue approach ensures only one instance executes a given job at a time, making multiple engine instances more practical and performant.
Value: By moving to an event-driven model and isolating updates, GoAlert can handle large-scale scheduling needs more reliably—reducing the risk of blocking updates, improving resource usage, and delivering a more consistent on-call experience.
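The on-call bookkeeping described under "Schedule Manager Responsibilities" comes down to comparing the previous and current sets of on-call users for one schedule. The sketch below shows that comparison in plain Go, under the assumption that users are identified by string IDs. The function name `diffOnCall` is hypothetical and is not taken from the GoAlert codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// diffOnCall compares the previous and current on-call user IDs for one
// schedule. started lists users who just became on-call (a new row with
// start_time = now() would be inserted for each), and ended lists users
// who just went off-call (end_time = now() would be set for each).
// (Hypothetical helper for illustration only.)
func diffOnCall(prev, cur []string) (started, ended []string) {
	prevSet := make(map[string]struct{}, len(prev))
	for _, id := range prev {
		prevSet[id] = struct{}{}
	}
	curSet := make(map[string]struct{}, len(cur))
	for _, id := range cur {
		curSet[id] = struct{}{}
	}

	for id := range curSet {
		if _, ok := prevSet[id]; !ok {
			started = append(started, id)
		}
	}
	for id := range prevSet {
		if _, ok := curSet[id]; !ok {
			ended = append(ended, id)
		}
	}

	// Sort so the result does not depend on map iteration order.
	sort.Strings(started)
	sort.Strings(ended)
	return started, ended
}

func main() {
	started, ended := diffOnCall(
		[]string{"alice", "bob"},
		[]string{"bob", "carol"},
	)
	fmt.Println("became on-call:", started)
	fmt.Println("went off-call:", ended)

	// An On-Change notification would only be needed when something changed.
	if len(started) > 0 || len(ended) > 0 {
		fmt.Println("on-call assignment changed; notify")
	}
}
```

Because the comparison only needs the data for a single schedule, it fits naturally inside a per-schedule job running in its own transaction.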