Migrate Schedule Manager to Use the New Job Queue System #4245

Open
mastercactapus opened this issue Jan 15, 2025 · 0 comments
Labels
enhancement, River

What problem would you like to solve? Please describe:
Currently, the schedule manager polls all schedules every 5 seconds, pulling configuration for every schedule (including any active shifts and overrides) in a single loop. This “all-or-nothing” approach can cause delays or failures for every schedule if one update encounters an issue. As GoAlert deployments grow larger, continuously reading and updating all schedules in one big transaction is increasingly inefficient and prone to blocking issues.

Describe the solution you’d like:

  • Job Queue Integration: Migrate the schedule manager to a fine-grained job queue (the River Queue).
    • Event-Driven Updates: When a schedule configuration changes—such as a new rule, an override, or a temporary schedule—the system enqueues a job for that specific schedule.
    • Future Tasks: Each update job can schedule the next update if there’s a known upcoming change (e.g., a rule scheduled to start/end at a specific time).
    • Fallback for Missed Updates: Include a mechanism to detect and recover from missed or untracked changes (e.g., DB restores, crashes, older GoAlert versions), ensuring schedules don’t become “stuck” with incorrect on-call data.
  • Selective & Isolated Transactions: Update only the schedules that need updating, rather than scanning and processing all schedules in one loop. Each schedule update runs in its own transaction, preventing a single failing update from blocking all others.
  • Scalability & Resilience: Avoid the heavy cost of pulling all schedule data every 5 seconds. This improves performance and allows multiple engine instances to run without duplicating or conflicting work.

Describe alternatives you’ve considered:

  1. Maintain Interval-Based Processing: Tuning intervals or batching schedules still requires scanning all schedules every cycle, which can scale poorly and cause blocking issues.
  2. Partial Batching: Process a subset of schedules each tick. This reduces transaction size, but it still relies on a common loop and can lead to contention or missed updates if any batch fails.
  3. Hybrid Event + Periodic Sweeps: Partially adopt event-driven jobs while keeping a periodic sweep to catch missed updates. This adds complexity and duplicates logic that can be handled consistently by the job queue’s built-in mechanisms.

Additional context:

  • Schedule Manager Responsibilities:
    • Determines the current on-call users by applying schedule rules, overrides, and temporary schedules.
    • Updates schedule_on_call_users by adding a row (start_time = now()) when a user becomes on-call and marking end_time = now() when they go off-call.
    • Triggers notifications in two ways:
      • On-Change: When there is a change in on-call assignment since the last check.
      • At-Specific-Time: When a rule or override is scheduled to trigger at a specific time.
  • Relation to Rotations: The schedule manager re-uses rotation data (rotation_state) to determine who is on-call. Rotations are being migrated first to the new job queue, and schedules will follow.
  • Multiple Engine Instances:
    • Currently, each instance loops and processes all schedules independently, leading to redundant work and potential contention.
    • The job queue approach ensures only one instance executes a given job at a time, making multiple engine instances more practical and performant.
  • Value: By moving to an event-driven model and isolating updates, GoAlert can handle large-scale scheduling needs more reliably—reducing the risk of blocking updates, improving resource usage, and delivering a more consistent on-call experience.
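
The on-call bookkeeping described under Schedule Manager Responsibilities amounts to a set difference between the users currently recorded as on-call and the users the rules say should be on-call. Below is a minimal sketch of that comparison, assuming user IDs are plain strings. `Diff` and `DiffOnCall` are hypothetical names for illustration, not GoAlert code, and the database writes are only indicated in comments.

```go
package main

import "sort"

// Diff lists the changes needed to bring the recorded on-call rows in line
// with the computed on-call set for a single schedule.
type Diff struct {
	Start []string // users who became on-call: add a row with start_time = now()
	End   []string // users who went off-call: set end_time = now()
}

// Changed reports whether the on-call assignment differs from the last
// check, which is the condition for an on-change notification.
func (d Diff) Changed() bool { return len(d.Start) > 0 || len(d.End) > 0 }

// DiffOnCall compares the users currently recorded as on-call with the
// users computed from rules, overrides, and temporary schedules.
func DiffOnCall(recorded, computed []string) Diff {
	rec := make(map[string]bool, len(recorded))
	for _, id := range recorded {
		rec[id] = true
	}

	comp := make(map[string]bool, len(computed))
	var d Diff
	for _, id := range computed {
		if comp[id] {
			continue // ignore duplicates in the computed set
		}
		comp[id] = true
		if !rec[id] {
			d.Start = append(d.Start, id)
		}
	}
	for id := range rec {
		if !comp[id] {
			d.End = append(d.End, id)
		}
	}

	// Sort for stable output; map iteration order is random in Go.
	sort.Strings(d.Start)
	sort.Strings(d.End)
	return d
}
```

For example, with `alice` and `bob` recorded and `bob` and `carol` computed, `Start` holds `carol`, `End` holds `alice`, and `Changed()` is true. Under the proposed design this comparison would run once per schedule job rather than for every schedule on every tick.
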
mastercactapus added the enhancement and River labels on Jan 15, 2025