This document describes the workflow for validating external resources (link checking) and integrating with the Internet Archive's Wayback Machine.
SECTIONS
- Overview
- Enabling Tasks
- Wayback Machine Integration
- Wayback Machine Removal Requests
- Frequency Control
- Rate Limiting
- Task Priority
- Management Commands
- Code References
This assumes that Celery beat scheduler is installed and enabled, which is required for the task scheduling.
The frequency for external resource checking is set to once per week (1/week
), as defined by the CHECK_EXTERNAL_RESOURCE_STATUS_FREQUENCY
variable. All external resources, both new and existing, are validated weekly, regardless of their last status.
The Wayback Machine integration is enabled by default. Valid external resources are submitted for archiving as part of the external resource validation task. The system monitors the submission status and avoids resubmitting resources within a specified interval of 30
days, as defined by the WAYBACK_SUBMISSION_INTERVAL_DAYS
variable.
The status of Wayback Machine archiving jobs is updated every 6
hours to track the progress of pending jobs, as defined by the UPDATE_WAYBACK_JOBS_STATUS_FREQUENCY
variable.
These variables can be found in settings.py.
If a resource is valid, the check_external_resources
task triggers the submit_url_to_wayback_task
to archive it.
Here’s a high-level description of the process:
- Task is automatically added in scheduler on system start.
- On execution, all available external resources are retrieved from DB.
- Gathered data is divided into preconfigured batch sizes, defined by
BATCH_SIZE_EXTERNAL_RESOURCE_STATUS_CHECK
in websites/constants.py, which is currently set to100
. - All batches are grouped into a single Celery task and executed.
- Each batch-task iterates over the batch to validate availability of each resource and its backup resource if available.
- The status of resource is then added to DB.
- Batch tasks have a preconfigured rate limiter and lower priority by default.
- When external resource validation occurs, valid external resources are submitted to the Wayback Machine for archiving.
- The status of Wayback Machine archiving jobs is tracked and updated periodically (currently, every 6 hours).
- New external resources are automatically submitted to the Wayback Machine upon creation.
The task for external resource checking can be enabled/disabled using the ENABLE_CHECK_EXTERNAL_RESOURCE_TASK
defined in settings.py. By default, this task is enabled.
Wayback Machine tasks are governed by PostHog feature flags, with the feature flag key OCW_STUDIO_WAYBACK_MACHINE_TASKS
defined in ENABLE_WAYBACK_TASKS
in constants.py. In addition to this, the ENABLE_WAYBACK_TASKS
setting in settings.py is responsible for adding the update_wayback_jobs_status_batch
task to the Celery beat schedule.
Note: Changes to PostHog feature flags take effect immediately, but modifications to Celery beat schedules still require a restart.
Restart Instructions:
- Local Development: Restart the Celery container (
ocw-studio-celery-1
). - RC/Production (Heroku):
worker
andextra_worker
are the relevant dynos. You can simply click on "Restart all dynos" from the Heroku dashboard.
When Wayback Machine tasks are enabled, the system performs the following actions:
- Submit Valid URLs:
- After an external resource URL is validated and found to be valid, it is submitted to the Wayback Machine for archiving.
- The submission is handled by the
submit_url_to_wayback_task
task.
- Automatically Submit on Creation:
- When a new external resource is created, it is automatically submitted to the Wayback Machine via the Django signal in signals.py.
- Track Job Statuses:
- The status of Wayback Machine archiving jobs is tracked using the
wayback_status
field in theExternalResourceState
model. - The system periodically updates the status of pending jobs using the
update_wayback_jobs_status_batch
task at an interval set byUPDATE_WAYBACK_JOBS_STATUS_FREQUENCY
in settings.py. Currently, it is set to6
hours (21600 seconds).
- The status of Wayback Machine archiving jobs is tracked using the
- Control Resubmissions:
- The system avoids resubmitting the same URL to the Wayback Machine within a specified interval, defined by
WAYBACK_SUBMISSION_INTERVAL_DAYS
in settings.py. Currently, it is set to30
days. - This interval can be overridden using the
submit_sites_to_wayback
management command with--force
flag (as detailed below).
- The system avoids resubmitting the same URL to the Wayback Machine within a specified interval, defined by
To request removal of a URL from the Wayback Machine, email info@archive.org with the details mentioned in this guide.
-
External Resource Checking:
- The frequency of the external resource checking task is set using
CHECK_EXTERNAL_RESOURCE_STATUS_FREQUENCY
in settings.py. - The default value is 604800 seconds (1 week).
- The task checks all external resources for availability regardless of their last status.
- The frequency of the external resource checking task is set using
-
Wayback Machine Status Updates:
- The frequency of updating Wayback Machine job statuses in the task
update_wayback_jobs_status_batch
is controlled byUPDATE_WAYBACK_JOBS_STATUS_FREQUENCY
in settings.py. - The default value is 21600 seconds (6 hours).
- The task only checks external resources with wayback_status of
pending
. - The chunk size is provided by
BATCH_SIZE_WAYBACK_STATUS_UPDATE
in constants.py. Currently, it is set to50
.
- The frequency of updating Wayback Machine job statuses in the task
-
External Resource Checking:
- The rate limit for the external resource checking tasks is set using
EXTERNAL_RESOURCE_TASK_RATE_LIMIT
in constants.py. - The assigned rate limit value is
100/s
.
- The rate limit for the external resource checking tasks is set using
-
Wayback Machine Tasks:
- The rate limit for Wayback Machine submission tasks is set using
WAYBACK_MACHINE_TASK_RATE_LIMIT
in constants.py. - The assigned rate limit value is
0.11/s
, which means approximately1
request every9
seconds.
- The rate limit for Wayback Machine submission tasks is set using
-
External Resource Checking:
- Batch-task priority is set using the
EXTERNAL_RESOURCE_TASK_PRIORITY
in constants.py. - External resource tasks have the lowest priority (
4
) by default, out of range0(highest) - 4(lowest)
.
- Batch-task priority is set using the
-
Wayback Machine Tasks:
- The priority for Wayback Machine submission tasks is set using
WAYBACK_MACHINE_SUBMISSION_TASK_PRIORITY
in constants.py. - The assigned priority value is currently set to
3
.
- The priority for Wayback Machine submission tasks is set using
Note: Priority levels and Celery default task priority can be configured by PRIORITY_STEPS
and DEFAULT_PRIORITY
, respectively, in main/constants.py.
Two management commands are available to interact with the external resources' Wayback Machine functionality:
- Submitting Resources to Wayback Machine:
- Command:
submit_sites_to_wayback
. - Usage:
- Submits all external resources for specified websites to the Wayback Machine.
- Supports website filtering via options (e.g.,
--filter course-name
). - Use the
--force
flag to force submission even if resources were submitted recently (bypassingWAYBACK_SUBMISSION_INTERVAL_DAYS
logic).
- Example Usage:
- Command:
./manage.py submit_sites_to_wayback --filter "example-site"
./manage.py submit_sites_to_wayback --filter "example-site" --force
- Updating Wayback Machine Statuses:
- Command:
update_wayback_status
. - Usage:
- Updates the status of pending Wayback Machine jobs for specified websites.
- Supports website filtering via the options.
- Use the
--sync
flag to run updates synchronously; note that this requires specifying website filters.
- Example Usage:
- Command:
./manage.py update_wayback_status --filter "example-site" --sync
./manage.py update_wayback_status --filter "example-site"
./manage.py update_wayback_status
- Models:
ExternalResourceState
: Stores the state and Wayback Machine information for external resources.
- Tasks:
check_external_resources
: Checks external resources for broken links.submit_url_to_wayback_task
: Submits external resource URLs to the Wayback Machine. This task is linked withcheck_external_resources
and will only send valid external resources.update_wayback_jobs_status_batch
: Updates the status of Wayback Machine archiving jobs.
- API:
api.py
: Contains functions for checking URLs and interacting with the Wayback Machine API.
- Signals:
signals.py
: CreatesExternalResourceState
for the External Resource (WebsiteContent
), and submits the link to the Wayback Machine upon creation.
- Celery Configuration:
celery.py
: Configures the Celery app with task routing, priority, and other settings for handling external resource validation and Wayback Machine integration.
- Constants and Settings:
external_resources/constants.py
,websites/constants.py
: Defines constants such as batch sizes, task priorities, and rate limits for external resource checking and Wayback Machine tasks.- Settings in
main/settings.py
control tasks enabling (used intasks.py
), frequency, and other configurations.
- Deployment Configuration:
app.json
: Defines deployment-specific environment variables and metadata for Heroku, including placeholders for Wayback Machine API credentials, submission intervals, and app configuration details.