CircleCI scheduled nightly pa11y scan
We perform automatic accessibility testing with pa11y in two ways:
- a full scan of the entire site, run nightly
- smaller targeted scans for each push, build, or pull request (PR), based on what files were changed
Targeted scans take less than 5 minutes, and the full nightly scan continues to check accessibility site-wide. Previously, full scans on every build would take over 25 minutes on CircleCI.
An 18F team member identified that pa11y runs were too slow, and created issue #3752.
We investigated and determined that pa11y runs took about 25 minutes, because every build and PR would check all files, and there were over a thousand URLs being checked. This length of time impeded contributions and updates to the site.
For each push, build, and pull request (PR), we scan only the files that need to be scanned.
Here's a rough sketch of the algorithm:
- If a post or page has changed, it should be scanned
- If a post or page's layout has changed, the post or page should be scanned
- If files that affect many pages are changed, such as layouts or partials, collect a random sample of files across the entire site to scan: 3 from every collection, plus all the pages that live in the `_site/` root folder.
The targeted scan writes the list of files to be scanned to a file, `PA11Y_TARGETS`. The CI job uses this file to focus the build / PR scan and keep scanning times down.
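The rules above can be sketched roughly as follows. This is only an illustration in Python, not the project's actual Jekyll plugin: the function name, the `SITE_WIDE_PREFIXES` list, and the data shapes are all assumptions. It also assumes that a change to a page's layout setting shows up as a change to the page file itself, so the second rule is covered by the first.

```python
import random

# Hypothetical prefixes for files that affect many pages at once.
SITE_WIDE_PREFIXES = ("_layouts/", "_includes/", "_sass/", "assets/")


def select_targets(changed, collections, root_pages, sample_size=3, rng=None):
    """Return the sorted list of paths a targeted scan should check.

    changed     -- paths that git reports as changed
    collections -- mapping of collection name to the list of pages in it
    root_pages  -- pages that live directly in the _site/ root folder
    """
    rng = rng or random.Random()
    all_pages = {p for pages in collections.values() for p in pages}
    all_pages |= set(root_pages)

    # A changed post or page is always scanned.
    targets = {path for path in changed if path in all_pages}

    # A site-wide change triggers a random sample of every collection
    # (fewer than sample_size if the collection is small) plus all root pages.
    if any(path.startswith(SITE_WIDE_PREFIXES) for path in changed):
        for pages in collections.values():
            count = min(sample_size, len(pages))
            targets.update(rng.sample(sorted(pages), count))
        targets.update(root_pages)

    return sorted(targets)
```

Passing a seeded `random.Random` as `rng` makes the sample repeatable, which can help when debugging why a particular page was or was not scanned.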
Note: This is a naive algorithm, but it's good enough, especially with the nightly scan as backup. Ideally, if a partial or layout were changed, we'd scan only the pages that use the changed layouts or partials, and perhaps only a sample of those.
We run a full scan nightly, around 5am Eastern Time. The scan creates a GitHub issue in the repository if there are any errors.
Tip
Read this if you're a developer or trying to understand the details of how this scanning strategy is implemented.
We use Jekyll hooks (`:documents` and `:pages`) to determine which files have changed according to git. Changes to files within `assets/`, `_includes`, and `_sass` cause the plugin to sample 3 files (or fewer if there aren't 3 files to scan) from the blog and from each collection listed in the Jekyll config, plus 3 of the blog archive pages. Once the plugin determines which files have changed and should be scanned, it writes them to a file named `pa11y_targets`, which is used in the CI environment to tell pa11y which files to scan.
Once the Jekyll build completes, the CircleCI config runs a shell script that checks for the existence of `pa11y_targets`. If the file is not found, the pa11y scan is skipped. If `pa11y_targets` exists, its contents are base64 encoded and stored in an environment variable (`$PA11Y_TARGETS`) via CircleCI's `$BASH_ENV`, which lets stateful information persist between job steps that are otherwise run "fresh". When it's time to run the pa11y scan, we base64 decode the contents of `$PA11Y_TARGETS` and pass that file list to pa11y. This is where the reduction in pa11y scan times on most pull requests comes from.
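The encode and decode round trip can be illustrated with a short sketch. The real steps are shell commands in the CircleCI config; the Python below only mirrors the idea. The function names are hypothetical, and it assumes the targets file lists one path per line, which the text above does not specify.

```python
import base64
from pathlib import Path


def export_targets(targets_file, bash_env_file):
    """Append an export line for PA11Y_TARGETS to the env file.

    Returns False when the targets file is missing, meaning the scan is skipped.
    """
    path = Path(targets_file)
    if not path.exists():
        return False
    # Base64 keeps newlines and spaces in the file list from breaking the
    # single-line `export NAME=value` form.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    with open(bash_env_file, "a", encoding="utf-8") as env:
        env.write(f"export PA11Y_TARGETS={encoded}\n")
    return True


def decode_targets(encoded):
    """Turn the encoded variable back into a list of paths (one per line assumed)."""
    text = base64.b64decode(encoded).decode("utf-8")
    return [line for line in text.splitlines() if line.strip()]
```

A later job step would read the variable from its environment and call `decode_targets` on its value to recover the file list.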
A full pa11y scan is performed against `main` every morning at 4:58am ET. The CircleCI config reads `pipeline.trigger_source` to decide whether to do a full scan. If `pipeline.trigger_source` is `schedule`, a full pa11y scan using the sitemap is run, which takes around 25 minutes as of June 2024. During the full scan, pa11y writes the result for each URL to stdout and writes any errors to stderr. A `tee` command duplicates the stderr output into a file named `pa11y-errors` for sending to GitHub. pa11y exits with a status code of 0 when there are no errors and 2 when there are errors; that status code determines whether a call is made to the GitHub API to report the errors.
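As a rough illustration of the stderr duplication and exit-code check, here is a Python sketch. The actual job uses `tee` in a shell step; `run_scan` and `should_report` are hypothetical names, and the command to run is supplied by the caller rather than assumed here.

```python
import subprocess
import sys


def run_scan(command, errors_file="pa11y-errors"):
    """Run a scan command, copy its stderr into errors_file, return the exit code."""
    # stdout is inherited, so per-URL progress still appears in the job log.
    proc = subprocess.Popen(command, stderr=subprocess.PIPE, text=True)
    with open(errors_file, "w", encoding="utf-8") as log:
        for line in proc.stderr:
            sys.stderr.write(line)  # keep errors visible in the job log
            log.write(line)         # duplicate them for the GitHub report
    return proc.wait()


def should_report(exit_code):
    """Per the description above: 0 means no errors, 2 means errors were found."""
    return exit_code == 2
```

Keeping the exit-code check separate from the run makes it easy to skip the GitHub call on a clean scan while still leaving an empty `pa11y-errors` file behind.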
At the end of the nightly pa11y scan, if errors are detected, a POST is made using cURL to https://api.github.com/repos/18f/18f.gsa.gov/dispatches. The POST sends a JSON body, built with the `jq` utility, that includes the base64 encoded contents of the `pa11y-errors` file in the `client_payload` key. This POST triggers a GitHub action that creates a new issue.
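The shape of that request can be sketched as below. The project builds it with cURL and `jq`; this Python version is a stand-in that only constructs the request and does not send it. The `event_type` value and the `errors` key inside `client_payload` are hypothetical, since the text above does not name them.

```python
import base64
import json
import urllib.request

DISPATCH_URL = "https://api.github.com/repos/18f/18f.gsa.gov/dispatches"


def build_dispatch_request(errors_text, token, event_type="pa11y-errors"):
    """Build (but do not send) the POST that triggers the issue-creating action."""
    encoded = base64.b64encode(errors_text.encode("utf-8")).decode("ascii")
    body = {
        "event_type": event_type,               # hypothetical event name
        "client_payload": {"errors": encoded},  # hypothetical key name
    }
    return urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Sending it would be a matter of passing the returned object to `urllib.request.urlopen`, with the token read from the `GITHUB_TOKEN` environment variable.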
The authentication token used to call GitHub is stored as a CircleCI environment variable named `GITHUB_TOKEN`. This token is a fine-grained personal access token created by Caley Woods (caley.woods@gsa.gov) that has `contents:write` access granted just to the `18f/18f.gsa.gov` repository. The resource owner is the 18F organization, but the token has no permissions on the 18F GitHub organization.
Reach out to (insert principal engineer who owns the token here, temporarily it's Caley Woods caley.woods@gsa.gov) as well as the GitHub admins in `#admins-github` on Slack to have the token revoked. If an incident has taken place where the token was used maliciously, follow the security incident portion of the TTS handbook.
The token is generated with the maximum lifetime of one year. To create a new token, visit the personal access tokens area of your GitHub settings and click "Generate new token". Select "Custom" from the Expiration field, then use the date selector to push the date out one year into the future. Under "Resource owner", select 18F and write a brief justification describing what this token does and why it's needed. The token has to be approved by the GitHub admins before it can be used. Under "Repository access", select "Only select repositories" and pick `18F/18f.gsa.gov` from the dropdown as the repo. Under the "Permissions" section, click "Repository permission" to expand the section, scroll down to find "Contents", and set its access level to Read and write. Write access to Contents is required; without it, the GitHub API returns an error saying that the token does not have access to the repo. Scroll down to the "Overview" area at the bottom of the page and verify that your token has read and write access to Contents as well as read access to Metadata, which GitHub applies automatically. The token should have zero organization permissions.
After the token is created, copy its value from the GitHub UI and replace the `GITHUB_TOKEN` environment variable for the 18f.gsa.gov project within CircleCI. As long as the personal access token has been approved, no other changes are required. If you're unsure whether the token is approved, check the tokens list: a token that is still awaiting approval displays a "Pending" badge.
Note: GitHub does allow you to regenerate a personal access token and extend its duration, but in testing, a "bad credentials" error was encountered after the token had expired and was regenerated. For this reason, it's recommended to create a new personal access token and update CircleCI with the new token when the old token is approaching its expiration date.
The new issue GitHub action runs on the `repository_dispatch` event, which is triggered by the GitHub API call to the `/dispatches` endpoint described above. The action receives the base64 encoded error output from pa11y, decodes it, and uses it to create a new GitHub issue with the pa11y error output in the issue body.
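The receiving side can be pictured with a small sketch. This is not the action's real code; the `errors` key and the issue title are assumptions made for illustration, and only the decode-then-fill-the-body step comes from the description above.

```python
import base64


def issue_from_payload(client_payload, key="errors"):
    """Decode the pa11y output from client_payload and shape it as an issue."""
    errors = base64.b64decode(client_payload[key]).decode("utf-8")
    return {
        "title": "pa11y nightly scan found errors",  # hypothetical title
        "body": errors,
    }
```

The returned dictionary uses the `title` and `body` fields that GitHub's create-issue endpoint accepts.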
The other value of `pipeline.trigger_source` is `webhook`. These are events pushed to CircleCI from GitHub when commits are made to a branch with a pull request; this trigger causes the Jekyll changed-files logic to be followed.
All content is in the public domain and released CC0 where appropriate per 18F's open source policy.