Merge pull request #10 from cfpb/script-improvements
Shell usability improvements, plus logging
chosak authored Nov 2, 2020
2 parents 3da6511 + 381d252 commit af7512d
Showing 6 changed files with 109 additions and 33 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -1 +1,2 @@
 /CHANGELOG.md merge=union
+*.log linguist-generated=true
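Marking `*.log` as `linguist-generated` keeps the bulky crawl logs from cluttering rendered diffs and language statistics on GitHub. A quick way to confirm the attribute is picked up (a sketch; `wget.log` is just an example path):

```sh
git check-attr linguist-generated -- wget.log
# expected output: wget.log: linguist-generated: true
```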
19 changes: 9 additions & 10 deletions .github/workflows/crawl.yml
@@ -17,23 +17,22 @@ jobs:
       run: rm -r www.consumerfinance.gov
 
     - name: Run the crawl script
-      run: ./crawl.sh
-      continue-on-error: true
+      run: ./crawl.sh -d 4 https://www.consumerfinance.gov
 
     - name: Remove <script> tags and blank lines from crawl results
       run: ./transform_results.sh
       continue-on-error: true
 
-    - name: Prepare COMMIT_MESSAGE variable
+    - name: Prepare commit message
       run: ./generate_summary.sh
-      continue-on-error: true
 
     - name: Commit crawl results back to GitHub
-      uses: EndBug/add-and-commit@v5
-      with:
-        add: 'www.consumerfinance.gov'
-        author_name: Automated
-        author_email: actions@users.noreply.github.com
-        message: ${{ env.COMMIT_MESSAGE }}
+      run: |
+        gzip *.log
+        git add .
+        git config user.email actions@users.noreply.github.com
+        git config user.name Automated
+        git commit -F commit.txt
+        git push
       env:
         GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
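The `commit.txt` read by `git commit -F` is produced by `generate_summary.sh` further down. A rough local equivalent of the new commit step, assuming a crawl has already populated the working tree:

```sh
./generate_summary.sh      # writes commit.txt (see below)
gzip *.log                 # compress wget.log and rejected.log before committing
git add .                  # commit.txt itself is gitignored, so it stays unstaged
git commit -F commit.txt   # use the generated summary as the commit message
```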
5 changes: 5 additions & 0 deletions .gitignore
@@ -75,3 +75,8 @@ bower_components/
 .grunt/
 src/vendor/
 dist/
+
+# Project specific #
+####################
+commit.txt
+!*.log.gz
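Here `commit.txt` is a scratch file that should never be committed, while `!*.log.gz` makes sure the gzipped crawl logs stay trackable even if a broader rule would ignore them. One way to sanity-check the rules (a sketch using git's built-in matcher):

```sh
git check-ignore -v commit.txt    # prints the .gitignore rule that matches
git check-ignore -v wget.log.gz   # exits non-zero if the file is not ignored
```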
26 changes: 20 additions & 6 deletions README.md
@@ -28,17 +28,31 @@ To get a copy of the consumerfinance.gov archive or run a crawl on your computer
 
 To view the consumerfinance.gov archive, you can browse the history of this repo here on github.com, or clone this repository.
 
-To run a crawl from your computer, `cd` into the root of this project and use the following command: `./crawl.sh`.
-A full crawl usually takes over two hours.
-To modify the parameters of the crawl, such as the target domain or which pages to include, edit `crawl.sh`.
+To run a crawl on your computer, `cd` into the root of this project and use the following command:
+
+```sh
+./crawl.sh https://www.consumerfinance.gov
+```
+
+A full crawl can take several hours. To limit the crawl depth:
+
+```sh
+./crawl.sh -d 4 https://www.consumerfinance.gov
+```
+
+Or, to start the crawl at a specific URL:
+
+```sh
+./crawl.sh https://www.consumerfinance.gov/start/crawl/here/
+```
 
 ## Known issues
 
 The crawl has some constraints and limitations.
-- The results only contain pages that share the same domain: www.consumerfinance.gov
-- Pages may exist on consumerfinance.gov that are not linked to. If so, they will not appear in crawl results.
+- The results intentionally only contain pages that share the same domain.
+- The crawl will not include any pages that are not linked to from any other page reachable from the site root.
 - The crawl records each page based on its url.
-  If we accidentally record a page with url parameters, it counts that as a separate page, which could result in duplication
+  If we accidentally record a page with url parameters, it counts that as a separate page, which could result in duplication.
 - There are some pages on consumerfinance.gov that can only be found by paging through paginated lists of results.
   We try to configure the crawl to find and download all of these pages, but it's possible there will be omissions.
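The URL-parameter duplication described above is what the `--reject-regex` in `crawl.sh` guards against. A minimal sketch of the same idea, with a shortened pattern (the full list lives in `crawl.sh`):

```sh
# Skip links whose query strings would otherwise be saved as duplicate pages.
wget --recursive --accept html \
  --reject-regex "topics=|authors=|categories=" \
  https://www.consumerfinance.gov/
```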

71 changes: 64 additions & 7 deletions crawl.sh
@@ -1,16 +1,73 @@
 #!/usr/bin/env bash
 
+# Recursively crawl a website and save its HTML locally.
+#
+# Example usage:
+#
+# ./crawl.sh [-d depth] https://www.consumerfinance.gov
+#
+# Optionally specify -d depth to limit the crawl depth.
+
+# If a command fails, stop executing this script and return its error code.
+set -e
+
+depth=0
+
+while getopts ":d:" opt; do
+  case $opt in
+    d )
+      depth="$OPTARG";
+      number_regex='^[0-9]+$'
+      if ! [[ $depth =~ $number_regex ]] ; then
+        echo "Crawl depth must be a number." 1>&2
+        exit 1
+      fi
+      ;;
+    \? )
+      echo "Invalid option: -$OPTARG." 1>&2
+      exit 1
+      ;;
+    : )
+      echo "Invalid option: -$OPTARG requires an argument." 1>&2
+      exit 1
+      ;;
+  esac
+done
+
+shift $((OPTIND -1))
+
+url=$1
+
+if [ -z "$url" ]; then
+  echo "Must specify URL to crawl."
+  exit 1
+fi
+
+echo "Starting crawl at $url."
+
+if [ $depth -ne 0 ]; then
+  echo "Limiting crawl to depth $depth."
+fi
+
+domain=$url
+domain="${domain#http://}"
+domain="${domain#https://}"
+domain="${domain%%:*}"
+domain="${domain%%\?*}"
+domain="${domain%%/*}"
+echo "Limiting crawl to domain $domain."
+
 time wget \
-  --domains=www.consumerfinance.gov \
-  --exclude-domains=files.consumerfinance.gov \
+  --domains="$domain" \
+  --execute robots=off \
   --follow-tags=a \
-  --limit-rate=200k \
-  --random-wait \
+  --limit-rate=1m \
   --accept html \
   --reject-regex "topics=|authors=|categories=|filter_blog_category=|ext_url=|search_field=|issuer_name=" \
   --recursive \
-  --level=4 \
+  --level="$depth" \
   --trust-server-names \
-  --no-verbose \
+  --verbose \
   --no-clobber \
-  https://www.consumerfinance.gov/
+  --rejected-log=rejected.log \
+  "$url" 2>&1 | tee wget.log
20 changes: 10 additions & 10 deletions generate_summary.sh
@@ -1,13 +1,13 @@
 #!/usr/bin/env bash
 
-# Generate a summary of the crawl results into message.txt
-git add -A www.consumerfinance.gov
-git diff --staged --compact-summary --no-color > message.txt
-git reset .
+# If a command fails, stop executing this script and return its error code.
+set -e
 
-# Follow these instructions to set the value of the COMMIT_MESSAGE
-# environment variable and save it to the GitHub Actions environment:
-# https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-commands-for-github-actions#setting-an-environment-variable
-echo 'COMMIT_MESSAGE<<EOF' >> $GITHUB_ENV
-echo $(cat message.txt) >> $GITHUB_ENV
-echo 'EOF' >> $GITHUB_ENV
+# Generate a summary of the crawl results into commit.txt
+date > commit.txt
+cat >> commit.txt <<EOL
+lines  |lines  |
+added  |deleted|filename
+-------|-------|--------
+EOL
+git diff --numstat --no-color -- '*.html' >> commit.txt
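`git diff --numstat` emits tab-separated added/deleted line counts per file, so a generated `commit.txt` might look like this (hypothetical date, numbers, and path):

```
Mon Nov  2 17:00:00 UTC 2020
lines  |lines  |
added  |deleted|filename
-------|-------|--------
12      3       www.consumerfinance.gov/about-us/blog/index.html
```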
