Commit

Merge pull request #15 from cfpb/reject-not-accept
Reject non-HTML instead of accepting only HTML
chosak authored Nov 2, 2020
2 parents 25a6211 + c77fa97 commit 6ef315c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion crawl.sh
@@ -62,7 +62,7 @@ time wget \
 --execute robots=off \
 --follow-tags=a \
 --limit-rate=1m \
---accept html \
+--reject '*.css,*.csv,*.CSV,*.doc,*.docx,*.epub,*.gif,*.ico,*.jpg,*.js,*.mp3,*.pdf,*.PDF,*.png,*.pptx,*.tmp,*.txt,*.wav,*.woff,*.woff2,*.xls,*xlsx,*.xml,*.zip' \
 --reject-regex "topics=|authors=|categories=|filter_blog_category=|ext_url=|search_field=|issuer_name=" \
 --recursive \
 --level="$depth" \
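
For readers unfamiliar with the two options: per the wget manual, an entry in an accept or reject list that contains a wildcard is treated as a pattern, and an entry without one is treated as a file name suffix. So `--accept html` keeps only names ending in `html`, while the new `--reject` list drops names matching any listed pattern and keeps everything else, including names with no extension. The sketch below only approximates that matching with shell globs. The helper `would_reject` and the shortened pattern list are hypothetical, written for illustration, and are not part of crawl.sh; wget's own matching may differ in detail.

```shell
#!/bin/sh
# Hypothetical illustration, not part of crawl.sh.
# A shortened subset of the reject list from the diff above.
reject_list='*.css,*.pdf,*.PDF,*.png,*xlsx'

# Print "reject" if the name matches any pattern in reject_list, else "keep".
# Shell glob matching stands in for wget's pattern matching (an assumption).
would_reject() {
    name=$1
    old_ifs=$IFS
    IFS=,
    set -f    # do not expand the patterns against files in the current directory
    for pattern in $reject_list; do
        case $name in
            $pattern)
                IFS=$old_ifs
                set +f
                echo reject
                return 0
                ;;
        esac
    done
    IFS=$old_ifs
    set +f
    echo keep
}

would_reject report.pdf    # prints "reject"
would_reject index.html    # prints "keep"
would_reject about         # prints "keep" (no extension, nothing matches)
would_reject data.xlsx     # prints "reject" (matched by *xlsx)
```

Under the old `--accept html` rule, a name such as `about` would not end in `html` and so would not have been kept, which is consistent with the commit title.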