Reject non-HTML instead of accepting only HTML
Trying to accept only files that end in .html causes problems when:

1. Links on a page don't end in a trailing slash (e.g. /foo/bar), so
wget interprets the link as being of type "bar" and rejects it.
2. Long URLs get truncated when saved as files and thus don't end in
.html. These get deleted by wget.

This change restores the old behavior of providing an explicit reject list
instead of accepting only HTML. This is slightly suboptimal; it would be
nice not to have to maintain a potentially ever-growing list of file
extensions, but I'm not sure of a better way to accomplish what we want.
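
The difference between the two strategies can be sketched with a small shell model. This is a simplified illustration, not wget's actual matching code: the function names accept_html and reject_list are hypothetical, the file name truncated-page-name.ht is made up to stand in for a truncated long URL, and the reject patterns are only a subset of the list in crawl.sh.

```shell
#!/bin/sh
# Simplified model of the two filtering strategies; NOT wget's real logic.
# Function names and sample file names are assumptions for illustration.

# Accept-only: keep a saved file name only if it ends in "html".
accept_html() {
    case "$1" in
        *html) echo keep ;;
        *)     echo drop ;;
    esac
}

# Reject list: drop a name only if it matches a listed glob
# (subset of the patterns in crawl.sh).
reject_list() {
    case "$1" in
        *.css|*.js|*.pdf|*.png) echo drop ;;
        *)                      echo keep ;;
    esac
}

# "bar" models a link without a trailing slash (/foo/bar);
# "truncated-page-name.ht" models a long URL cut off before ".html".
for name in index.html bar truncated-page-name.ht style.css; do
    printf '%s accept=%s reject=%s\n' \
        "$name" "$(accept_html "$name")" "$(reject_list "$name")"
done
```

In this model, bar and truncated-page-name.ht are dropped by the accept-only check but kept by the reject list, which matches the two failure cases described above. The cost is the one already noted: every unwanted extension has to be listed explicitly.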
chosak committed Nov 2, 2020
1 parent af7512d commit 3a07ed0
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion crawl.sh
@@ -62,7 +62,7 @@ time wget \
 --execute robots=off \
 --follow-tags=a \
 --limit-rate=1m \
---accept html \
+--reject '*.css,*.doc,*.docx,*.epub,*.gif,*.ico,*.jpg,*.js,*.mp3,*.PDF,*.pdf,*.png,*.pptx,*.tmp,*.txt,*.wav,*.woff,*.woff2,*.xls,*xlsx,*.xml,*.zip' \
 --reject-regex "topics=|authors=|categories=|filter_blog_category=|ext_url=|search_field=|issuer_name=" \
 --recursive \
 --level="$depth" \