Rewriting some tools as scrapy spiders #30

Merged Jan 20, 2015 (10 commits)

@bsloan (Contributor) commented Jan 13, 2015

For what it's worth, here's an attempt to rewrite some of Ahmia's test tools as Scrapy spiders (#20). The original test tools are unchanged.

This needs review and additional testing, both to make sure that differences between my environment and the Ahmia production environment don't cause problems and to assess whether the changes are worthwhile. The rewritten tools are listed below with usage notes for each (commands are run from the ahmia/tools directory).

  • finder_spider

Collects links to onions and then PUTs the list of links to the Ahmia endpoint. To run:

$ scrapy crawl finder_spider
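The collect-then-PUT flow described above can be sketched in plain Python. This is only an illustration, not the spider's actual code: it assumes 16-character onion addresses, and the endpoint URL, the JSON payload shape, and the names `extract_onions` and `build_put_request` are all hypothetical. The request is built but never sent.

```python
import json
import re
import urllib.request

# Assumption: 16-character base32 onion addresses.
ONION_RE = re.compile(r"\b([a-z2-7]{16})\.onion\b", re.IGNORECASE)


def extract_onions(text):
    """Return the sorted, de-duplicated onion hostnames found in text."""
    return sorted({m.group(1).lower() + ".onion" for m in ONION_RE.finditer(text)})


def build_put_request(endpoint, onions):
    """Build (but do not send) a PUT request carrying the onion list as JSON.

    The {"onions": [...]} payload shape is a placeholder, not Ahmia's API.
    """
    body = json.dumps({"onions": onions}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


page = '<a href="http://abcdefghijklmnop.onion/">a</a> http://ABCDEFGHIJKLMNOP.onion/b'
request = build_put_request("https://example.invalid/onions/", extract_onions(page))
print(request.get_method(), request.data.decode("utf-8"))
```

Lower-casing before de-duplication means the same onion linked with different capitalization is only reported once.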

  • t2w_domain_spider

Iterates a list of tor2web nodes and saves their visited domains lists to disk. To run:

$ scrapy crawl t2w_domain_spider

  • t2w_filter_spider

Downloads the list of hashes of blacklisted onions from a set of tor2web nodes and saves the blacklist to disk. To run:

$ scrapy crawl t2w_filter_spider
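As a rough sketch of what combining the per-node blacklists might involve, the snippet below merges several downloaded lists into one de-duplicated list and writes it out. It assumes each node returns one hex MD5 digest of an onion hostname per line; that format, and the helper names `merge_blacklists`, `save_blacklist`, and `is_blacklisted`, are assumptions for illustration and not taken from the spider itself.

```python
import hashlib
import re

# Assumption: each blacklist entry is a 32-character hex MD5 digest.
MD5_RE = re.compile(r"^[0-9a-f]{32}$")


def merge_blacklists(node_responses):
    """Merge the response bodies from several nodes into one sorted list."""
    merged = set()
    for body in node_responses:
        for line in body.splitlines():
            candidate = line.strip().lower()
            if MD5_RE.match(candidate):  # skip blanks, comments, and junk
                merged.add(candidate)
    return sorted(merged)


def save_blacklist(path, hashes):
    """Write one hash per line."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(hashes) + "\n")


def is_blacklisted(onion_host, blacklist):
    """Check a hostname against the merged list (assumes MD5 of the hostname)."""
    digest = hashlib.md5(onion_host.lower().encode("ascii")).hexdigest()
    return digest in blacklist
```

Validating each line against the expected digest format keeps an error page from a misbehaving node out of the saved file.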

  • backlink_spider

A backlink-counting utility. It accepts four arguments: target, host, links, and count (boolean).

The only required argument is target, which tells the crawler which URL to find backlinks for. Arguments are passed in the format Scrapy expects, for example:

$ scrapy crawl backlinkspider -a target="http://www.nytimes.com/" -a count=true
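Scrapy hands each `-a name=value` pair to the spider's `__init__` as a keyword argument, and the values always arrive as strings, so `count=true` is the string `"true"` rather than a boolean. A minimal sketch of how the four arguments might be normalized follows; it is plain Python with no Scrapy import, and `BacklinkArgs` and `parse_bool` are hypothetical names rather than the spider's own code. Since the meaning of host and links is not spelled out here, they are stored as given.

```python
def parse_bool(value, default=False):
    """Interpret a command-line string such as "true" or "0" as a boolean."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


class BacklinkArgs:
    """Holds the arguments that a spider's __init__ would receive via -a."""

    def __init__(self, target=None, host=None, links=None, count=None):
        if not target:
            raise ValueError("target is required, e.g. -a target=http://example.com/")
        self.target = target
        self.host = host      # kept as passed in
        self.links = links    # kept as passed in
        self.count = parse_bool(count)


args = BacklinkArgs(target="http://www.nytimes.com/", count="true")
print(args.target, args.count)
```

Failing early on a missing target gives a clearer error than letting the crawl start with nothing to search for.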

  • test_services_spider

Tests whether hidden services are online. To run:

$ scrapy crawl test_services_spider

@juhanurmi (Owner)

Wow! Sounds like great work and effort! I will test the code as soon as possible :)

juhanurmi added a commit that referenced this pull request Jan 20, 2015
Rewriting some tools as scrapy spiders
@juhanurmi merged commit abcf972 into juhanurmi:master Jan 20, 2015