Rewriting some tools as scrapy spiders #30

bsloan · 2015-01-13T03:25:27Z

For what it's worth, here's an attempt to rewrite some of Ahmia's test tools as Scrapy spiders (#20). The original test tools are unchanged.

This needs review and additional testing to make sure that differences between my environment and the Ahmia production environment don't cause problems, and more generally to assess whether these changes are worthwhile. The rewritten tools are listed below with usage notes for each (commands are run from the ahmia/tools directory).

finder_spider

Collects links to onions and then PUTs the list of links to the Ahmia endpoint. To run:

$ scrapy crawl finder_spider
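To illustrate the link-collection step, here is a minimal sketch rather than the spider's actual code. It assumes 16-character base32 onion hostnames; the helper names `extract_onions` and `build_payload` and the `{"links": [...]}` payload shape are hypothetical, and the sketch only builds the request body without sending it.

```python
import json
import re

# Assumption: onion hostnames are 16 base32 characters followed by ".onion".
ONION_RE = re.compile(r"\b([a-z2-7]{16}\.onion)\b", re.IGNORECASE)


def extract_onions(text):
    """Return the unique onion hostnames found in text, lowercased and sorted."""
    return sorted({match.lower() for match in ONION_RE.findall(text)})


def build_payload(onions):
    """Serialize the collected links as JSON (hypothetical payload shape)."""
    return json.dumps({"links": ["http://%s/" % onion for onion in onions]})
```

Deduplicating before building the payload keeps the PUT body small when the same onion is linked from many pages.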

t2w_domain_spider

Iterates over a list of tor2web nodes and saves each node's visited-domains list to disk. To run:

$ scrapy crawl t2w_domain_spider

t2w_filter_spider

Downloads the list of hashes of blacklisted onions from a set of tor2web nodes and saves the blacklist to disk. To run:

$ scrapy crawl t2w_filter_spider

backlink_spider

A backlink-counting utility. It accepts four arguments: target, host, links, and count (boolean).

The only required argument is target, which tells the crawler which URL to search for backlinks to. Arguments are passed in the format Scrapy expects, for example:

$ scrapy crawl backlinkspider -a target="http://www.nytimes.com/" -a count=true
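Scrapy hands `-a` values to the spider as strings, so `count=true` arrives as the text "true" rather than a boolean. Here is a minimal sketch of one way to handle that, not the spider's actual code. `parse_bool_arg` and `BacklinkArgs` are hypothetical names, and host and links are simply stored unchanged.

```python
def parse_bool_arg(value, default=False):
    """Interpret a command-line argument (normally a string) as a boolean."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes", "on")


class BacklinkArgs(object):
    """Hypothetical holder for the four arguments; in a real spider this
    logic would live in the spider's __init__."""

    def __init__(self, target=None, host=None, links=None, count=None):
        # target is the only required argument.
        if not target:
            raise ValueError("the 'target' argument is required")
        self.target = target
        self.host = host
        self.links = links
        self.count = parse_bool_arg(count)
```

For the example command above, `BacklinkArgs(target="http://www.nytimes.com/", count="true").count` evaluates to `True`.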

test_services_spider

Tests whether hidden services are online. To run:

$ scrapy crawl test_services_spider


juhanurmi · 2015-01-13T20:19:14Z

Wow! Sounds like great work and effort! I will test the code as soon as possible :)


Brian David Sloan and others added 10 commits December 24, 2014 15:46

refactoring backlinker tool as a scrapy spider. initial commit

e51c910

remove note-to-self comment

c2385ce

rewriting the onion finder script as a scrapy spider

69e9182

remove local testing url

e271cf3

convert get_tor2web_domains script to scrapy

ce89d41

modify backlink spider to print link count separately from logging

c2e2aa4

convert tor2web_filters utility to a scrapy crawler

ab51918

first pass at converting hidden services online tester to scrapy framework

ff09f4d

oops. getting rid of silly debug print statement

fe1d1b7

renaming some of the new spider source files

841175f

juhanurmi added a commit that referenced this pull request Jan 20, 2015

Merge pull request #30 from bsloan/master

abcf972

Rewriting some tools as scrapy spiders

juhanurmi merged commit abcf972 into juhanurmi:master Jan 20, 2015
