███████╗ ██████╗██████╗  █████╗ ██████╗ ██╗   ██╗██████╗  ██████╗  ██████╗
██╔════╝██╔════╝██╔══██╗██╔══██╗██╔══██╗╚██╗ ██╔╝██╔══██╗██╔═══██╗██╔═══██╗
███████╗██║     ██████╔╝███████║██████╔╝ ╚████╔╝ ██║  ██║██║   ██║██║   ██║
╚════██║██║     ██╔══██╗██╔══██║██╔═══╝   ╚██╔╝  ██║  ██║██║   ██║██║   ██║
███████║╚██████╗██║  ██║██║  ██║██║        ██║   ██████╔╝╚██████╔╝╚██████╔╝
╚══════╝ ╚═════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝        ╚═╝   ╚═════╝  ╚═════╝  ╚═════╝
>--------------------------------- Scrapy dappy doo crawler for proxy sites
- Crawls proxy sites for working proxies
- Scrapyd server to initiate crawls and get results
- Retains jobs and logs for recent crawls
# Copy the example environment file to .env
cp .env.example .env
# Build the docker image and run the container
docker-compose up --build --detach
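Once the container is up, a quick sanity check confirms the Scrapyd service is reachable (the daemonstatus endpoint is described further below):

# Check that the container is running
docker-compose ps
# Ask scrapyd for its load status
curl http://localhost:6800/daemonstatus.json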
# Run a scrapy crawl job via cli
# docker-compose exec -it scrapyd scrapy crawl <spider_name>
docker-compose exec -it scrapyd scrapy crawl freeproxylist
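To also write the scraped proxies to a file, Scrapy's standard -O option can be added to the crawl command (a sketch assuming a reasonably recent Scrapy version; the file is written inside the container unless the path is mounted as a volume):

# Write scraped items to a JSON file (overwrites any existing file)
docker-compose exec -it scrapyd scrapy crawl freeproxylist -O proxies.json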
# Run a scrapy crawl job via scrapyd api
# Scrapyd documentation: https://scrapyd.readthedocs.io/en/latest/api.html#schedule-json
curl http://localhost:6800/schedule.json -d project=scrapydoo -d spider=freeproxylist
Scrapyd API is now available at http://localhost:6800.
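The scheduled job can then be tracked and, if needed, cancelled via the other endpoints listed below (the job value is the jobid returned by schedule.json):

# List pending, running and finished jobs for the project
curl "http://localhost:6800/listjobs.json?project=scrapydoo"
# Cancel a running job (replace <jobid> with the id returned by schedule.json)
curl http://localhost:6800/cancel.json -d project=scrapydoo -d job=<jobid>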
- root: / - Scrapyd server
- jobs: /jobs - crawl jobs
- items: /items - scraped items
- logs: /logs - spider logs
API endpoints provided by the scrapyd server:

- daemonstatus: /daemonstatus.json - to check the load status of a service
- addversion: /addversion.json - to add a new version of a project
- schedule: /schedule.json - to schedule a spider run
- cancel: /cancel.json - to cancel a spider run
- listprojects: /listprojects.json - to list all projects
- listversions: /listversions.json - to list all versions of a project
- listspiders: /listspiders.json - to list all spiders of a project
- listjobs: /listjobs.json - to list all pending, running and finished jobs
- delversion: /delversion.json - to delete a version of a project
- delproject: /delproject.json - to delete a project
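A couple of the read-only endpoints above are handy for exploring what is deployed (a sketch assuming the project is named scrapydoo, as in the schedule example earlier):

# List all projects known to the scrapyd server
curl http://localhost:6800/listprojects.json
# List the spiders available in the scrapydoo project
curl "http://localhost:6800/listspiders.json?project=scrapydoo"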
# Poetry is required for installing and managing dependencies
# https://python-poetry.org/docs/#installation
poetry install
# Run the crawlers
# poetry run scrapy crawl <spider_name>
poetry run scrapy crawl freeproxylist
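To see which spiders are available before picking one to crawl (scrapy list is a standard Scrapy command; the names come from this project's spider modules):

# List all spiders defined in the project
poetry run scrapy list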
# Install pre-commit hooks
poetry run pre-commit install
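To run all configured hooks once over the whole repository (standard pre-commit usage; which hooks run is determined by this repo's pre-commit config):

# Run every hook against all files, not just staged ones
poetry run pre-commit run --all-files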
# Formatting (formats code in place)
poetry run black .
# Linting (run the second command to fix issues automatically)
poetry run ruff .
poetry run ruff --fix .
# Type checking
poetry run mypy .
Configuration details can be found in pyproject.toml.