This project is a Scrapy-based web scraper designed to scrape quotes from quotes.toscrape.com with proxy rotation support using the scrapy-rotating-proxies
middleware. The proxy rotation helps in avoiding IP bans and throttling by the target website.
Make sure you have Python 3.x installed along with Scrapy. To install Scrapy, use:
pip install scrapy
Additionally, install the scrapy-rotating-proxies
middleware:
pip install scrapy-rotating-proxies
proxy_rotation/
├── proxy_rotation/
│ ├── __init__.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders/
│ ├── __init__.py
│ └── quotes_proxy_rotation_spider.py
├── scrapy.cfg
└── README.md
Clone the repository and navigate to the project directory:
git clone https://github.com/yourusername/proxy_rotation.git
cd proxy_rotation
-
Proxies Setup:
Update the list of proxies in the
settings.py
file:ROTATING_PROXY_LIST = [ # if proxy need authentication then list will be in below format 'http://username:password@proxy1.com:8000', # list without authentication # 'ip:port' 'proxy1.com:8000' # Add more proxies as needed ]
- If your proxies require authentication, use the format
http://username:password@proxy:port
. - If your proxies do not require authentication, use the format
ip:port
.
- If your proxies require authentication, use the format
-
Spider Setup:
The spider is located in
spiders/quotes_proxy_rotation_spider.py
. It scrapes quotes and author details from the website. -
Run the Spider:
Run the spider using the following command:
scrapy crawl quotes_proxy_rotation_spider
-
Proxy Rotation: Uses multiple proxies to avoid IP bans and throttling by rotating through a list of proxies.
-
Robust Scraping: Handles pagination and extracts detailed information about quotes and authors.
This project is licensed under the MIT License - see the LICENSE file for details.