Google Images Scraper is a Python tool designed to scrape high-resolution images from Google Images based on provided links. It now supports multi-threading for faster scraping. This tool overcomes the limitations of some browser extensions that only download image thumbnails.
- Clone the repository:
  git clone https://github.com/your-username/google-images-scraper.git
- Navigate to the project directory:
  cd google-images-scraper
- Create the virtual environment:
  python -m venv .venv
- Activate the virtual environment:
  # For Linux and macOS
  source .venv/bin/activate
  # For Windows (PowerShell)
  .venv\Scripts\Activate.ps1
  # For Windows (Command Prompt)
  .venv\Scripts\activate.bat
- Install the required dependencies:
  pip install -r requirements.txt
- Run the scraper by executing the following command:
  python main.py
This script will fetch high-resolution images from Google Images based on the provided links using multi-threading for faster scraping.
You can customize the behavior of the scraper by modifying the config.yaml file.
- sender_email: The email address used for sending notifications.
- receiver_email: The email address that receives notifications.
- sender_email_password: The password for the sender's email account.
- send_email: Set to True or False to enable or disable email notifications.
Note: If you want to use the email notifications functionality with a Gmail account, it's recommended to generate an App Password instead of using your account password.
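As a rough illustration of how these email settings might be used, here is a minimal sketch built on Python's standard smtplib and email.message modules. It is not taken from email_service.py: the function names build_notification and send_notification are hypothetical, and the Gmail host smtp.gmail.com with SSL on port 465 is an assumption that only applies if the sender is a Gmail account.

```python
import smtplib
from email.message import EmailMessage


def build_notification(sender_email, receiver_email, subject, body):
    """Build a plain-text notification message (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender_email
    msg["To"] = receiver_email
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_notification(msg, sender_email, sender_email_password):
    """Send the message through Gmail's SMTP server over SSL.

    Assumption: the sender is a Gmail account and sender_email_password
    is an App Password, as recommended in the note above.
    """
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender_email, sender_email_password)
        server.send_message(msg)


# Example usage (not executed here, since it needs real credentials):
# msg = build_notification("me@gmail.com", "you@example.com",
#                          "Scraping finished", "All queries are done.")
# send_notification(msg, "me@gmail.com", "your-app-password")
```

Keeping message construction separate from sending makes it possible to check the message contents without contacting a mail server.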
- search_queries: List of search queries to use when scraping Google Images. You can add or remove queries as needed.
- images_limit: The maximum number of images to download per category.
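To make the expected shape of the email and search settings concrete, the sketch below checks a configuration that has already been loaded into a Python dictionary. The function validate_config is hypothetical and not part of the project, and it assumes the keys sit at the top level of config.yaml; the real file may nest them differently.

```python
def validate_config(config):
    """Return a list of problems found in an already-loaded config mapping.

    Assumption: keys are top-level; the project's config.yaml may nest them.
    """
    problems = []

    queries = config.get("search_queries")
    if not isinstance(queries, list) or not queries:
        problems.append("search_queries must be a non-empty list")
    elif not all(isinstance(q, str) and q.strip() for q in queries):
        problems.append("search_queries must contain non-empty strings")

    limit = config.get("images_limit")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        problems.append("images_limit must be a positive integer")

    # Email credentials only matter when notifications are switched on.
    if config.get("send_email"):
        for key in ("sender_email", "receiver_email", "sender_email_password"):
            if not config.get(key):
                problems.append(f"{key} is required when send_email is True")

    return problems


example = {
    "send_email": False,
    "search_queries": ["red pandas", "snow leopards"],
    "images_limit": 50,
}
print(validate_config(example))
```

Running a check like this before starting the browsers surfaces configuration mistakes early, rather than partway through a long scraping run.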
- csv_downloads: Directory to store CSV files.
- image_downloads: Directory to store downloaded images.

- downloader.py: Contains the class that downloads images using multi-threading.
- email_service.py: Provides functionality for email notifications (if needed).
- scraper.py: The main scraper class, which initiates the scraping process with multi-threading.
- config.yaml: Configuration file for the email and scraping parameters.
- link_saver.py: Handles saving image links.
- main.py: The main entry point for running the Google Images Scraper.
In main.py, an instance of the Scraper class is created as follows:
sc = Scraper(num_threads=5, show_ui=True)
- num_threads: The number of threads, which equals the total number of browser instances. More threads generally result in faster scraping but also higher resource usage. Adjust this value based on your system's capabilities and requirements.
- show_ui: Determines whether Selenium runs in headless mode. When set to True, the browser UI is shown during scraping. When set to False, Selenium runs in headless mode, so the browser operates in the background without a visible UI.
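The interplay of the two parameters can be sketched with the standard library alone. This is not the project's Scraper implementation: browser_arguments, scrape_query, and run_all are hypothetical names, scrape_query is a placeholder for the real browser work, and the --headless=new flag assumes a Chrome-based browser whose flags would be passed to Selenium's options object.

```python
from concurrent.futures import ThreadPoolExecutor


def browser_arguments(show_ui):
    """Return Chrome-style flags: none for a visible UI, headless otherwise.

    Assumption: a Chrome-based browser; other browsers use different flags.
    """
    return [] if show_ui else ["--headless=new"]


def scrape_query(query, show_ui):
    """Placeholder for the work one browser instance would perform."""
    args = browser_arguments(show_ui)
    mode = "headless" if "--headless=new" in args else "visible"
    return f"{query}: scraped in {mode} mode"


def run_all(queries, num_threads=5, show_ui=True):
    """Process the queries on a pool of num_threads worker threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as the input queries.
        return list(pool.map(lambda q: scrape_query(q, show_ui), queries))


for line in run_all(["cats", "dogs", "birds"], num_threads=2, show_ui=False):
    print(line)
```

With num_threads=2, at most two queries are processed at once, which mirrors the trade-off described above: a larger pool finishes sooner but keeps more browser instances open at the same time.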
The rest of the process is straightforward:
- Run the scraper by executing main.py:
  python main.py
- The scraper will start fetching high-resolution images from Google Images based on the provided links and configurations, using the specified number of threads and UI visibility.
- Monitor the scraping progress and any notifications sent via email, as configured in config.yaml.
Contributions to Google Images Scraper are welcome and encouraged! To contribute, follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and test thoroughly.
- Commit your changes with descriptive commit messages.
- Push your changes to your fork.
- Open a pull request, explaining the changes you've made.
This project is licensed under the MIT License.