A Python project that automates scraping job postings from Indeed using Selenium. The tool navigates through job postings, collects job details, and saves them as structured JSON files for analysis.
- The script `indeed_scraper.py` is executed with the following syntax: `python indeed_scraper.py --role "<Job Title>" --location "<Country Name>"`
- The script navigates through Indeed's search results for the specified role and location.
- Selenium opens Indeed in a web browser and iterates through all job postings for the given query.
- For each job posting:
  - The browser navigates to the posting's detailed page.
  - Job details such as Title, Company, Description, Salary, and Employment Type are extracted.
  - Random pauses are introduced between scraping job descriptions to simulate human-like behavior.
- For each page of results:
  - A JSON file is created with the extracted data.
  - The files are saved in a directory named after the current date, location, and role, e.g., `2025-01-04_Spain_Data Analyst`.
  - Each JSON file corresponds to a page, using a naming convention like `2025-01-04_Spain_Data Analyst_0.json`.
- Selenium clicks the "Next Page" button to proceed to the next page of results.
- If no "Next Page" button is available, the script terminates for the current role and location.
- The scraper is designed to handle CAPTCHAs:
  - When a CAPTCHA is encountered, the script pauses and retries indefinitely until the page is accessible.
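The per-page output convention described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `page_output_path` and `save_page`, the `base_dir` parameter, and the shape of the job records are assumptions made for the example. Only the directory and file naming pattern comes from the description above.

```python
import json
from datetime import date
from pathlib import Path


def page_output_path(base_dir, role, location, page_index, day=None):
    """Build <date>_<location>_<role>/<date>_<location>_<role>_<page>.json.

    Hypothetical helper: only the naming pattern is taken from the README.
    """
    day = day or date.today()
    stem = f"{day.isoformat()}_{location}_{role}"
    return Path(base_dir) / stem / f"{stem}_{page_index}.json"


def save_page(jobs, base_dir, role, location, page_index, day=None):
    """Write one page of scraped jobs (a list of dicts) to its JSON file."""
    path = page_output_path(base_dir, role, location, page_index, day)
    # Create the dated directory on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps accented characters readable in the file.
        json.dump(jobs, fh, ensure_ascii=False, indent=2)
    return path
```

Calling `save_page(jobs, ".", "Data Analyst", "Spain", 0)` on 2025-01-04 would write `2025-01-04_Spain_Data Analyst/2025-01-04_Spain_Data Analyst_0.json` under the current directory.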
The scraper supports these countries:
- 🇪🇸 Spain
- 🇬🇧 United Kingdom
- 🇨🇦 Canada
- 🇩🇪 Germany
- 🇦🇺 Australia
- 🇸🇬 Singapore
- 🇮🇳 India
- 🇨🇴 Colombia
- Developed using Selenium for dynamic web scraping.
- Automatically navigates through job postings and collects data directly from job detail pages.
- Random delays between actions to reduce the likelihood of being flagged as a bot.
- Handles CAPTCHAs by pausing and retrying until the page is accessible.
- Saves data in a structured JSON format for each page, enabling easy processing.
- Includes mechanisms to avoid bot detection:
  - Random user agents.
  - Simulated mouse movements.
  - Automatic acceptance of cookies.
  - Random pauses between interactions.
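As a rough sketch of two of these mechanisms, random pauses and the retry-until-accessible CAPTCHA handling could look like the code below. The helper names `human_pause` and `wait_until_accessible`, the default delay range, and the callback-based design are assumptions for illustration. The actual script may detect CAPTCHAs and time its pauses differently.

```python
import random
import time


def human_pause(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Sleep for a random duration and return the delay that was used.

    The 2-6 second range is an assumed default, not taken from the project.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def wait_until_accessible(is_blocked, reload_page, pause=human_pause):
    """Pause and reload while is_blocked() reports a CAPTCHA or block page.

    Loops indefinitely while blocked and returns the number of retries.
    """
    retries = 0
    while is_blocked():
        pause()
        reload_page()
        retries += 1
    return retries
```

With Selenium, `reload_page` could be `driver.refresh`, and `is_blocked` could be a function that inspects the current page for a CAPTCHA marker. The exact marker to look for is site-specific and is left out here.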
Follow the steps below to set up the project:
git clone https://github.com/juanludataanalyst/indeed-selenium-scraper.git
cd indeed-selenium-scraper
python -m venv env
source env/bin/activate # On Windows: .\env\Scripts\activate
pip install -r requirements.txt
- Ensure that you have the appropriate WebDriver installed for Selenium (e.g., ChromeDriver or GeckoDriver) that matches your browser version.
- Add the WebDriver's executable file to your system's PATH.
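To confirm that a driver is actually reachable on PATH before running the scraper, a small check like the one below may help. It is a convenience sketch and not part of the project. The function name `find_webdriver` is hypothetical, and the executable names `chromedriver` and `geckodriver` are the usual defaults for ChromeDriver and GeckoDriver.

```python
import shutil


def find_webdriver(candidates=("chromedriver", "geckodriver")):
    """Return (name, path) of the first driver executable found on PATH, else None."""
    for name in candidates:
        # shutil.which mirrors the shell's PATH lookup and returns None if absent.
        path = shutil.which(name)
        if path:
            return name, path
    return None
```

If `find_webdriver()` returns `None`, neither driver is on PATH and Selenium may fail to start the browser. The check does not verify that the driver version matches the installed browser.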
Run the script using the following syntax:
python indeed_scraper.py --role "<Job Title>" --location "<Country Name>"
To scrape job postings for Data Analyst roles in Spain:
python indeed_scraper.py --role "Data Analyst" --location "Spain"
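For readers curious how the two flags might be parsed, here is a minimal `argparse` sketch. It is an assumption about the structure, not the script's actual code: the `parse_args` function, the `SUPPORTED_COUNTRIES` constant, and the choice to reject unsupported countries at parse time are all illustrative.

```python
import argparse

# Country names as listed in this README; the constant name is hypothetical.
SUPPORTED_COUNTRIES = [
    "Spain", "United Kingdom", "Canada", "Germany",
    "Australia", "Singapore", "India", "Colombia",
]


def parse_args(argv=None):
    """Parse --role and --location; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Scrape Indeed job postings.")
    parser.add_argument("--role", required=True,
                        help='Job title, e.g. "Data Analyst"')
    parser.add_argument("--location", required=True,
                        choices=SUPPORTED_COUNTRIES,
                        help="Country name")
    return parser.parse_args(argv)
```

Restricting `--location` with `choices` makes an unsupported country fail immediately with a usage message instead of partway through a browser session.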
- Data Storage: JSON files for each page are saved in a directory named after the date, location, and role.
- Blocking Avoidance: Introduces random delays and handles CAPTCHAs.
We welcome contributions! Follow these steps:
- Fork the repository.
- Create a feature branch (`git checkout -b feature/your-feature`).
- Commit your changes (`git commit -m "Add your message"`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
This project is for educational purposes only. Ensure compliance with Indeed’s terms of service when using this tool.
This project is licensed under the MIT License. See the LICENSE file for details.