InsightScraper scrapes websites, converts each page to a markdown file, and generates insights from the result. It uses Selenium for web scraping and a language model, accessed through the Groq API, to generate insights from the scraped content.
- Web Scraping: Scrape entire websites and convert pages to markdown format.
- Insight Generation: Generate insightful data from the scraped content using a language model.
- Structured Output: Save the processed markdown files in a structured format.
- Python 3.7 or higher
- Google Chrome browser
- ChromeDriver corresponding to your Chrome version
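Before installing, it can help to confirm these prerequisites from Python. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and the executable names it searches for are illustrative assumptions and may differ on your system (Chrome is often installed outside PATH on Windows and macOS).

```python
import shutil
import sys
from typing import List


def check_prerequisites() -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # The requirements above call for Python 3.7 or higher.
    if sys.version_info < (3, 7):
        problems.append(
            "Python 3.7 or higher is required, found %d.%d" % sys.version_info[:2]
        )

    # Assumed executable names; adjust them for your platform.
    if shutil.which("chromedriver") is None:
        problems.append("chromedriver was not found on PATH")
    chrome_names = ("google-chrome", "google-chrome-stable", "chrome")
    if not any(shutil.which(name) for name in chrome_names):
        problems.append("Google Chrome was not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

A warning about ChromeDriver is not necessarily fatal, since the project reads the ChromeDriver location from config/config.yaml rather than requiring it on PATH.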
- Clone the repository:
git clone https://github.com/yourusername/InsightScraper.git
- Navigate to the project directory:
cd InsightScraper
- Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install the dependencies:
pip install -r requirements.txt
- Configure the project: update config/config.yaml with the path to ChromeDriver and your Groq API key.
Update the configuration file config/config.yaml with your settings:
scraper:
  download_path: "downloads"
  chromedriver_path: "path/to/chromedriver"
  headless: true
insights:
  groq_api_key: "your_groq_api_key"
  model_name: "llama3-70b-8192"
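A common mistake is running the scripts with the sample placeholders still in place. The sketch below shows one way to check a parsed configuration for missing keys and unreplaced placeholders. The function `validate_config` is hypothetical and not part of the project; the key names and placeholder strings are taken from the sample above.

```python
from typing import Any, Dict, List

# Required keys per section, taken from the sample configuration.
REQUIRED_KEYS = {
    "scraper": ("download_path", "chromedriver_path", "headless"),
    "insights": ("groq_api_key", "model_name"),
}

# Placeholder values from the sample configuration that must be replaced.
PLACEHOLDERS = {"path/to/chromedriver", "your_groq_api_key"}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems in a parsed config; an empty list means it looks complete."""
    errors = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section)
        if not isinstance(values, dict):
            errors.append("missing section: " + section)
            continue
        for key in keys:
            if key not in values:
                errors.append("missing key: %s.%s" % (section, key))
            elif isinstance(values[key], str) and values[key] in PLACEHOLDERS:
                errors.append("placeholder not replaced: %s.%s" % (section, key))
    return errors
```

The dictionary would typically come from parsing the file with a YAML library, for example PyYAML's `yaml.safe_load`; whether requirements.txt already includes such a library is an assumption to verify.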
To scrape a website, run the following command and provide the base URL when prompted:
python scripts/scrape_website.py
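Scraping an entire site from a base URL usually means following only links that stay on the same host. How scripts/scrape_website.py decides this is not shown here; the sketch below illustrates one common approach with the standard library, and the helper name `is_internal_link` is hypothetical.

```python
from urllib.parse import urljoin, urlparse


def is_internal_link(base_url: str, href: str) -> bool:
    """Return True if href, resolved against base_url, stays on the same host."""
    # Resolve relative links such as "/about" or "guide.html" against the base URL.
    absolute = urlparse(urljoin(base_url, href))
    base = urlparse(base_url)
    # Only follow http(s) links whose host matches the base URL's host.
    return absolute.scheme in ("http", "https") and absolute.netloc == base.netloc


if __name__ == "__main__":
    base = "https://example.com/docs/"
    for link in ("guide.html", "/about", "https://other.example.org/", "mailto:a@example.com"):
        print(link, is_internal_link(base, link))
```

This check compares hosts exactly, so a subdomain such as blog.example.com would count as external; a real crawler may want a looser rule.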
To generate insights from the scraped markdown files, run:
python scripts/generate_insights.py
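The insight step works on the markdown files produced by the scraper. Assuming they are written as .md files somewhere under the configured download_path (the exact layout is not documented here), a quick way to see what is available for processing is to list them. The function name `find_markdown_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_markdown_files(download_path: str = "downloads") -> List[Path]:
    """Return all .md files under download_path, sorted for a stable order."""
    root = Path(download_path)
    if not root.is_dir():
        # Nothing has been scraped yet, or the path is wrong.
        return []
    return sorted(p for p in root.rglob("*.md") if p.is_file())


if __name__ == "__main__":
    for path in find_markdown_files():
        print(path)
```

An empty result suggests the scrape step has not run yet or that download_path points somewhere else.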
InsightScraper/
├── config/
│   └── config.yaml
├── src/
│   ├── __init__.py
│   ├── scraper/
│   │   ├── __init__.py
│   │   └── website_md.py
│   └── insights/
│       ├── __init__.py
│       └── md_to_insightful_data.py
├── tests/
│   ├── __init__.py
│   ├── test_website_md.py
│   └── test_md_to_insightful_data.py
├── scripts/
│   ├── scrape_website.py
│   └── generate_insights.py
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore
To run tests, use:
python -m unittest discover tests
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature-branch).
- Commit your changes (git commit -m 'Add some feature').
- Push to the branch (git push origin feature-branch).
- Create a new Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.