Skip to content

A tool for scraping websites and converting content into insightful markdown files using Selenium and a language model for analysis.

License

Notifications You must be signed in to change notification settings

MaharshPatelX/InsightScraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

InsightScraper

Overview

InsightScraper is a powerful tool designed to scrape websites and convert the content into insightful markdown files. The tool leverages Selenium for web scraping and a language model to generate insightful data from the scraped content.

Features

  • Web Scraping: Scrape entire websites and convert pages to markdown format.
  • Insight Generation: Generate insightful data from the scraped content using a language model.
  • Structured Output: Save the processed markdown files in a structured format.

Installation

Prerequisites

  • Python 3.7 or higher
  • Google Chrome browser
  • ChromeDriver corresponding to your Chrome version

Steps

  1. Clone the repository:
    git clone https://github.com/yourusername/InsightScraper.git
    
    
  2. Navigate to the project directory:
    cd InsightScraper
    
    
  3. Create a virtual environment and activate it:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
    
  4. Install the dependencies:
    pip install -r requirements.txt
    
    
  5. Configure the project:
    • Update config/config.yaml with the necessary details like the path to ChromeDriver and your Groq API key.

Configuration

Update the configuration file config/config.yaml with your settings:

scraper:
  download_path: "downloads"
  chromedriver_path: "path/to/chromedriver"
  headless: true

insights:
  groq_api_key: "your_groq_api_key"
  model_name: "llama3-70b-8192"

Usage

Web Scraping

To scrape a website, run the following command and provide the base URL when prompted:

python scripts/scrape_website.py

Generating Insights

To generate insights from the scraped markdown files, run:

python scripts/generate_insights.py

Directory Structure

InsightScraper/
├── config/
│   └── config.yaml
├── src/
│   ├── __init__.py
│   ├── scraper/
│   │   ├── __init__.py
│   │   └── website_md.py
│   ├── insights/
│   │   ├── __init__.py
│   │   └── md_to_insightful_data.py
├── tests/
│   ├── __init__.py
│   ├── test_website_md.py
│   └── test_md_to_insightful_data.py
├── scripts/
│   ├── scrape_website.py
│   └── generate_insights.py
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore

Testing

To run tests, use:

python -m unittest discover tests

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Commit your changes (git commit -m 'Add some feature').
  4. Push to the branch (git push origin feature-branch).
  5. Create a new Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

A tool for scraping websites and converting content into insightful markdown files using Selenium and a language model for analysis.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages