Introduction
This Python script extracts text content from webpages and saves it to separate files. It uses the requests and BeautifulSoup libraries for efficient web scraping and HTML parsing. It is useful for creating datasets for ML or AI training.
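The extraction approach described above can be sketched as follows. This is a minimal illustration, not the project's actual main.py; the function names and the sample HTML are hypothetical.

```python
# Minimal sketch of fetching a page with requests and extracting its
# visible text with BeautifulSoup. Names here are illustrative.
import requests
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Parse HTML and return its visible text, one line per text block."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style tags, which hold no readable content.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def fetch_text(url: str) -> str:
    """Download a page and return its extracted text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_text(response.text)


# Parsing works on any HTML string, no network required:
print(extract_text("<html><body><h1>Title</h1><p>Hello</p></body></html>"))
```

Separating parsing (`extract_text`) from fetching (`fetch_text`) makes the text-extraction step easy to test without network access.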
Disclaimer
- Ethical Use: It's crucial to adhere to the terms of service and robots.txt guidelines of the websites you intend to scrape. Respecting these guidelines ensures responsible and ethical use of this script.
- No Warranties: This script is provided "as is" without any express or implied warranties. The author disclaims any liability for any damages arising from its use.
License
This project is licensed under the MIT License. Refer to the LICENSE file for the full terms.
Installation
- Clone the Repository:
git clone https://github.com/akumathedynd/python-webpage-2-txt-web2txt.git
- Install Dependencies:
Navigate to the project directory and install the required libraries using pip:
cd python-webpage-2-txt-web2txt
pip install requests beautifulsoup4
Usage
1. Prepare a Text File (urls.txt):
Create a file named urls.txt in the project directory. List each website URL you want to scrape on a separate line.
2. Run the Script:
Execute the script using Python:
python main.py
Output
- The script creates a directory named output_dir (you can modify this name) to store the extracted text files.
- Each file is named after the corresponding webpage URL, ensuring traceability.
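Naming a file after a URL requires replacing characters that are invalid in filenames. The sketch below shows one way this could work; it is an assumption about the naming scheme, not the script's actual implementation, and `url_to_filename` is a hypothetical helper.

```python
# Sketch: derive a filesystem-safe output filename from a webpage URL.
# The exact scheme used by main.py may differ; this is illustrative.
import re
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Turn a URL into a safe .txt filename based on its host and path."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Replace any run of characters unsafe in filenames with an underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "page") + ".txt"


print(url_to_filename("https://example.com/docs/intro"))
```

Keeping the host and path in the filename preserves the traceability mentioned above: each output file can be matched back to the URL it came from.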