


akumathedyn123/python-webpage-2-txt-web2txt


python-webpage-2-txt-web2txt

Introduction

This Python script extracts the text content of webpages and saves each page to a separate file. It uses the requests and BeautifulSoup libraries to fetch pages and parse their HTML. It is useful for building datasets for ML or AI training.
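
As a rough illustration of the fetch-and-parse step described above, here is a minimal sketch. It is not taken from main.py. The function names `extract_text` and `fetch_text`, the list of skipped tags, and the 10-second timeout are all assumptions made for the example.

```python
import requests
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Return the readable text of an HTML document, one piece per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: script/style/noscript contents are not wanted in the output.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # strip=True trims whitespace around each text fragment.
    lines = soup.get_text(separator="\n", strip=True).splitlines()
    return "\n".join(line for line in lines if line)


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its extracted text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    return extract_text(response.text)
```

Keeping the parsing separate from the download makes it possible to try `extract_text` on a saved HTML string without touching the network.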

Disclaimer

  • Ethical Use: It's crucial to adhere to the terms of service and robots.txt guidelines of the websites you intend to scrape. Respecting these guidelines ensures responsible and ethical use of this script.
  • No Warranties: This script is provided "as is" without any express or implied warranties. The author disclaims any liability for any damages arising from its use.
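
One way to act on the robots.txt point above is to check each URL before fetching it. The sketch below uses Python's standard `urllib.robotparser` module. The helper names (`robots_url_for`, `is_allowed`, `is_allowed_live`) and the default `"*"` user agent are assumptions for illustration, and the repository's script may or may not perform such a check.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(url: str) -> str:
    """Build the robots.txt URL for the site that hosts `url`."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(url: str, robots_lines: list[str], user_agent: str = "*") -> bool:
    """Check `url` against robots.txt rules that are already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def is_allowed_live(url: str, user_agent: str = "*") -> bool:
    """Download the site's robots.txt, then check `url` (needs network access)."""
    parser = RobotFileParser(robots_url_for(url))
    parser.read()
    return parser.can_fetch(user_agent, url)
```

A robots.txt check covers only part of responsible use. A site's terms of service still have to be read separately.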

License

This project is licensed under the MIT License. Refer to the LICENSE file for the full terms.

Installation

  1. Clone the Repository:

    git clone https://github.com/akumathedyn123/python-webpage-2-txt-web2txt.git
  2. Install Dependencies:

    Navigate to the project directory and install required libraries using pip:

    cd python-webpage-2-txt-web2txt
    pip install requests beautifulsoup4

Usage

1. Prepare a Text File (urls.txt):

Create a file named urls.txt in the project directory. List each website URL you want to scrape on separate lines.
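
For reference, a file in this one-URL-per-line format could be read as in the sketch below. The names `parse_url_lines` and `load_urls` are hypothetical. Skipping blank lines and lines that start with `#` is an assumption of this example, not documented behaviour of the script.

```python
from pathlib import Path


def parse_url_lines(text: str) -> list[str]:
    """Return the URLs found in `text`, one per line."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


def load_urls(path: str = "urls.txt") -> list[str]:
    """Read the URL list from a file in the project directory."""
    return parse_url_lines(Path(path).read_text(encoding="utf-8"))
```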

2. Run the Script:

Execute the script using Python:

python main.py

Output

  • The script creates a directory named output_dir (you can modify this name) to store extracted text files.
  • Each file is named after the corresponding webpage URL, ensuring traceability.
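
URLs contain characters such as `/`, `?` and `:` that are not safe in filenames, so some mapping from URL to filename is needed. The sketch below shows one possible scheme. The exact naming rule used by the script is not documented here, so `filename_for`, `save_text`, and the underscore substitution are assumptions. Only the `output_dir` default comes from the text above.

```python
import re
from pathlib import Path
from urllib.parse import urlsplit


def filename_for(url: str) -> str:
    """Derive a filesystem-safe .txt filename from a URL (hypothetical scheme)."""
    parts = urlsplit(url)
    raw = parts.netloc + parts.path
    if parts.query:
        raw += "_" + parts.query
    # Replace every run of characters outside [A-Za-z0-9.-] with one underscore.
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return (safe or "page") + ".txt"


def save_text(url: str, text: str, output_dir: str = "output_dir") -> Path:
    """Write `text` to a file named after `url` inside `output_dir`."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory on first use
    target = out / filename_for(url)
    target.write_text(text, encoding="utf-8")
    return target
```

With this scheme, two URLs that differ only in replaced characters map to the same filename, so a real implementation might append a short hash to keep names unique.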

Warning

This script is intended for educational purposes only. It is designed to help you learn scripting concepts and explore programming possibilities. It is not intended for production use or for any situation where unintended consequences could cause harm or legal problems. The author is not responsible for any damages or hazards caused by its use. Check the laws that apply in your jurisdiction before scraping any website.