Introduction
This Python script extracts text content from webpages and saves it to separate files. It uses the requests and BeautifulSoup libraries for efficient web scraping and HTML parsing. It is useful for creating datasets for ML or AI training.
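The extraction approach described above can be sketched as follows. This is a minimal illustration, not the project's actual main.py; the function names and the sample HTML are hypothetical.

```python
# Minimal sketch of fetching a page with requests and extracting its
# visible text with BeautifulSoup. Names here are illustrative.
import requests
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Parse HTML and return its visible text, one line per text block."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style tags, which hold no readable content.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def fetch_text(url: str) -> str:
    """Download a page and return its extracted text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_text(response.text)


# Parsing works on any HTML string, no network required:
print(extract_text("<html><body><h1>Title</h1><p>Hello</p></body></html>"))
```

Separating parsing (`extract_text`) from fetching (`fetch_text`) makes the text-extraction step easy to test without network access.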
Disclaimer
- Ethical Use: It's crucial to adhere to the terms of service and robots.txt guidelines of the websites you intend to scrape. Respecting these guidelines ensures responsible and ethical use of this script.
- No Warranties: This script is provided "as is" without any express or implied warranties. The author disclaims any liability for any damages arising from its use.
License
This project is licensed under the MIT License. Refer to the LICENSE file for the full terms.
Installation
- Clone the Repository:
git clone https://github.com/akumathedynd/python-webpage-2-txt-web2txt.git
- Install Dependencies:
Navigate to the project directory and install the required libraries using pip:
cd python-webpage-2-txt-web2txt
pip install requests beautifulsoup4
Usage
1. Prepare a Text File (urls.txt):
Create a file named urls.txt in the project directory. List each website URL you want to scrape on a separate line.
2. Run the Script:
Execute the script using Python:
python main.py
Output
- The script creates a directory named output_dir (you can modify this name) to store the extracted text files.
- Each file is named after the corresponding webpage URL, ensuring traceability.
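Naming a file after a URL requires replacing characters that are invalid in filenames. The sketch below shows one way this could work; it is an assumption about the naming scheme, not the script's actual implementation, and `url_to_filename` is a hypothetical helper.

```python
# Sketch: derive a filesystem-safe output filename from a webpage URL.
# The exact scheme used by main.py may differ; this is illustrative.
import re
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Turn a URL into a safe .txt filename based on its host and path."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Replace any run of characters unsafe in filenames with an underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "page") + ".txt"


print(url_to_filename("https://example.com/docs/intro"))
```

Keeping the host and path in the filename preserves the traceability mentioned above: each output file can be matched back to the URL it came from.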