Web Scraper for The Hacker News

This project provides a Python-based web scraper to extract articles from The Hacker News website.

Prerequisites

Install Miniconda with Python 3.8.
Create and activate a new Conda environment.
Install required dependencies using conda or pip.

Installation

Clone the repository:

git clone https://github.com/yotsuba9580/thehackernewsScrapeDemo.git
cd thehackernewsScrapeDemo

Install dependencies:
```
pip install -r requirements.txt
```
Set up a proxy if required (e.g., via Clash or any proxy service).

Usage

Run the Scraper:
```
python hackerNewsScrape.py
```
Output:
- Articles will be saved to articles.csv in the current directory.
- The scraper will also save the last processed date to last_processed_date.txt for resuming scraping later.
Resuming Scraping:
- The scraper automatically resumes from the last processed date if last_processed_date.txt exists. To restart from the beginning, delete this file.
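The output and resume behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names (`load_last_processed_date`, `save_last_processed_date`, `append_articles`) are hypothetical, and the date format stored in last_processed_date.txt is assumed to match the "Dec 23, 2024" style used in articles.csv.

```python
import csv
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional, Sequence

# Assumption: the checkpoint stores dates such as "Dec 23, 2024".
DATE_FORMAT = "%b %d, %Y"


def load_last_processed_date(path: Path) -> Optional[datetime]:
    """Return the saved checkpoint date, or None if no checkpoint exists."""
    if not path.exists():
        return None
    text = path.read_text(encoding="utf-8").strip()
    return datetime.strptime(text, DATE_FORMAT) if text else None


def save_last_processed_date(path: Path, date: datetime) -> None:
    """Overwrite the checkpoint file with the given date."""
    path.write_text(date.strftime(DATE_FORMAT), encoding="utf-8")


def append_articles(path: Path, rows: Iterable[Sequence[str]]) -> None:
    """Append (title, date, url) rows, writing the header only for a new file."""
    write_header = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["Title", "Date", "URL"])
        writer.writerows(rows)
```

With helpers like these, a run would call `load_last_processed_date(Path("last_processed_date.txt"))` at startup and start from the beginning when it returns `None`, which is why deleting the file restarts the scrape.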

How It Works

Fetching Pages:
- Uses requests to fetch HTML content from The Hacker News, with retries and delay to avoid rate limits.
Parsing HTML:
- Extracts article titles, URLs, and dates using lxml and XPath expressions.
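The two steps above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in this repository: the function names `fetch_html` and `parse_articles` are hypothetical, the retry count and delay are arbitrary, and the class names in the XPath expressions (`body-post`, `home-title`, `story-link`, `h-datetime`) are guesses at the site's markup that may need adjusting.

```python
import time
from typing import Callable, Dict, List, Optional

import requests
from lxml import html


def fetch_html(
    url: str,
    retries: int = 3,
    delay: float = 2.0,
    get: Callable = requests.get,
) -> Optional[str]:
    """Fetch a page, pausing `delay` seconds between failed attempts."""
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                time.sleep(delay)
    return None  # every attempt failed


def parse_articles(page_source: str) -> List[Dict[str, str]]:
    """Extract title, date, and URL from each post block in the page."""
    tree = html.fromstring(page_source)
    articles = []
    # Assumed markup: one <div class="body-post ..."> per article.
    for post in tree.xpath('//div[contains(@class, "body-post")]'):
        title = post.xpath('string(.//h2[contains(@class, "home-title")])').strip()
        url = post.xpath('string(.//a[contains(@class, "story-link")]/@href)').strip()
        date = post.xpath('string(.//span[contains(@class, "h-datetime")])').strip()
        if title and url:
            articles.append({"title": title, "date": date, "url": url})
    return articles
```

The `get` parameter exists only so the retry loop can be exercised without network access; in normal use the default `requests.get` is called.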

Example Output

Example of a row in articles.csv:

| Title | Date | URL |
| --- | --- | --- |
| Rockstar2FA Collapse Fuels Expansion of FlowerStorm Phishing-as-a-Service | Dec 23, 2024 | https://thehackernews.com/2024/12/rockstar2fa-collapse-fuels-expansion-of.html |

Customization

Ad Titles:
- Update the ad_titles list in hackerNewsScrape.py to add or remove titles that should be treated as ads.
Proxies:
- Configure the proxies dictionary in hackerNewsScrape.py to use your preferred proxy service.
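As a rough illustration of these two settings, the sketch below shows one possible shape for them. The names `ad_titles` and `proxies` come from the description above; the entries, the local proxy address (port 7890 is a common Clash default), and the `drop_ads` helper with its exact, case-insensitive title matching are all assumptions made for the example.

```python
from typing import Dict, Iterable, List

# Hypothetical entries: replace with titles that actually appear as ads.
ad_titles = [
    "Example Sponsored Webinar Title",
    "Example Promotional Post Title",
]

# Assumed local proxy endpoint; the dictionary layout follows the
# `proxies` argument accepted by requests.
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}


def drop_ads(
    articles: Iterable[Dict[str, str]],
    blocked_titles: Iterable[str],
) -> List[Dict[str, str]]:
    """Return articles whose title is not in the blocked list (case-insensitive)."""
    blocked = {title.strip().lower() for title in blocked_titles}
    return [a for a in articles if a["title"].strip().lower() not in blocked]
```

A dictionary in this form can be passed to requests directly, for example `requests.get(url, proxies=proxies, timeout=10)`.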

Project Structure

thehackernewsScrapeDemo/
├── hackerNewsScrape.py         # Main scraper script
├── requirements.txt            # Dependencies
├── articles.csv                # Output file (generated)
├── last_processed_date.txt     # Stores last processed date (generated)
└── README.md                   # Project documentation

Data Source

This project is designed to scrape articles from The Hacker News, a popular cybersecurity news website. The scraper targets publicly available information such as article titles, URLs, and publication dates.

Disclaimer

This project is for educational and research purposes only. Please ensure your usage complies with The Hacker News Terms of Service.
The copyright and ownership of all content belong to The Hacker News and its respective authors. This tool is not affiliated with or endorsed by The Hacker News.
