This project provides a Python-based web scraper to extract articles from The Hacker News website.
- Install Miniconda with Python 3.8.
- Create and activate a new Conda environment.
- Install required dependencies using `conda` or `pip`.
- Clone the repository: `git clone https://github.com/yotsuba9580/thehackernewsScrapeDemo.git`, then `cd thehackernewsScrapeDemo`
- Install dependencies: `pip install -r requirements.txt`
- Set up a proxy if required (e.g., via Clash or any proxy service).
- Run the Scraper: `python scraper.py`
- Output:
  - Articles are saved to `articles.csv` in the current directory.
  - The scraper also saves the last processed date to `last_processed_date.txt` so that scraping can be resumed later.
- Resuming Scraping:
  - The scraper automatically resumes from the last processed date if `last_processed_date.txt` exists. To restart from the beginning, delete this file.
- Fetching Pages:
  - Uses `requests` to fetch HTML content from The Hacker News, with retries and a delay to avoid rate limits.
- Parsing HTML:
  - Extracts article titles, URLs, and dates using `lxml` and XPath expressions.
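The fetch-and-parse steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names, the retry count and delay defaults, the injectable `get` parameter, and the sample markup with its class names (`post`, `title`, `date`) are all assumptions made for the example. The real XPath expressions depend on the site's current HTML.

```python
import time

import requests
from lxml import html


def fetch_html(url, retries=3, delay=2.0, proxies=None, get=requests.get):
    """Return the page body, retrying on network or HTTP errors.

    `get` is injectable (an assumption of this sketch) so the retry
    logic can be exercised without real network access.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url, proxies=proxies, timeout=10)
            response.raise_for_status()  # treat 4xx/5xx as failures
            return response.text
        except requests.RequestException as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # pause before the next attempt
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


def parse_articles(page_source):
    """Extract (title, date, url) tuples from one listing page."""
    tree = html.fromstring(page_source)
    articles = []
    # Hypothetical markup: one <div class="post"> per article.
    for post in tree.xpath('//div[@class="post"]'):
        title = post.xpath('string(.//h2[@class="title"]/a)').strip()
        date = post.xpath('string(.//span[@class="date"])').strip()
        hrefs = post.xpath('.//h2[@class="title"]/a/@href')
        articles.append((title, date, hrefs[0] if hrefs else ""))
    return articles


# Stand-in HTML used only to demonstrate the parser.
SAMPLE = """
<div class="post">
  <h2 class="title"><a href="https://example.com/2024/12/sample.html">Sample Title</a></h2>
  <span class="date">Dec 23, 2024</span>
</div>
"""

if __name__ == "__main__":
    for row in parse_articles(SAMPLE):
        print(row)
```

Keeping fetching and parsing in separate functions makes it possible to test the XPath expressions against saved HTML without sending requests to the site.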
Example of a row in `articles.csv`:

| Title | Date | URL |
|---|---|---|
| Rockstar2FA Collapse Fuels Expansion of FlowerStorm Phishing-as-a-Service | Dec 23, 2024 | https://thehackernews.com/2024/12/rockstar2fa-collapse-fuels-expansion-of.html |
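One way the output and resume behaviour could work is sketched below, using only the standard library. The helper names, the header row, and the assumption that `last_processed_date.txt` stores a date in the same `Dec 23, 2024` format as the CSV are illustrative guesses, not a description of the actual script.

```python
import csv
from datetime import datetime
from pathlib import Path

DATE_FORMAT = "%b %d, %Y"  # matches dates such as "Dec 23, 2024"


def load_last_processed_date(path="last_processed_date.txt"):
    """Return the saved date, or None when the file is absent (fresh start)."""
    date_file = Path(path)
    if not date_file.exists():
        return None
    text = date_file.read_text(encoding="utf-8").strip()
    return datetime.strptime(text, DATE_FORMAT)


def save_progress(rows, csv_path="articles.csv", date_path="last_processed_date.txt"):
    """Append (title, date, url) rows and record the date of the last row."""
    if not rows:
        return
    csv_file = Path(csv_path)
    write_header = not csv_file.exists()  # header only for a new file
    with csv_file.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(["Title", "Date", "URL"])
        writer.writerows(rows)
    Path(date_path).write_text(rows[-1][1], encoding="utf-8")
```

Deleting `last_processed_date.txt` makes `load_last_processed_date` return `None`, which a caller can treat as "start from the beginning".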
- Ad Titles:
  - Update the `ad_titles` list in `scraper.py` to add or remove titles treated as ads.
- Proxies:
  - Configure the `proxies` dictionary in `scraper.py` to use your preferred proxy service.
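As a rough illustration of these two settings, the sketch below shows what the values might look like. The ad titles, the local proxy address and port, the `drop_ads` helper, and the use of exact title matching are all placeholders chosen for the example. The dictionary shape (scheme mapped to proxy URL) is the form that `requests` accepts through its `proxies` argument.

```python
# Placeholder values; replace them with your own.
ad_titles = [
    "Sponsored Post",
    "Example Webinar Title",
]

# Scheme-to-proxy mapping, as accepted by requests.get(url, proxies=proxies).
# The address below assumes a proxy client listening locally.
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}


def drop_ads(articles, blocked_titles):
    """Return only the (title, date, url) rows whose title is not blocked.

    Exact matching is an assumption; the real script may compare differently.
    """
    blocked = {title.strip() for title in blocked_titles}
    return [row for row in articles if row[0].strip() not in blocked]


if __name__ == "__main__":
    rows = [
        ("Sponsored Post", "Dec 23, 2024", "https://example.com/ad"),
        ("Real Article", "Dec 23, 2024", "https://example.com/real"),
    ]
    print(drop_ads(rows, ad_titles))
```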
thehackernewsScrapeDemo/
├── hackerNewsScrape.py # Main scraper script
├── requirements.txt # Dependencies
├── articles.csv # Output file (generated)
├── last_processed_date.txt # Stores last processed date (generated)
└── README.md # Project documentation
This project is designed to scrape articles from The Hacker News, a popular cybersecurity news website. The scraper targets publicly available information such as article titles, URLs, and publication dates.
- This project is for educational and research purposes only. Please ensure your usage complies with The Hacker News Terms of Service.
- The copyright and ownership of all content belong to The Hacker News and its respective authors. This tool is not affiliated with or endorsed by The Hacker News.