Google News Scraper is a Python project that fetches and processes Google News articles. It is intended for research, news tracking, and sentiment analysis: it finds articles for a query, extracts their text, and prepares the data for analysis.
- Customizable News Search: Fetch articles based on your search query and time range.
- Automatic URL Decoding: Decodes Google News redirect URLs to obtain actual article links.
- Article Extraction: Extracts clean text content from the fetched articles.
- NLP Integration: Performs basic Natural Language Processing (NLP) on the extracted data (customizable).
- Progress Tracking: Displays a progress bar with tqdm for scraping and processing articles.
To get started, clone the repository and install the required dependencies:
git clone https://github.com/risabhmishra/google-news-scraper.git
cd google-news-scraper
pip3 install -r requirements.txt
Run the scraper with a simple command:
python3 google_news_scraper.py --query "Query Params" --time_delta "Time Delta"
- --query: Search term(s) for Google News (e.g., "AI technology").
- --time_delta: Filter articles by age (e.g., 24h for 24 hours, 7d for 7 days, or 120s for 120 seconds).
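The two options above could be read and validated roughly as follows. This is a minimal sketch, not the project's actual implementation: the names parse_time_delta and build_parser are hypothetical, and only the three suffixes documented above (s, h, d) are assumed to be supported.

```python
import argparse
import re
from datetime import timedelta

# Assumption: only the suffixes shown in this README are accepted.
_UNITS = {"s": "seconds", "h": "hours", "d": "days"}


def parse_time_delta(value: str) -> timedelta:
    """Convert a string such as '24h', '7d', or '120s' into a timedelta."""
    match = re.fullmatch(r"(\d+)([shd])", value.strip().lower())
    if match is None:
        raise ValueError(f"Unsupported time delta: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser exposing the two documented flags."""
    parser = argparse.ArgumentParser(description="Google News scraper (sketch)")
    parser.add_argument("--query", required=True, help="Search term(s)")
    # argparse calls parse_time_delta on the raw string and reports
    # a usage error if it raises ValueError.
    parser.add_argument("--time_delta", required=True, type=parse_time_delta,
                        help="Maximum article age, e.g. 24h, 7d, 120s")
    return parser
```

Parsing the value once at the command-line boundary means the rest of the code can work with a timedelta instead of re-interpreting the raw string.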
- Fetch News Links: Searches Google News for articles matching your query.
- Decode URLs: Decodes Google News redirect links to get the original article URLs.
- Extract Articles: Downloads and extracts clean text content from each article.
- Perform NLP: Applies basic NLP operations (customizable for advanced needs).
- Track Progress: Shows scraping and processing progress with a progress bar.
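The steps above can be sketched as a single loop that chains the stages. This is an illustrative outline only: run_pipeline and its four callable parameters are hypothetical stand-ins for the real fetch, decode, extract, and NLP code, and skipping an article when one stage fails is an assumed design choice, not confirmed behavior of the project.

```python
import logging
from typing import Callable, Dict, Iterable, List

Article = Dict[str, str]


def run_pipeline(
    query: str,
    fetch_links: Callable[[str], Iterable[str]],
    decode_url: Callable[[str], str],
    extract_article: Callable[[str], Article],
    analyze: Callable[[Article], Article],
) -> List[Article]:
    """Run fetch -> decode -> extract -> NLP for every link found for the query."""
    results: List[Article] = []
    for link in fetch_links(query):
        try:
            url = decode_url(link)          # redirect link -> original article URL
            article = extract_article(url)  # download and extract clean text
            results.append(analyze(article))
        except Exception as exc:
            # Assumed policy: one bad article should not abort the whole run.
            logging.warning("Skipping %s: %s", link, exc)
    return results
```

Passing the stages in as callables keeps the loop testable without network access and makes it straightforward to swap in a different NLP step.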
Search for news about "ceinsys tech ltd" from the past 15 days:
python3 google_news_scraper.py --query "ceinsys tech ltd" --time_delta "15d"
The scraper can be extended in several ways:
- Advanced NLP: Add sentiment analysis, keyword extraction, or summarization.
- Data Storage: Save results in formats like JSON, CSV, or a database.
- Automation: Schedule periodic scraping tasks with cron jobs or task schedulers.
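As one example of the data storage extension, the sketch below writes results to JSON and CSV using only the standard library. It assumes each article is a dict with the keys title, url, content, and timestamp, matching the sample output in this README; save_json and save_csv are hypothetical helper names, not functions provided by the project.

```python
import csv
import json
from typing import Dict, List

# Assumed field names, taken from the sample output in this README.
FIELDS = ["title", "url", "content", "timestamp"]


def save_json(articles: List[Dict[str, str]], path: str) -> None:
    """Write the articles as a pretty-printed JSON array."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(articles, handle, indent=2, ensure_ascii=False)


def save_csv(articles: List[Dict[str, str]], path: str) -> None:
    """Write the articles as CSV with one row per article."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops any keys outside FIELDS instead of failing.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(articles)
```

JSON preserves the nested structure if more fields are added later, while CSV is convenient for opening the results in a spreadsheet.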
A progress bar powered by tqdm shows the current stage and completion percentage:
[#### ] 25% Decoding URLs
Example output:
[
{
"title": "Ceinsys Tech achieves breakthrough in geospatial technology",
"url": "https://example.com/ceinsys-tech",
"content": "Ceinsys Tech Ltd has revolutionized geospatial solutions...",
"timestamp": "2024-01-01"
},
...
]
We welcome contributions to enhance this project! Check out our CONTRIBUTING.md for guidelines.
Reasons to use this tool:
- No need to manually search for and decode Google News links.
- Lightweight and customizable for a wide range of use cases.
- Perfect for news aggregation, research, and analytics.
This project is licensed under the MIT License.