scrape_wikinews

Python script for scraping Wikinews articles.

Setup

Create a Python virtual environment and install the dependencies:

# create a virtual environment
$ python3 -m venv scrape_wikinews
# activate the virtual environment
$ source scrape_wikinews/bin/activate
# install dependencies
(scrape_wikinews) $ pip3 install -r requirements.txt
...

Usage

$ ./scrape_wikinews.py --help
usage: scrape_wikinews.py [-h] --category CATEGORY
                          [--log-level {notset,debug,info,warning,error,critical}]
                          [--log-file LOG_FILE]

Given a Wikinews category page URL, scrape all news articles from that
category as JSON lines. JSON lines (.jsonl) are written to stdout, and a log
of the accessed pages is written to stderr. To run, you might redirect stdout
to a .jsonl file and stderr to a .log file, such as the following:

# scrape all published articles (this will take a while)
$ ./scrape_wikinews.py --log-file=en-wikinews-published.log --category=Published > en-wikinews-published.jsonl

# scrape only the Health category
$ ./scrape_wikinews.py --log-file=en-wikinews-health.log --category=Health > en-wikinews-health.jsonl

options:
  -h, --help            show this help message and exit
  --category CATEGORY   en.wikinews category (e.g., "Health" or "Published",
                        see also:
                        https://en.wikinews.org/wiki/Special:Categories)
                        (default: None)
  --log-level {notset,debug,info,warning,error,critical}
                        logging level (default: info)
  --log-file LOG_FILE   path where logging messages will be written (default:
                        None)
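
For context, the sketch below shows one common way a scraper like this can walk a category: paginating through the standard MediaWiki API (action=query, list=categorymembers). It is an illustration only, not the actual implementation of scrape_wikinews.py, and the helper name iter_category_members is invented for this example.

#!/usr/bin/env python3
"""Hypothetical sketch of category pagination via the MediaWiki API."""
import requests

API_URL = "https://en.wikinews.org/w/api.php"

def iter_category_members(category):
    """Yield page titles in the given category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        resp = requests.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        # the API returns a "continue" token until all members have been listed
        if "continue" not in data:
            break
        params.update(data["continue"])

if __name__ == "__main__":
    for title in iter_category_members("Health"):
        print(title)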

Run

To run, first choose a Wikinews category to scrape.

You can see all categories here: https://en.wikinews.org/wiki/Special:Categories

For example, to scrape all published articles (10K+ article pages) via the Published category:

(scrape_wikinews) $ ./scrape_wikinews.py --log-file=en-wikinews-published.log --category=Published > en-wikinews-published.jsonl

Or to scrape just the Health category:

(scrape_wikinews) $ ./scrape_wikinews.py --log-file=en-wikinews-health.log --category=Health > en-wikinews-health.jsonl

Since the output is in JSON lines format, you can use jq and datamash to analyze the scraped data:

(scrape_wikinews) $ jq '.text|length' en-wikinews-published.jsonl | datamash --header-out {min,max,mean,pstdev}\ 1 | column -t
min(field-1)  max(field-1)  mean(field-1)    pstdev(field-1)
1             126956        2108.0609889609  3103.367482477
(scrape_wikinews) $ jq -r '.metadata.wikinews_categories[]|match(".*Category:(?<category>.*)")|.captures[].string' en-wikinews-published.jsonl | sort | uniq -c | sort -nr | head | column -t
21921  Published
19676  Archived
7755   Politics_and_conflicts
7449   United_States
6959   North_America
5556   Europe
4471   Crime_and_law
3655   Asia
2828   United_Kingdom
2773   Disasters_and_accidents
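
If you prefer Python to jq and datamash, the minimal sketch below computes the same statistics from the scraped file. It assumes only what the jq queries above already rely on: each JSON line has a "text" field and a "metadata.wikinews_categories" list of strings containing "Category:<name>".

#!/usr/bin/env python3
"""Minimal sketch: text-length stats and top categories from the .jsonl output."""
import json
import statistics
from collections import Counter

lengths = []
categories = Counter()

with open("en-wikinews-published.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        lengths.append(len(record["text"]))
        for cat in record["metadata"]["wikinews_categories"]:
            # entries contain "Category:<name>"; keep only the name part
            categories[cat.rsplit("Category:", 1)[-1]] += 1

print("min:", min(lengths))
print("max:", max(lengths))
print("mean:", statistics.mean(lengths))
print("pstdev:", statistics.pstdev(lengths))
for name, count in categories.most_common(10):
    print(count, name)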

Issues

  • This only supports English Wikinews (https://en.wikinews.org) for now.
  • In some cases, article body text or metadata may not be extracted cleanly.