A set of utility scripts for processing Wikipedia-related data.
A script for parsing Wikipedia mysqldump sql.gz files. It can be extended to parse arbitrary mysqldump files.
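
To give a sense of what the parsing involves, here is a minimal Python sketch (not the script's actual implementation): it streams a gzipped dump and splits each INSERT statement into raw row strings. The naive split on `),(` will break on string values that happen to contain that sequence, which is one reason the real script is more involved.

import gzip

def iter_rows(path):
    """Yield raw row strings from INSERT statements in a .sql.gz dump."""
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("INSERT INTO"):
                continue
            # Drop the "INSERT INTO `table` VALUES (" prefix and the
            # trailing ");", then split "(...),(...)" into single rows.
            values = line[line.index("(") + 1 : line.rindex(")")]
            for row in values.split("),("):
                yield row

# Example: print the first few rows of the categorylinks dump.
for i, row in enumerate(iter_rows("enwiki-20170920-categorylinks.sql.gz")):
    print(row)
    if i == 4:
        break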
usage: parse_mysqldump.py [-h] [--column-indexes COLUMN_INDEXES]
                          filename filetype outputfile

positional arguments:
  filename              name of the wikipedia sql.gz file.
  filetype              following filetypes are supported: [categorylinks,
                        pagelinks, redirect, category, page_props, page]
  outputfile            name of the output file

optional arguments:
  -h, --help            show this help message and exit
  --column-indexes COLUMN_INDEXES, -c COLUMN_INDEXES
                        column indexes to use in output file
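
For example, to parse the categorylinks dump (the output filename here is illustrative):

python parse_mysqldump.py enwiki-20170920-categorylinks.sql.gz categorylinks categorylinks.txt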
Alternatively, run the following command in bash to split a dump into one database row per line. Note that the first and last lines of each INSERT statement will still contain some non-column information (the INSERT INTO ... VALUES ( prefix and the closing );).
zcat enwiki-20170920-categorylinks.sql.gz | grep $'^INSERT INTO ' | sed 's/),(/\n/g' | less -N
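
To drop that non-column text as well, one option (a sketch assuming GNU sed and the standard mysqldump format of INSERT INTO `table` VALUES (...);) is to strip the prefix and suffix before splitting:

zcat enwiki-20170920-categorylinks.sql.gz | grep $'^INSERT INTO ' | sed -e 's/^INSERT INTO `[^`]*` VALUES (//' -e 's/);$//' -e 's/),(/\n/g' | less -N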