ENB Curator

A modular pipeline for transforming and curating the Estonian National Bibliography (ENB) into a structured, analysis-ready dataset. This project supports large-scale, reproducible data processing to facilitate research into Estonian bibliographic data.

Authors: Krister Kruusmaa, Peeter Tinits, Laura Nemvalts
Institution: National Library of Estonia
License: MIT

Related publication: [TBA]

If you just want to download the datasets:

Books: https://doi.org/10.5281/zenodo.14083327
Persons: https://doi.org/10.5281/zenodo.14094584

Overview

The ENB Curator pipeline is designed for data transformation in three key stages:

Harvesting - Retrieves MARC21XML records via OAI-PMH directly from the National Library of Estonia.
Conversion - Converts MARC records to a tabular format.
Curation - Applies cleaning, harmonization, and enrichment operations to produce a coherent dataset suitable for data analysis.
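The harvesting stage relies on OAI-PMH, where a `ListRecords` request returns records in batches and a `resumptionToken` element signals that more batches follow. The paging logic can be sketched as below. This is an illustration of the protocol, not the project's own harvester: the function names, the `marc21xml` metadata prefix, and the sample response are assumptions, and the endpoint URL is left as a parameter.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up response fragment, used only to demonstrate token extraction.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, collection=None, token=None, prefix="marc21xml"):
    """Build a ListRecords URL for the first page or for a continuation.

    The default prefix is an assumption; check the endpoint's
    ListMetadataFormats response for the value it actually expects.
    """
    if token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it must
        # not be combined with metadataPrefix or set.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": prefix}
        if collection:
            params["set"] = collection
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token of a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    node = root.find(".//oai:resumptionToken", OAI_NS)
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()
```

A harvester built on these helpers would request `list_records_url(base_url, collection)`, store the returned records, and keep requesting `list_records_url(base_url, token=...)` until `next_token` returns `None`.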

This pipeline is meant to be reproducible, modular, and scalable. It can be adapted for different bibliographic datasets or extended with new processing modules as needed.
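To give an idea of what the conversion stage involves, the sketch below flattens a single MARC21XML record into one table row, with one column per tag and subfield code. It is a simplified illustration rather than the project's converter: the `record_to_row` function, the `tag$code` column naming, and the sample record are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MARC21 "slim" XML schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up record for demonstration; not taken from the ENB.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">b12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="4">
    <subfield code="a">topic one</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="4">
    <subfield code="a">topic two</subfield>
  </datafield>
</record>"""


def record_to_row(record_xml, sep="; "):
    """Flatten one MARC21XML record into a {column: value} dict."""
    root = ET.fromstring(record_xml)
    collected = {}
    # Control fields (001-009) carry a value directly, without subfields.
    for field in root.findall("marc:controlfield", MARC_NS):
        collected.setdefault(field.get("tag"), []).append(field.text or "")
    # Data fields are split into one column per tag and subfield code.
    for field in root.findall("marc:datafield", MARC_NS):
        for sub in field.findall("marc:subfield", MARC_NS):
            column = f"{field.get('tag')}${sub.get('code')}"
            collected.setdefault(column, []).append(sub.text or "")
    # Repeated fields are joined so that every column holds a single string.
    return {column: sep.join(values) for column, values in collected.items()}


row = record_to_row(SAMPLE_RECORD)
```

Joining repeated fields with a separator keeps the row flat. A real converter may instead keep them as lists or move them to separate tables, depending on the analysis it is meant to support.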

Requirements

Python 3.8+ (3.9.12 recommended)

Installation

Clone the repository:

git clone https://github.com/RaRa-digiLab/enb-curator.git
cd enb-curator

Install required packages:
```
pip install -r requirements.txt
```

Usage

Running the pipeline

For books:
```
python main.py "enb_books"
```
For persons:
```
python main.py "persons"
```

After a successful run, you can collect the curated, up-to-date dataset from ./data/curated. The pipeline works with other collections as well (see ./config/collections.json for all available metadata collections of the National Library of Estonia). However, the curation module currently supports only the books and persons datasets. Other collections can be harvested and converted, but they will be curated as if they were books. This can cause mismatches and suboptimal decisions in the curation process, so we recommend reviewing the relevant functions in the curation module to account for them.

Running parts of the pipeline separately

It is also possible to run just one module of the pipeline, e.g. if you have changed the curation script and want to rerun it without having to download and convert the dataset again:

python src/curate.py "enb_books"
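If you process several collections regularly, the commands above can be wrapped in a small helper script. The sketch below is not part of the repository: `STAGE_SCRIPTS`, `build_command`, and `run` are hypothetical names, and only the two entry points mentioned in this README (`main.py` and `src/curate.py`) are included.

```python
import subprocess
import sys

# Only main.py and src/curate.py are named in this README; scripts for
# other stages would have to be added after checking the src directory.
STAGE_SCRIPTS = {"all": "main.py", "curate": "src/curate.py"}


def build_command(collection, stage="all"):
    """Return the argument list for running one collection through a stage."""
    if stage not in STAGE_SCRIPTS:
        raise ValueError(f"unknown stage: {stage!r}")
    # sys.executable reuses the interpreter (and environment) that runs
    # this helper, so the installed requirements are picked up.
    return [sys.executable, STAGE_SCRIPTS[stage], collection]


def run(collections, stage="all"):
    """Run the collections one after another, stopping at the first failure."""
    for collection in collections:
        subprocess.run(build_command(collection, stage), check=True)
```

Called from the repository root, `run(["enb_books", "persons"])` would process both datasets in sequence, and `run(["enb_books"], stage="curate")` would repeat only the curation step for books.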

Adapting and contributing

If you want to adapt the pipeline to your own dataset, you can try running the existing commands on your data files or an OAI access point. If you want to curate a new part of the ENB (like maps or sheet music), you are free to reuse the code. Feel free to contact us; perhaps we can help.

We welcome contributions to improve the pipeline and the quality of the curated dataset! For smaller changes, such as correcting coordinates or updating mappings, it is easiest to make the changes directly in the relevant files located in the ./config directory. This ensures the pipeline uses the correct data in subsequent runs.

If you're familiar with GitHub, feel free to submit a pull request or open an issue. If you're unsure how to do this, don't hesitate to reach out to us - we're happy to help!
