Data management

Discussion paper

Add comments like this:

Kai: I think this is really good!

Current state

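Currently there are only the official datasets, each described by a Markdown file with a metadata header (referred to as YAML in this paper, although the +++ delimited example below uses TOML syntax): https://github.com/wisslab/judaicalink-site/tree/master/content/datasets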

Rationale: This is great for long-term preservation, as only static files are needed to keep JudaicaLink up.

This is an example of the currently available metadata:

+++
author = "Maral Dadvar"
authorlink = "http://wiss.iuk.hdm-stuttgart.de/people/maral-dadvar"
date = "2018-10-04T10:37:07+02:00"
title = "Geographical Coordinates" 
dataslug = "geo-coordinates"
graph = "http://data.judaicalink.org/data/geo-coor"
loaded = true
category = "judaicalink"
example = "http://data.judaicalink.org/data/Aachen"
[[files]]
	url = "http://data.judaicalink.org/dumps/geo-coordinates/current/city-geocoor-05.ttl.gz"		
	description = "Geographical Coordinates of Cities"	
+++

Problems:

We have three identifiers here: name of the file (in this case geo-coordinates.md), dataslug and graph.
We have no clear naming scheme or even directory structure for the data files.
Data files have no name (do they need one?)
The "loaded" flag is probably better maintained somewhere else. Currently our deployment scripts use it to decide which data is loaded into the official endpoint. But the labs database also has "loaded" and "indexed" flags.
Classification of non-RDF data is insufficient. We have at least RDF, ElasticSearch, Beacon and probably some sort of full-text format soon.
Do we need a classification into original data and enrichment? Manual vs. automatic?
Is there always one and only one slug per dataset?
- Interlink and enrichment datasets have statements about URIs with many different slugs.
- Possible rule: The slug indicates in which dataset a URI has been created for the first time.
- Any use case for more than one slug?
- Any use case for one slug used in several datasets for URI coining?

Kai: One slug per dataset will cause problems, e.g. when someone wants to provide a dataset that is also based on GND or DBpedia. So we will have to go for a list of slugs per dataset, and one slug can occur in several datasets. This means: slug != dataset identifier.

Where does the provenance go? In the YAML metadata or in RDF? As RDF within the data files or in addition to them? On file level or only on dataset level?
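Kai's comment above implies a many-to-many relation between slugs and datasets. A minimal sketch of that relation, assuming a plain in-memory structure rather than the labs database: the class `Dataset`, the helper `datasets_by_slug` and the dataset identifiers used in the example are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Dataset:
    identifier: str  # unique per dataset, independent of any slug
    slugs: list[str] = field(default_factory=list)  # slugs of the URIs the dataset talks about


def datasets_by_slug(datasets: list[Dataset]) -> dict[str, list[str]]:
    """Build the reverse index: slug -> identifiers of datasets using it."""
    index: dict[str, list[str]] = defaultdict(list)
    for dataset in datasets:
        for slug in dataset.slugs:
            index[slug].append(dataset.identifier)
    return dict(index)


# Hypothetical datasets: two of them make statements about GND-based URIs.
catalogue = [
    Dataset("gnd-persons", ["gnd"]),
    Dataset("gnd-interlinks", ["gnd", "dbpedia"]),
]
index = datasets_by_slug(catalogue)
print(index)
```

With such an index, the slug no longer has to double as the dataset identifier; it only answers the question of which datasets contain statements about URIs with that slug.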

Going beyond these datasets:

We need "local" datasets in labs, i.e. a way to introduce and use datasets without necessarily publishing them on the main site.
Maybe: support multiple sites, so that researchers could create and publish their own datasets?

Proposals for new structure

There is only one identifier per dataset which is used for filename, slug and graph.
Removal of the loaded flag (can we replace it with a JavaScript lookup against the labs backend?)
Strict filename policy, components:
- dataset slug,
- name,
- optional number or subname
Ex: geo-coordinates/city-05.ttl.gz (allowed file types for RDF: ttl, nt, ttl.gz, nt.gz; all are Turtle-compatible)
Should we repeat the dataset slug in the filename to make it globally unique?
Should we enforce a clear separator, e.g. underscore between elements, so that dashes can be used in components and the slug?
We drop the version (current) for now, as we do not use it. We can still add directories for older versions if we need them.
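The proposed filename policy could be enforced with a small validator. The sketch below assumes the form shown in the example, i.e. slug, a slash, name, an optional dash plus number, and one of the allowed RDF extensions. It does not cover subnames or the underscore separator, which are still open questions. The names `DataFileName` and `parse_data_file` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class DataFileName(NamedTuple):
    slug: str
    name: str
    number: Optional[str]  # kept as a string to preserve leading zeros
    extension: str


# Lowercase words joined by dashes, e.g. "geo-coordinates".
_COMPONENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
_PATTERN = re.compile(
    rf"^(?P<slug>{_COMPONENT})/"
    r"(?P<name>[a-z0-9]+(?:-[a-z0-9]+)*?)"  # lazy, so a trailing number is split off
    r"(?:-(?P<number>\d+))?"
    r"\.(?P<extension>ttl\.gz|nt\.gz|ttl|nt)$"
)


def parse_data_file(path: str) -> Optional[DataFileName]:
    """Return the components of a policy-conforming path, else None."""
    match = _PATTERN.match(path)
    if match is None:
        return None
    return DataFileName(**match.groupdict())


print(parse_data_file("geo-coordinates/city-05.ttl.gz"))
print(parse_data_file("geo-coordinates/current/city-geocoor-05.ttl.gz"))
```

The first call yields the components of the proposed example path; the second, which uses the current layout with a version directory, is rejected. Note that with dashes as the only separator, a name ending in a purely numeric part cannot be told apart from a name plus number, which is one argument for a dedicated separator between components.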

Ideas for local datasets

All data is generated with commands. We need good library support so that all commands generate all files according to our structure and directly generate provenance information.
There can be local datasets. All datasets are maintained in the Labs database, i.e. via Django.
Datasets, especially local datasets, can be exported as Markdown/Yaml files.
The file structure should be the same on the server and locally, i.e., no http_data_judaica... names. Instead, we maintain a list of mirrors where data files can be found; the local mirror would be the configurable directory used by Django.
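The mirror idea can be sketched as a lookup that maps one relative path to every place where the file may be found. This is only an illustration under assumptions: the function `candidate_locations` is hypothetical, the remote base URL is taken from the dump URL in the metadata example, and the resulting remote URL follows the proposed layout, which does not exist yet.

```python
import tempfile
from pathlib import Path

# Assumed remote mirror, derived from the existing dump URLs.
REMOTE_MIRRORS = ["http://data.judaicalink.org/dumps/"]


def candidate_locations(relative_path, local_root, remote_mirrors=REMOTE_MIRRORS):
    """List locations for a data file, the local mirror first if the file exists there."""
    candidates = []
    local_file = Path(local_root) / relative_path
    if local_file.is_file():
        candidates.append(str(local_file))
    candidates.extend(base.rstrip("/") + "/" + relative_path for base in remote_mirrors)
    return candidates


# Demonstration with a temporary directory standing in for the Django data directory.
with tempfile.TemporaryDirectory() as root:
    relative = "geo-coordinates/city-05.ttl.gz"
    before = candidate_locations(relative, root)  # file not yet mirrored locally
    target = Path(root) / relative
    target.parent.mkdir(parents=True)
    target.write_bytes(b"")
    after = candidate_locations(relative, root)  # local copy now listed first

print(before)
print(len(after))
```

Because the relative path is identical on every mirror, the same dataset metadata works unchanged on the server and on a researcher's machine; only the list of mirrors differs.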

Judaicalink - https://www.judaicalink.org
