Sinagot is a Python library for managing script-based data processing on a dataset. It batches script runs through a simple API, and data processing can be parallelized with Dask.distributed.
Sinagot is available on PyPI:
pip install sinagot
Documentation: https://sinagot.readthedocs.io/en/latest/
Datasets are structured around a few core concepts: record, subset, task, modality and script. A record, identified by its unique ID, corresponds to a recording session in which experimental tasks are performed, generating data of various modalities. The raw data of a record are processed with scripts to generate more useful data.
The idea of Sinagot emerged from the data management needs of an EEG platform called SoNeTAA. For documentation purposes, the SoNeTAA dataset structure is used as an example.
On SoNeTAA, a record is identified by an ID that carries date information, in the format REC-[YYMMDD]-[A-Z], for example "REC-200331-A".
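The ID format can be checked and decoded with a few lines of standard-library Python. This is a minimal sketch and not part of Sinagot: the names `RECORD_ID_PATTERN` and `parse_record_id` are hypothetical, and it assumes the six digits are a two-digit year, month and day.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of the Sinagot API.
# Matches IDs such as "REC-200331-A": six digits, then one capital letter.
RECORD_ID_PATTERN = re.compile(r"REC-(\d{6})-([A-Z])")


def parse_record_id(record_id):
    """Return (recording date, letter suffix) for a SoNeTAA-style ID."""
    match = RECORD_ID_PATTERN.fullmatch(record_id)
    if match is None:
        raise ValueError(f"not a valid record ID: {record_id!r}")
    date_part, letter = match.groups()
    # strptime also rejects impossible dates such as month 13.
    recorded_on = datetime.strptime(date_part, "%y%m%d").date()
    return recorded_on, letter


print(parse_record_id("REC-200331-A"))
```

Malformed IDs such as "REC-2003-A" raise a ValueError instead of being silently accepted.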
For each record, 3 tasks are performed: "RS", "MMN" and "HDC". Two main modalities handle data for every task: "EEG" and "clinical". A third one, "behavior", exists only for HDC.
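The task and modality layout described above can be written down as a plain mapping, which makes the valid task/modality pairs easy to enumerate. This is only an illustration of the structure, not Sinagot's configuration format; the name `MODALITIES_BY_TASK` is made up for this sketch.

```python
# Task -> modalities, transcribed from the SoNeTAA description above.
# This is an illustrative structure, not Sinagot's config format.
MODALITIES_BY_TASK = {
    "RS": ["EEG", "clinical"],
    "MMN": ["EEG", "clinical"],
    "HDC": ["EEG", "clinical", "behavior"],
}

# Every valid (task, modality) pair for one record.
pairs = [
    (task, modality)
    for task, modalities in MODALITIES_BY_TASK.items()
    for modality in modalities
]

for task, modality in pairs:
    print(task, modality)
```

The pair ("HDC", "behavior") appears in the result, while ("RS", "behavior") does not, since "behavior" data exist only for HDC.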
Import the Dataset class:
>>> from sinagot import Dataset
A Dataset instance needs 3 things:
- A config file in TOML format.
- A folder containing the dataset.
- A folder containing all the scripts.
To instantiate a dataset, pass the config file path as argument:
>>> ds = Dataset('/path/to/conf')
>>> ds
<Dataset instance | task: None, modality: None>
Make sure that the dataset path and the scripts path are correctly set in the config file.
You can list all record IDs:
>>> for record_id in ds.ids():
...     print(record_id)
...
REC-200331-A
REC-200331-B
Create a Record instance. For a specific record:
>>> rec = ds.get('REC-200331-A')
>>> rec
<Record instance | id: REC-200331-A, task: None, modality: None>
Or the first record found:
>>> ds.first()
<Record instance | id: REC-200331-B, task: None, modality: None>
Records are not sorted by their IDs, so the first record found is not necessarily the one with the lowest ID.
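If a deterministic order is needed, the IDs can be sorted before use. The sketch below uses a literal list as a stand-in for list(ds.ids()), so that it runs without a configured dataset; it assumes the IDs are plain strings, as in the listing above.

```python
# Stand-in for list(ds.ids()); the real call needs a configured dataset.
ids = ["REC-200331-B", "REC-200331-A"]

# The REC-[YYMMDD]-[A-Z] format has a fixed width, so plain string
# sorting orders IDs by date (within one century), then by letter.
ordered = sorted(ids)
print(ordered[0])  # REC-200331-A
```

The first element of the sorted list can then be passed to ds.get() in place of calling ds.first().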
You can run all scripts for each record of the dataset:
>>> ds.run()
2020-03-31 16:03:58,869 : Begin step run
...
2020-03-31 16:03:58,869 : Step run finished
Or for a single record:
>>> rec.run()
2020-03-31 16:06:57,313 : Begin step run
...
2020-03-31 16:06:57,314 : Step run finished
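Between the whole dataset and a single record, the calls shown so far can be combined to process only some records, for example those of one day. The helper below is a sketch rather than a Sinagot feature: `run_day` is a hypothetical name, and it relies only on the ids(), get() and run() calls used above plus the SoNeTAA ID format.

```python
def run_day(dataset, yymmdd):
    """Run all scripts for the records of one day, e.g. '200331'.

    `dataset` is expected to behave like the `ds` object above:
    ids() yields ID strings and get(id) returns an object with run().
    Returns the list of processed IDs.
    """
    prefix = f"REC-{yymmdd}-"
    processed = []
    for record_id in dataset.ids():
        if record_id.startswith(prefix):
            dataset.get(record_id).run()
            processed.append(record_id)
    return processed


# Usage, assuming `ds` was created as shown earlier:
# run_day(ds, "200331")
```

Records are processed in the order ids() yields them, which is not sorted.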
Each dataset or record has subscopes corresponding to its tasks and modalities, accessible as attributes named after the scope.
For example, to select only the task RS of the dataset:
>>> ds.RS
<Subset instance | task: RS, modality: None>
A dataset subscope is a subset.
Or the EEG modality of a record:
>>> rec.EEG
<Record instance | id: REC-200331-A, task: None, modality: EEG>
You can select a specific task/modality pair (called a unit), in either order:
>>> ds.RS.EEG
<Subset instance | task: RS, modality: EEG>
>>> ds.EEG.RS
<Subset instance | task: RS, modality: EEG>
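To see why the order of selection does not matter, it can help to model a scope as a pair of independent fields, each narrowed by its own attribute access. The toy class below is an assumption made for illustration only; it is not Sinagot's implementation, and `ToyScope`, `TASKS` and `MODALITIES` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Scope names taken from the SoNeTAA example in this document.
TASKS = ("RS", "MMN", "HDC")
MODALITIES = ("EEG", "clinical", "behavior")


@dataclass(frozen=True)
class ToyScope:
    """Toy stand-in for a dataset scope; NOT Sinagot's real class."""

    task: Optional[str] = None
    modality: Optional[str] = None

    def __getattr__(self, name):
        # Only called when normal lookup fails, so the `task` and
        # `modality` fields themselves never reach this method.
        if name in TASKS:
            return replace(self, task=name)
        if name in MODALITIES:
            return replace(self, modality=name)
        raise AttributeError(name)


# Narrowing task then modality, or modality then task, ends up equal.
print(ToyScope().RS.EEG == ToyScope().EEG.RS)  # True
```

Because each attribute access changes only one of the two fields, both chains end with task RS and modality EEG. This toy does not check that the pair is valid for the dataset.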