Synthesizers

A meta library for synthetic data generation.

The goal of synthesizers is to simplify the use of existing frameworks for synthetic data generation:

  • All basic operations are available as functional and pipeline abstractions that transform states.
  • States keep track of datasets, models, and evaluation results.
  • A meta pipeline allows for very simple but expressive synthetic data generation.
  • Datasets are read from CSV, TSV, JSON, JSONL, Python Pickle (.pickle), and Excel (.xlsx) files.
  • Datasets can be downloaded from the Hugging Face Hub.
  • States, including datasets and models, can be saved to and loaded from disk.
  • Datasets can be converted between list, NumPy, Pandas, and Hugging Face datasets formats.
  • Datasets are automatically converted to the input format of synthesis or evaluation backends.
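
To make the idea of operations that transform states more concrete, here is a minimal, purely illustrative sketch. The class ToyState and its split method are hypothetical and are not the synthesizers implementation; the sketch assumes that each operation returns a new state and leaves the previous one untouched, which may differ from what the library actually does.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class ToyState:
    """Hypothetical stand-in for a state; NOT the synthesizers State class."""

    train: Optional[Sequence[Any]] = None
    test: Optional[Sequence[Any]] = None
    model: Any = None
    synth: Optional[Sequence[Any]] = None
    eval_result: Any = None

    def split(self, size: float) -> "ToyState":
        """Return a new state with train cut into train and test parts."""
        if self.train is None:
            raise ValueError("no train dataset to split")
        cut = int(len(self.train) * size)
        # replace() copies the frozen dataclass with the given fields changed.
        return replace(self, train=self.train[:cut], test=self.train[cut:])


rows = list(range(10))
initial = ToyState(train=rows)
state = initial.split(size=0.8)
print(len(state.train), len(state.test))
```

Because the toy state is immutable, initial can still be reused after the split, which mirrors how intermediate states are reused in the usage examples.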

Installation

Simply install synthesizers using pip from PyPI:

pip install synthesizers

If you cloned or downloaded the source code, you can also install it from the root directory of the repository:

pip install .

Or download and install directly from the terminal:

pip install https://github.com/schneiderkamplab/synthesizers/archive/refs/heads/main.zip

To ensure the right dependencies, it is often preferable to create a virtual environment (here the directory venv in the current directory):

python -m virtualenv venv
. venv/bin/activate
pip install synthesizers

Conda is a popular alternative:

conda create -n synthesizers python=3.11
conda activate synthesizers
pip install synthesizers

Usage

Functional abstraction

The functional abstraction manipulates states that are initialized by the predefined Load object and transformed by functions such as the meta function Synthesize:

from synthesizers import Load
Load("mstz/breast").Synthesize(split_size=0.8, gen_count=10000, eval_target_col="is_cancer", save_name="breast.xlsx", save_key="synth")

In this case, Load loads a dataset on breast cancer from the Hugging Face Hub, resulting in a state containing just a train dataset. This state is then expanded by the Synthesize function, which splits the train dataset into train and test datasets, trains a GAN model, generates a synth dataset, computes eval information, and saves the synthetic data to an Excel file.

The meta function Synthesize can be broken up into separate functions for the individual steps:

from synthesizers import Load
Load("mstz/breast").Split(size=0.8).Train().Generate(count=10000).Evaluate(target_col="is_cancer").Save(name="breast.xlsx", key="synth")

This version can be used to reuse intermediate states, e.g., to generate and save synthetic datasets of different sizes with the same trained model:

from synthesizers import Load
state = Load("mstz/breast").Split(size=0.8).Train()
for count in (100, 1000, 10000, 100000):
    state.Generate(count=count).Save(name=f"breast-{count}.csv", key="synth")

It is also useful when it is necessary to store the intermediate state to the file system:

from synthesizers import Load
state = Load("mstz/breast").Split(size=0.8).Train().Save("breast_state")

The saved state can be loaded and resumed as one might expect:

from synthesizers import Load
Load("breast_state").Generate(count=10000).Save(name="breast.csv", key="synth")

The count parameter can be a list or another iterable sequence, indicating that multiple synthetic datasets should be created. The following code will save two synthetic datasets to breast_1000.csv and breast_100000.csv:

from synthesizers import Load
Load("breast_state").Generate(count=[1000,100000]).Save(name="breast_1000.csv", index=0, key="synth").Save(name="breast_100000.csv", index=1, key="synth")
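
For longer lists of counts, the chained Save calls can be written as a loop. The following is a minimal sketch: the helper save_synth_datasets is hypothetical and not part of synthesizers. It assumes that Save returns a state (as the chaining above suggests) and that the i-th synthetic dataset corresponds to the i-th entry of count.

```python
from typing import Any, Sequence


def save_synth_datasets(state: Any, counts: Sequence[int], stem: str) -> Any:
    """Save each synthetic dataset of `state` to '<stem>_<count>.csv'.

    Hypothetical helper: `state` is assumed to be the result of
    Generate(count=counts), so index i refers to counts[i].
    """
    for index, count in enumerate(counts):
        # Only the keyword arguments shown in the example above are used.
        state = state.Save(name=f"{stem}_{count}.csv", index=index, key="synth")
    return state
```

Under these assumptions, save_synth_datasets(Load("breast_state").Generate(count=counts), counts, "breast") with counts = [1000, 100000] would issue the same two Save calls as the example above.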

Multiple parameters are also allowed for the plugin parameter of Train and the size parameter of Split.

Furthermore, the Load function takes either a single dataset or a tuple of datasets. The optional jobs parameter (with variants train_jobs, eval_jobs, etc.) sets the number of concurrent processes. In the following example, we generate synthetic versions of two different splits of two different datasets:

from synthesizers import Load
Load(("mstz/titanic","mstz/breast")).Synthesize(split_size=[0.5,0.8], train_jobs=4, do_eval=False).Save("mstz")
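
The number of resulting jobs can be reasoned about with a small sketch. It assumes that every dataset is paired with every split size (a Cartesian product), which matches "two different splits of two different datasets"; the order in which synthesizers actually processes or stores the combinations is not confirmed here.

```python
from itertools import product

datasets = ("mstz/titanic", "mstz/breast")
split_sizes = (0.5, 0.8)

# Assumption: each dataset is combined with each split size.
combinations = list(product(datasets, split_sizes))
for name, size in combinations:
    print(f"{name}: split_size={size}")
```

With two datasets and two split sizes this yields four combinations, so train_jobs=4 would allow one process per combination under this assumption.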

Pipeline abstraction

Internally, the functional abstraction instantiates pipelines to accomplish its functionality. These pipelines can be used as an expressive alternative. Here is a usage example with the synthesis meta pipeline, which again loads the breast cancer dataset from the Hugging Face Hub, trains a GAN model, generates 10,000 synthetic records, evaluates them, and saves them as a JSON file:

from synthesizers import pipeline
pipeline("synthesize", split_size=0.8, gen_count=10000, eval_target_col="is_cancer", save_name="breast.json", save_key="synth")("mstz/breast")

The meta pipeline pools the functionality of multiple base pipelines. The same functionality as in the above example might be accomplished with base pipelines:

from synthesizers import pipeline
state = pipeline("split", size=0.8)("mstz/breast")
state = pipeline("train")(state)
state = pipeline("generate", count=10000)(state)
state = pipeline("evaluate", target_col="is_cancer")(state)
state = pipeline("identity", save_name="breast.json", save_key="synth")(state)
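
Since each base pipeline is used as a callable that takes a dataset identifier or a state and returns a state, a sequence of them can be folded over an input. The helper run_stages below is a hypothetical sketch, not part of synthesizers, and works with any callables:

```python
from functools import reduce
from typing import Any, Callable, Iterable


def run_stages(stages: Iterable[Callable[[Any], Any]], source: Any) -> Any:
    """Feed `source` through each stage in order and return the final result."""
    return reduce(lambda state, stage: stage(state), stages, source)
```

With the library, stages would be a list such as [pipeline("split", size=0.8), pipeline("train"), pipeline("generate", count=10000)] and source would be "mstz/breast"; this usage is an assumption based on the calls shown above.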

Pipelines are more than an internal representation: they also make it possible to reuse settings, e.g., by keeping a pipeline for training CTGANs. The following example also illustrates that the functional and pipeline abstractions can be combined freely:

from synthesizers import Load, pipeline
s1 = Load("mstz/breast").Split()
s2 = Load("julien-c/titanic-survival").Split()
train = pipeline("train", plugin="ctgan")
train(s1).Generate(count=1000).Save(name="breast.jsonl", key="synth")
train(s2).Generate(count=1000).Save(name="titanic.jsonl", key="synth")

The plugins depend on the backend used. The standard backend for generation is synthcity, which offers a variety of plugins including adsgan, ctgan, tvae, and bayesian_network. For evaluation, the standard backend is SynthEval.

Ideas for future development

  • add the possibility to use methods from multiple backends by allowing multiple adapters (mapping method name to adapter)
  • make sure all parameters can be iterables/sequences where it makes sense (e.g. target_col)
  • check argument validity before running pipeline
  • improved error handling (e.g. evaluating without synth dataset, training without train dataset etc.)
  • add source and meta to StateDict with initial data source and parameters to reproduce
  • revamp loading and saving to a more useful format, e.g., pickle everything to one file instead of directories
  • implement overwrite parameter to State with Load(overwrite=...), three values:
    • copy: add new state if a value would be overwritten
    • overwrite: just overwrite the value
    • raise: raise an error if a value would be overwritten
  • implement TabularSynthesisDPPipeline
  • use benchmark module from syntheval?
  • standardized list of supported metrics (supported by any backend)
  • standardized list of supported generation methods (supported by any backend)
  • accumulation of multiple outputs (model, synth, and eval as lists)
  • select and combine evaluation backends automatically for given list of metrics
  • select generation backend automatically for given generation method
  • make syntheval plots available as PIL images
  • push_to_hub method on models a la https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py
  • push_to_hub method on datasets
  • R synthpop as backend
  • integration of other backends
  • put string options as literals so they are more visible in tooltips
  • docstrings for all modules used in the examples
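
As an illustration of two of the ideas above (the overwrite parameter and string options as literals), here is a minimal sketch on a plain dictionary. Everything in it is hypothetical: set_value and OverwriteMode do not exist in synthesizers, and the reading of copy as "return a new state and leave the old one untouched" is only one possible interpretation.

```python
from typing import Any, Dict, Literal

# A Literal type lets editors show the allowed string options in tooltips.
OverwriteMode = Literal["copy", "overwrite", "raise"]


def set_value(
    state: Dict[str, Any], key: str, value: Any, mode: OverwriteMode = "raise"
) -> Dict[str, Any]:
    """Store `value` under `key`, handling an existing value according to `mode`."""
    if mode not in ("copy", "overwrite", "raise"):
        raise ValueError(f"unknown overwrite mode: {mode!r}")
    if key not in state or mode == "overwrite":
        # Nothing is lost, or overwriting was requested: update in place.
        state[key] = value
        return state
    if mode == "copy":
        # Keep the old state intact and return a new one with the new value.
        new_state = dict(state)
        new_state[key] = value
        return new_state
    raise KeyError(f"value for {key!r} would be overwritten")
```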
