Dataset Implementation Overview

Nic edited this page Jun 1, 2024 · 2 revisions

The CORE rules engine now performs all dataset manipulation and validation through special classes called Dataset Implementations. This abstraction lets the engine switch between underlying dataset representations (pandas, dask, etc.) without updating the entire application. It also lets the engine choose the representation best suited to validation based on dataset characteristics such as size and format.

How it works

All functions that read or write a dataset now expect to operate on one or more instances of a Dataset Interface implementation.

The Dataset Interface class defines all methods the engine needs for manipulating the underlying dataset representation.

NOTE: Any newly added dataset method must be implemented on every Dataset Interface implementation.
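The relationship between the interface and its implementations can be sketched as below. This is a minimal, dependency-free illustration, not the engine's actual code: the class names, the three methods, and the list/dict storage are all assumptions standing in for the real pandas- and dask-backed classes. It also shows one way the NOTE above can be enforced: with an abstract base class, an implementation that omits a method cannot be instantiated.

```python
from abc import ABC, abstractmethod


class DatasetInterface(ABC):
    """Hypothetical interface; the engine's real class defines more methods."""

    @abstractmethod
    def columns(self):
        """Return the column names as a list."""

    @abstractmethod
    def __len__(self):
        """Return the number of rows."""

    @abstractmethod
    def get_column(self, name):
        """Return the values of one column as a list."""


class RowListDataset(DatasetInterface):
    """Hypothetical backend that stores data as a list of row dicts."""

    def __init__(self, rows):
        self._rows = list(rows)

    def columns(self):
        return list(self._rows[0].keys()) if self._rows else []

    def __len__(self):
        return len(self._rows)

    def get_column(self, name):
        return [row[name] for row in self._rows]


class ColumnDictDataset(DatasetInterface):
    """Hypothetical backend that stores data as a dict of column lists."""

    def __init__(self, cols):
        self._cols = {key: list(values) for key, values in cols.items()}

    def columns(self):
        return list(self._cols)

    def __len__(self):
        return len(next(iter(self._cols.values()), []))

    def get_column(self, name):
        return list(self._cols[name])


class IncompleteDataset(DatasetInterface):
    """Implements only one method, so Python refuses to instantiate it."""

    def columns(self):
        return []
```

Callers see the same behaviour from either backend: `RowListDataset([{"AGE": 34}]).get_column("AGE")` and `ColumnDictDataset({"AGE": [34]}).get_column("AGE")` both return `[34]`, while `IncompleteDataset()` raises `TypeError`.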

Additionally, the engine now includes business rules check operators specific to Dataset Interface implementations. These operators expect to receive and operate on a Dataset Implementation rather than a pandas dataframe.
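As a rough illustration of what such an operator might look like, the sketch below defines a hypothetical `equal_to` operator that relies only on a `get_column` method instead of pandas-specific calls. Both the operator signature and the `ListDataset` stand-in are assumptions made for this example; the engine's real operators and interface methods may differ.

```python
class ListDataset:
    """Hypothetical stand-in for a Dataset Implementation."""

    def __init__(self, columns):
        self._columns = columns

    def get_column(self, name):
        return list(self._columns[name])


def equal_to(dataset, target, comparator):
    """Hypothetical check operator: returns one boolean per row.

    It touches the dataset only through get_column, so any backend
    that provides that method can be passed in.
    """
    return [value == comparator for value in dataset.get_column(target)]
```

For example, `equal_to(ListDataset({"AESEV": ["MILD", "SEVERE"]}), "AESEV", "MILD")` returns `[True, False]`.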

Caching updates

Because it is costly to read datasets from disk, especially when they are large, we want to keep them in cache as long as possible. To avoid evicting datasets when performing less costly operations, such as aggregations or retrieving metadata from the CDISC Library, the engine's in-memory cache now has a distinct cache for datasets.

Methods that return full datasets now use the cached_dataset decorator. This decorator adds datasets to the dataset-specific cache instance, which has its own method for calculating the size of the underlying datasets.
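The idea can be sketched as follows. This is not the engine's implementation of cached_dataset: the `DatasetCache` class, the least-recently-used eviction policy, the byte limit, the single `path` argument, and the `sys.getsizeof`-based size calculation are all assumptions chosen to keep the example self-contained.

```python
import functools
import sys
from collections import OrderedDict


class DatasetCache:
    """Hypothetical size-bounded cache reserved for full datasets."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._items = OrderedDict()  # key -> (dataset, size in bytes)

    @staticmethod
    def size_of(dataset):
        # Stand-in size calculation; a real cache would ask the
        # underlying dataset representation for its memory usage.
        return sum(sys.getsizeof(row) for row in dataset)

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key][0]

    def add(self, key, dataset):
        if key in self._items:
            self.current_bytes -= self._items.pop(key)[1]
        size = self.size_of(dataset)
        self._items[key] = (dataset, size)
        self.current_bytes += size
        # Evict least recently used datasets until back under the limit,
        # always keeping the dataset that was just added.
        while self.current_bytes > self.max_bytes and len(self._items) > 1:
            _, (_, evicted_size) = self._items.popitem(last=False)
            self.current_bytes -= evicted_size


dataset_cache = DatasetCache(max_bytes=10_000)  # assumed limit


def cached_dataset(func):
    """Sketch of a decorator that stores results in the dataset cache."""

    @functools.wraps(func)
    def wrapper(path):
        cached = dataset_cache.get(path)
        if cached is not None:
            return cached
        dataset = func(path)
        dataset_cache.add(path, dataset)
        return dataset

    return wrapper


read_calls = []  # records how often the expensive read actually runs


@cached_dataset
def read_dataset(path):
    """Hypothetical reader; pretends to load a dataset from disk."""
    read_calls.append(path)
    return [{"USUBJID": str(i)} for i in range(3)]
```

Calling `read_dataset("dm.xpt")` twice triggers only one read, and the second call returns the same object from the dataset cache. Because this cache is separate from the one holding metadata or aggregation results, filling those does not push datasets out.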

Adding a new Dataset Implementation

  1. Create an implementation of Dataset Interface.
  2. Update get_dataset_implementation to return the new implementation under the correct conditions.
  3. Update the unit tests (especially the check operator unit tests) to include tests for the new dataset implementation.
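The steps above can be sketched as follows. Everything here is illustrative: the class names, the size threshold, and the signature given to get_dataset_implementation are assumptions, and the new backend simply subclasses the existing one to keep the example short. A real implementation would wrap its own dataset representation.

```python
class InMemoryDataset:
    """Hypothetical existing backend (for example, pandas-style)."""

    def __init__(self, columns):
        self._columns = columns

    def get_column(self, name):
        return list(self._columns[name])


class ChunkedDataset(InMemoryDataset):
    """Step 1: hypothetical new backend (for example, dask-style)."""


LARGE_DATASET_BYTES = 100 * 1024 * 1024  # assumed threshold, not the engine's


def get_dataset_implementation(size_in_bytes, threshold=LARGE_DATASET_BYTES):
    """Step 2: return the new implementation when its conditions are met."""
    if size_in_bytes >= threshold:
        return ChunkedDataset
    return InMemoryDataset


ALL_IMPLEMENTATIONS = [InMemoryDataset, ChunkedDataset]


def check_all_implementations():
    """Step 3: run the same assertion against every implementation."""
    for implementation in ALL_IMPLEMENTATIONS:
        dataset = implementation({"AGE": [34, 51]})
        assert dataset.get_column("AGE") == [34, 51], implementation.__name__
```

In a pytest suite, the loop in `check_all_implementations` would more commonly be written with `pytest.mark.parametrize` over the list of implementations, so each backend is reported as a separate test case.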