Develop new Koza API + general refactoring #163

ptgolden · 2025-01-14T21:15:21Z

[WIP description]

Without making any changes to functionality, this separates a koza configuration into a ReaderConfiguration, TransformConfiguration, and WriterConfiguration, all contained within a KozaConfiguration.
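As a rough illustration of the nesting described above, the layout could be sketched as follows. Standard-library dataclasses stand in for whatever model classes the project actually uses, and every field name here (`files`, `format`, `code`, `name`) is an assumption made for the example, not the real schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReaderConfiguration:
    # Hypothetical fields: where the data comes from and how to parse it.
    files: List[str] = field(default_factory=list)
    format: str = "csv"


@dataclass
class TransformConfiguration:
    # Hypothetical field: location of the transform code, if any.
    code: Optional[str] = None


@dataclass
class WriterConfiguration:
    # Hypothetical field: output serialization format.
    format: str = "tsv"


@dataclass
class KozaConfiguration:
    # The top-level object only groups the three sub-configurations.
    name: str
    reader: ReaderConfiguration = field(default_factory=ReaderConfiguration)
    transform: TransformConfiguration = field(default_factory=TransformConfiguration)
    writer: WriterConfiguration = field(default_factory=WriterConfiguration)


config = KozaConfiguration(
    name="example",
    reader=ReaderConfiguration(files=["data.csv"]),
)
```

Each stage reads only its own section (`config.reader`, `config.transform`, `config.writer`), which is the separation the refactoring is after.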

The big changes are:

1. Taking in a JSON{,L}ReaderConfig object for all configuration
2. Defining iteration via `__iter__()` and `yield`
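A minimal sketch of that pattern for a JSONL source might look like this. The class names mirror the description, but the constructor signature and the `required_properties` option are assumptions for illustration only.

```python
import io
import json
from dataclasses import dataclass
from typing import IO, Any, Dict, Iterator, List, Optional


@dataclass
class JSONLReaderConfig:
    # Hypothetical option: keys that every record must contain.
    required_properties: Optional[List[str]] = None


class JSONLReader:
    def __init__(self, io_str: IO[str], config: JSONLReaderConfig):
        self.io = io_str
        self.config = config

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # A generator-based __iter__ replaces a hand-written __next__.
        for line in self.io:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            for key in self.config.required_properties or []:
                if key not in record:
                    raise ValueError(f"Missing required property: {key}")
            yield record


source = io.StringIO('{"id": "a"}\n\n{"id": "b"}\n')
rows = list(JSONLReader(source, JSONLReaderConfig(required_properties=["id"])))
```

Because `__iter__()` is a generator function, the reader needs no manual iteration state, and all options travel in one config object.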

First, replaces the many named parameters with a single CSVReaderConfig object. Second, uses `__iter__()` and `yield` to define iteration. Third, refactors the header consumption and validation code, and wraps accessing the header in a property on the class.
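The three changes can be sketched together as below. This is an illustrative reader built on the standard `csv` module, not the project's implementation; the `delimiter` and `columns` options and the validation rule are assumptions.

```python
import csv
import io
from dataclasses import dataclass
from typing import IO, Dict, Iterator, List, Optional


@dataclass
class CSVReaderConfig:
    # Hypothetical options, bundled instead of passed as named parameters.
    delimiter: str = ","
    columns: Optional[List[str]] = None  # expected header, if known


class CSVReader:
    def __init__(self, io_str: IO[str], config: CSVReaderConfig):
        self.config = config
        self._rows = csv.reader(io_str, delimiter=config.delimiter)
        self._header: Optional[List[str]] = None

    @property
    def header(self) -> List[str]:
        # Consume and validate the header once, on first access.
        if self._header is None:
            header = next(self._rows, None)
            if header is None:
                raise ValueError("Input is empty; no header found")
            expected = self.config.columns
            if expected is not None and header != expected:
                raise ValueError(f"Unexpected header: {header}")
            self._header = header
        return self._header

    def __iter__(self) -> Iterator[Dict[str, str]]:
        header = self.header
        for values in self._rows:
            yield dict(zip(header, values))


reader = CSVReader(
    io.StringIO("id,name\n1,foo\n2,bar\n"),
    CSVReaderConfig(columns=["id", "name"]),
)
rows = list(reader)
```

Accessing `reader.header` before or after iteration returns the same list, because the property caches the consumed header line.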

This adds a new class: KozaRunner, which represents a new way of running Koza transforms. It is a work in progress and still not at feature parity with existing transforms. Essentially, the KozaRunner class takes three parameters:

1. Data (the data to be transformed)
2. A function to transform that data, either all at once or row-by-row
3. A writer that will do something with the transformed output

See the documentation in src/koza/runner.py for more details.
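To make the three-part shape concrete, here is a small stand-in runner. It is deliberately named `MiniRunner` because it is not the KozaRunner API; the parameter names, the `ListWriter` helper, and the function signatures are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Record = Dict[str, Any]


class ListWriter:
    """Hypothetical writer that simply collects what it is given."""

    def __init__(self) -> None:
        self.written: List[Record] = []

    def write(self, record: Record) -> None:
        self.written.append(record)


class MiniRunner:
    """Illustrative only: data + transform function + writer."""

    def __init__(
        self,
        data: Iterable[Record],
        writer: ListWriter,
        transform: Optional[Callable[[Iterable[Record], ListWriter], None]] = None,
        transform_record: Optional[Callable[[Record, ListWriter], None]] = None,
    ) -> None:
        # Exactly one style of transform may be supplied.
        if (transform is None) == (transform_record is None):
            raise ValueError("Provide exactly one of transform or transform_record")
        self.data = data
        self.writer = writer
        self.transform = transform
        self.transform_record = transform_record

    def run(self) -> None:
        if self.transform is not None:
            # All-at-once: the function receives the whole data stream.
            self.transform(self.data, self.writer)
        else:
            # Row-by-row: the function is called once per record.
            for record in self.data:
                self.transform_record(record, self.writer)


def upper_name(record: Record, writer: ListWriter) -> None:
    writer.write({**record, "name": record["name"].upper()})


writer = ListWriter()
MiniRunner(
    data=[{"name": "a"}, {"name": "b"}],
    writer=writer,
    transform_record=upper_name,
).run()
```

Keeping the writer as an injected object is what lets the same transform write to a file in production and to an in-memory list in tests.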

This commit makes multiple changes to koza.io.utils.open_resource:

- Adds support for opening tar files.
- Handles archives (zip and tar) in the same way that the old `file_archive` source configuration did: it assumes all files in an archive are of the same format (CSV, JSONL, etc.). It will likely be future work to allow a way to specify that only certain files in an archive should be handled.
- Adds more robust checking for gzip compression than checking for a `.gz` extension.
- open_resource() now returns one or more SizedResource objects that indicate the size of the resource being opened, and a `.tell()` method that indicates the position being read in that resource. This will be necessary to add some sort of progress bar in the future.
- Resources downloaded from the Web now use the same logic as local files to check for compression/archives.
- Importantly, the resources returned by `open_resource` *are not automatically closed*. This was inconsistent in the previous version. It is up to the consumer of the function to explicitly close resources.
- Adds more tests for compressed and archival formats.
- Small typing changes for other koza.io.utils functions, adding Optional where appropriate.
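Two of these points, content-based gzip detection and a resource that reports its size and read position, can be sketched with the standard library. This is an assumption-laden illustration: the field names on `SizedResource`, the `open_bytes` helper, and the use of the gzip magic number are choices made for the example and may differ from what the commit does.

```python
import gzip
import io
from dataclasses import dataclass
from typing import BinaryIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(stream: BinaryIO) -> bool:
    """Check the leading bytes for the gzip magic number, then rewind."""
    start = stream.tell()
    magic = stream.read(2)
    stream.seek(start)
    return magic == GZIP_MAGIC


@dataclass
class SizedResource:
    # Hypothetical layout: the raw stream tracks progress, the reader yields text.
    name: str
    size: int
    raw: BinaryIO
    reader: io.TextIOBase

    def tell(self) -> int:
        # Position in the underlying (possibly compressed) stream.
        return self.raw.tell()

    def close(self) -> None:
        # The consumer is responsible for calling this explicitly.
        self.reader.close()
        self.raw.close()


def open_bytes(name: str, payload: bytes) -> SizedResource:
    """Hypothetical helper: wrap in-memory bytes the way a file might be wrapped."""
    raw = io.BytesIO(payload)
    if is_gzipped(raw):
        text = io.TextIOWrapper(gzip.GzipFile(fileobj=raw, mode="rb"), encoding="utf-8")
    else:
        text = io.TextIOWrapper(raw, encoding="utf-8")
    return SizedResource(name=name, size=len(payload), raw=raw, reader=text)


resource = open_bytes("example.csv.gz", gzip.compress(b"id,name\n1,foo\n"))
content = resource.reader.read()
position = resource.tell()
resource.close()  # not closed automatically
```

Comparing `tell()` against `size` gives a fraction of the compressed input consumed, which is the kind of signal a progress bar needs.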

This was not working correctly with the discriminated union field

I realized at some point that creating a map from a reader file is just a type of transform. This change in the configuration makes achieving that possible. A map transform is just a transform that relies on two additional configuration keys: `key` and `values`. To make passing those values in a YAML config possible, this commit makes it so that any extra fields in the configuration are parsed into an `extra_fields` field in a transform.
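The idea of routing unrecognized keys into `extra_fields` can be sketched as follows. A plain dataclass with a `from_dict` helper stands in for the real configuration model; the `code` field and the helper itself are assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass
class TransformConfig:
    # Hypothetical declared field.
    code: Optional[str] = None
    # Anything not declared above ends up here.
    extra_fields: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "TransformConfig":
        known = {f.name for f in fields(cls)} - {"extra_fields"}
        declared = {k: v for k, v in raw.items() if k in known}
        extra = {k: v for k, v in raw.items() if k not in known}
        return cls(**declared, extra_fields=extra)


# What a parsed YAML transform section for a map might contain.
map_config = TransformConfig.from_dict(
    {"key": "id", "values": ["column_a", "column_b"]}
)
```

A map transform can then read `map_config.extra_fields["key"]` and `map_config.extra_fields["values"]` without the base configuration having to know about them.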

This makes config creation more lenient. Note that this means it's possible to have an empty transform. The lack of a transform would be detected when a KozaRunner is run.

Also remove unnecessary `files=[]` calls, since that is the default as of eaff691.

This allows a transform to be defined as a module (resolvable from the Python import path), e.g. `mypackage.transforms.example_transform`, rather than having to define it as a file (`/home/user/code/mypackage/transforms/example_transform.py`). This allows the possibility of creating generic transforms that can be packaged, installed, and reused, without having to track down the filename of the Python file where the transform code is located.
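One way to support both forms is to try the target as a file first and fall back to a dotted module name. The helper below is a sketch using only `importlib`; its name and the file-versus-module heuristic are assumptions, not the project's code.

```python
import importlib
import importlib.util
from pathlib import Path
from types import ModuleType


def load_transform_module(target: str) -> ModuleType:
    """Load transform code from a .py file path or a dotted module name."""
    path = Path(target)
    if path.suffix == ".py" and path.is_file():
        # File form: build a module directly from the file location.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"Cannot load transform from {target}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    # Module form: resolved through the normal import machinery.
    return importlib.import_module(target)


# Any importable module works; a standard-library one is used for illustration.
module = load_transform_module("json.decoder")
```

With the module form, an installed package can ship transforms that a configuration refers to by name alone.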

Addresses #137

This commit builds on the changes in a60c607, bfa87d3, and eaff691. It fully implements the mapping functionality that was present in the previous method of writing transforms, although with a new API. Instead of being given a large dict-of-dicts with mappings defined for terms, a method is passed via the KozaTransform object used in a transform, where a map lookup is done like so:

    def transform(koza: KozaTransform):
        term = "example"
        mapped_term = koza.lookup(term, "column_b")

...where the map was loaded from a CSV file that might look like this:

    id,column_a,column_b
    example,alias1,alias2

...resulting in mapped_term evaluating to `"alias2"`.
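The lookup behavior described above can be reproduced with a small self-contained stand-in. `LookupTransform` and `load_map` are hypothetical names, and returning the original term when no mapping exists is an assumption, not confirmed behavior.

```python
import csv
import io
from typing import IO, Dict, List


def load_map(source: IO[str], key: str, values: List[str]) -> Dict[str, Dict[str, str]]:
    """Build {key value: {column: value}} from a CSV source."""
    mapping: Dict[str, Dict[str, str]] = {}
    for row in csv.DictReader(source):
        mapping[row[key]] = {column: row[column] for column in values}
    return mapping


class LookupTransform:
    """Hypothetical stand-in for the object passed to a transform."""

    def __init__(self, mapping: Dict[str, Dict[str, str]]) -> None:
        self._mapping = mapping

    def lookup(self, term: str, column: str) -> str:
        entry = self._mapping.get(term)
        if entry is None or column not in entry:
            # Assumption: unmapped terms are returned unchanged.
            return term
        return entry[column]


map_csv = io.StringIO("id,column_a,column_b\nexample,alias1,alias2\n")
koza = LookupTransform(load_map(map_csv, key="id", values=["column_a", "column_b"]))
mapped_term = koza.lookup("example", "column_b")
```

The `key` and `values` arguments correspond to the two extra configuration keys a map transform relies on.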

Patrick Golden added 27 commits January 10, 2025 13:01

Refactor main koza configuration file

db4fc50

Without making any changes to functionality, this separates a koza configuration into a ReaderConfiguration, TransformConfiguration, and WriterConfiguration, all contained within a KozaConfiguration.

Refactor JSON{,L} readers

88f40e8

The big changes are:

1. Taking in a JSON{,L}ReaderConfig object for all configuration
2. Defining iteration via `__iter__()` and `yield`

Refactor the CSV reader

cdbc6f6

First, replaces the many named parameters with a single CSVReaderConfig object. Second, uses `__iter__()` and `yield` to define iteration. Third, refactors the header consumption and validation code, and wraps accessing the header in a property on the class.

Refactor writers to use WriterConfig rather than named parameters

0f0bd6b

Move metadata from transform config to KozaConfig

ce1a47e

Add ability to override output format when creating a KozaRunner

caa10f4

Make CSV the default type of reader

cb3e0c6

This was not working correctly with the discriminated union field

Add a PassthroughWriter that simply stores written data from a transform

0ceb760

Add back missing header_delimiter option in CSVReaderConfig

73fa492

Provide defaults for all Config options

c12442a

This makes config creation more lenient. Note that this means it's possible to have an empty transform. The lack of a transform would be detected when a KozaRunner is run.

Add tests to CSVReader

939769e

Also remove unnecessary `files=[]` calls, since that is the default as of eaff691.

Add ability to override configuration fields from a transform YAML

b5617a3

Addresses #137

Move transform integration tests to new API

0d98af1

Remove unused tests and test configuration using old API

df7baa2

Fix filter tests

2bcb4e2

Re-implement row_limit

fcf8314

Fix testing for sources with multiple files

cadffde

Remove unused MapDict class

626979f

Remove unused TranslationTable class

645835c

Break up model/config/source_config.py into smaller modules

9cc6678

Move CLI to new API

1c2d8fb

Add a bit more logging to the runner to match messages in cli_utils.py

a8525fe

ptgolden mentioned this pull request Jan 14, 2025

What is the purpose of "is not None" in the the transform.py idiom? #157

Open
