
Develop new Koza API + general refactoring #163

Open
wants to merge 27 commits into
base: main


ptgolden
Member

[WIP description]

Patrick Golden added 27 commits January 10, 2025 13:01
Without making any changes to functionality, this separates a koza
configuration into a ReaderConfiguration, TransformConfiguration, and
WriterConfiguration, all contained within a KozaConfiguration.
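
A rough sketch of the nesting described above, using plain dataclasses. Only the four class names come from the commit message; every field name (`format`, `files`, `code`, `name`) is an assumption made for illustration, and the real classes may be built differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReaderConfiguration:
    # Hypothetical fields: how the input is read.
    format: str = "csv"
    files: List[str] = field(default_factory=list)


@dataclass
class TransformConfiguration:
    # Hypothetical field: where the transform code lives.
    code: Optional[str] = None


@dataclass
class WriterConfiguration:
    # Hypothetical field: how the output is written.
    format: str = "tsv"


@dataclass
class KozaConfiguration:
    # The top-level object simply contains the three parts.
    name: str
    reader: ReaderConfiguration = field(default_factory=ReaderConfiguration)
    transform: TransformConfiguration = field(default_factory=TransformConfiguration)
    writer: WriterConfiguration = field(default_factory=WriterConfiguration)


config = KozaConfiguration(
    name="example",
    reader=ReaderConfiguration(files=["data.csv"]),
)
```

Grouping settings this way means a reader only ever needs to see `config.reader`, not the whole configuration.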
The big changes are:

  1. Taking in a JSON{,L}ReaderConfig object for all configuration

  2. Defining iteration via `__iter__()` and `yield`
First, replaces the many named parameters with a single CSVReaderConfig
object.

Second, uses `__iter__()` and `yield` to define iteration.

Third, refactors the header consumption and validation code, and wraps
accessing the header in a property on the class.
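
The three points above could look roughly like this. It is a minimal sketch, not the reader in this PR: the `CSVReaderConfig` fields (`delimiter`, `columns`) and the validation rule are assumptions.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, TextIO


@dataclass
class CSVReaderConfig:
    # Hypothetical fields, standing in for the former named parameters.
    delimiter: str = ","
    columns: Optional[List[str]] = None  # expected header, if known


class CSVReader:
    def __init__(self, io_str: TextIO, config: CSVReaderConfig):
        self.config = config
        self._rows = csv.reader(io_str, delimiter=config.delimiter)
        self._header: Optional[List[str]] = None

    @property
    def header(self) -> List[str]:
        # The header is consumed and validated once, on first access.
        if self._header is None:
            header = next(self._rows, None)
            if header is None:
                raise ValueError("Input is empty; no header row found")
            expected = self.config.columns
            if expected is not None and header != expected:
                raise ValueError(f"Unexpected header: {header!r}")
            self._header = header
        return self._header

    def __iter__(self) -> Iterator[Dict[str, str]]:
        header = self.header
        for row in self._rows:
            yield dict(zip(header, row))


reader = CSVReader(io.StringIO("id,name\n1,a\n2,b\n"), CSVReaderConfig())
rows = list(reader)
```

Because iteration is a generator, rows are produced lazily and the header is only read when something first asks for it.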
This adds a new class: KozaRunner, which represents a new way of running
Koza transforms. It is a work in progress and still not at feature
parity with existing transforms.

Essentially, the KozaRunner class takes three parameters:

  1. Data (the data to be transformed)

  2. A function to transform that data, either all at once or row-by-row

  3. A writer that will do something with the transformed output

See the documentation in src/koza/runner.py for more details.
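
To make the three-part shape concrete, here is a toy runner. It is deliberately named `SketchRunner` because it is not the KozaRunner API; the keyword names (`transform`, `transform_record`), the `write()` method, and the in-memory writer are all assumptions. See src/koza/runner.py for the real interface.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Record = Dict[str, Any]


class ListWriter:
    """Hypothetical writer that just collects output in memory."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def write(self, records: Iterable[Record]) -> None:
        self.records.extend(records)


class SketchRunner:
    def __init__(
        self,
        data: Iterable[Record],
        writer: ListWriter,
        transform: Optional[Callable[[Iterable[Record]], Iterable[Record]]] = None,
        transform_record: Optional[Callable[[Record], Iterable[Record]]] = None,
    ) -> None:
        # Exactly one transform style must be supplied.
        if (transform is None) == (transform_record is None):
            raise ValueError("Provide exactly one of transform or transform_record")
        self.data = data
        self.writer = writer
        self.transform = transform
        self.transform_record = transform_record

    def run(self) -> None:
        if self.transform_record is not None:
            # Row-by-row: the function sees one record at a time.
            for record in self.data:
                self.writer.write(self.transform_record(record))
        else:
            # All at once: the function sees the whole data iterable.
            self.writer.write(self.transform(self.data))


writer = ListWriter()
SketchRunner(
    data=[{"id": "a"}, {"id": "b"}],
    writer=writer,
    transform_record=lambda r: [{"id": r["id"].upper()}],
).run()
```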
This commit makes multiple changes to koza.io.utils.open_resource

- Adds support for opening tar files.

- Handles archives (zip and tar) in the same way that the old
`file_archive` source configuration did: it assumes all files in an
archive are of the same format (CSV, JSONL, etc.). It will likely be
future work to allow a way to specify that only certain files in an
archive should be handled.

- Adds more robust checking for gzip compression than checking for a
`.gz` extension.

- open_resource() now returns one or more SizedResource objects. Each
one reports the size of the resource being opened and has a `.tell()`
method that reports the current read position in that resource. This
will be necessary to add some sort of progress bar in the future.

- Resources downloaded from the Web now use the same logic as local
files to check for compression/archives.

- Importantly, the resources returned by `open_resource` *are not
automatically closed*. This was inconsistent in the previous version. It
is up to the consumer of the function to explicitly close resources.

- Adds more tests for compressed and archival formats.

- Small typing changes for other koza.io.utils functions, adding
Optional where appropriate
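
A simplified sketch of three of the points above: detecting gzip by content rather than extension, exposing size and position, and leaving the close to the caller. It handles a single local file only (no archives or downloads). `SizedResourceSketch` and `open_local_resource` are hypothetical names, and the magic-byte check is one common way to detect gzip, not necessarily the check this commit uses.

```python
import gzip
import io
import os
from typing import BinaryIO, TextIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


class SizedResourceSketch:
    def __init__(self, name: str, size: int, reader: TextIO, raw: BinaryIO):
        self.name = name
        self.size = size      # size on disk, in bytes
        self.reader = reader  # decoded text stream for consumers
        self._raw = raw       # underlying file, used for progress

    def tell(self) -> int:
        # Position in the (possibly compressed) file on disk.
        return self._raw.tell()

    def close(self) -> None:
        # Nothing is closed automatically; the caller must call this.
        self.reader.close()
        self._raw.close()


def open_local_resource(path: str) -> SizedResourceSketch:
    raw = open(path, "rb")
    # Sniff the content instead of trusting a `.gz` extension.
    is_gzip = raw.read(2) == GZIP_MAGIC
    raw.seek(0)
    stream = gzip.GzipFile(fileobj=raw) if is_gzip else raw
    reader = io.TextIOWrapper(stream, encoding="utf-8")
    return SizedResourceSketch(path, os.path.getsize(path), reader, raw)
```

A consumer would typically wrap use of the resource in `try`/`finally` and call `close()` in the `finally` branch, and a progress bar could be driven by `tell() / size`.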
This was not working correctly with the discriminated union field
I realized at some point that creating a map from a reader file is just
a type of transform. This change in the configuration makes achieving
that possible.

A map transform is just a transform that relies on two additional
configuration keys: `key` and `values`. To make passing those values in
a YAML config possible, this commit makes it so that any extra fields in
the configuration are parsed into an `extra_fields` field in a
transform.
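
The idea of collecting undeclared keys can be sketched without any particular validation library. The declared fields here (`code`, `module`) are placeholders; only `key`, `values`, and `extra_fields` are named in the commit message, and the actual parsing in this PR may work differently.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass
class TransformConfigSketch:
    # Hypothetical declared fields.
    code: Optional[str] = None
    module: Optional[str] = None
    # Anything not declared above ends up here.
    extra_fields: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "TransformConfigSketch":
        known = {f.name for f in fields(cls)} - {"extra_fields"}
        declared = {k: v for k, v in raw.items() if k in known}
        extra = {k: v for k, v in raw.items() if k not in known}
        return cls(**declared, extra_fields=extra)


# Roughly what a parsed YAML transform section for a map might contain.
map_config = TransformConfigSketch.from_dict(
    {"code": "build_map.py", "key": "id", "values": ["column_a", "column_b"]}
)
```

A map transform can then read `map_config.extra_fields["key"]` and `map_config.extra_fields["values"]` without the base configuration needing to know about them.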
This makes config creation more lenient. Note that this means it's
possible to have an empty transform. The lack of a transform would be
detected when a KozaRunner is run.
Also remove unnecessary `files=[]` calls, since that is the default as
of eaff691.
This allows a transform to be defined as a module (resolvable from the
Python import path), e.g. `mypackage.transforms.example_transform`, rather than having
to define it as a file (`/home/user/code/mypackage/transforms/example_transform.py`).

This allows the possibility of creating generic transforms that can be
packaged, installed, and re-used, without having to track down the
filename of the python file where the transform code is located.
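
For illustration, resolving either form with the standard library could look like the following. `load_transform_module` is a hypothetical helper, and the rule for telling a file path from a module name is an assumption, not necessarily how Koza decides.

```python
import importlib
import importlib.util
from pathlib import Path
from types import ModuleType


def load_transform_module(target: str) -> ModuleType:
    """Load transform code from a .py file path or a dotted module name."""
    path = Path(target)
    if path.suffix == ".py" and path.is_file():
        # File form: build a module directly from the file on disk.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"Cannot load transform from {target!r}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    # Module form: anything importable, e.g. an installed package.
    return importlib.import_module(target)
```

With the module form, `load_transform_module("mypackage.transforms.example_transform")` works wherever the package is installed, regardless of where its files live.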
This commit builds on the changes in a60c607, bfa87d3, and eaff691.
It fully implements the mapping functionality that was present in the
previous method of writing transforms, although with a new API.

Instead of being given a large dict-of-dicts with mappings defined for
terms, a transform does map lookups through a method on the
KozaTransform object it receives, like so:

    def transform(koza: KozaTransform):
        term = "example"
        mapped_term = koza.lookup(term, "column_b")

...where the map was loaded from a CSV file that might look like this:

    id,column_a,column_b
    example,alias1,alias2

...resulting in mapped_term evaluating to `"alias2"`.
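
The example above can be reproduced end to end with a small stand-in. `build_lookup_map` and `LookupSketch` are hypothetical names, not the KozaTransform implementation; the `key` and `values` arguments mirror the map configuration keys mentioned earlier. What the real `lookup` does for an unknown term is not stated here, so this sketch simply lets a `KeyError` propagate.

```python
import csv
import io
from typing import Dict, Iterable, List


def build_lookup_map(
    rows: Iterable[Dict[str, str]], key: str, values: List[str]
) -> Dict[str, Dict[str, str]]:
    # Index each row by its key column, keeping only the value columns.
    return {row[key]: {col: row[col] for col in values} for row in rows}


class LookupSketch:
    def __init__(self, mapping: Dict[str, Dict[str, str]]) -> None:
        self._mapping = mapping

    def lookup(self, term: str, column: str) -> str:
        # Assumption: unknown terms or columns raise KeyError.
        return self._mapping[term][column]


map_csv = io.StringIO("id,column_a,column_b\nexample,alias1,alias2\n")
mapping = build_lookup_map(
    csv.DictReader(map_csv), key="id", values=["column_a", "column_b"]
)
koza = LookupSketch(mapping)
mapped_term = koza.lookup("example", "column_b")
```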