Develop new Koza API + general refactoring #163
Open: ptgolden wants to merge 27 commits into main from koza-api-new
Without making any changes to functionality, this separates a koza configuration into a ReaderConfiguration, TransformConfiguration, and WriterConfiguration, all contained within a KozaConfiguration.
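As a rough illustration of the nesting described above, the following sketch uses plain dataclasses. Only the four class names come from the commit message; every field name and default here is a hypothetical placeholder, and the real Koza classes may be built on a different library and look quite different.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReaderConfiguration:
    # Hypothetical fields, for illustration only.
    format: str = "csv"
    files: list[str] = field(default_factory=list)


@dataclass
class TransformConfiguration:
    # Hypothetical field: where the transform code lives.
    code: Optional[str] = None


@dataclass
class WriterConfiguration:
    # Hypothetical field: the output format.
    format: str = "tsv"


@dataclass
class KozaConfiguration:
    # One container object holding the three separate configurations.
    name: str
    reader: ReaderConfiguration = field(default_factory=ReaderConfiguration)
    transform: TransformConfiguration = field(default_factory=TransformConfiguration)
    writer: WriterConfiguration = field(default_factory=WriterConfiguration)


config = KozaConfiguration(
    name="example",
    reader=ReaderConfiguration(files=["data.csv"]),
)
```

Each part can then be passed on its own to the component that needs it, for example `config.reader` to a reader.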
The big changes are:

1. Taking in a JSON{,L}ReaderConfig object for all configuration
2. Defining iteration via `__iter__()` and `yield`
First, replaces the many named parameters with a single CSVReaderConfig object. Second, uses `__iter__()` and `yield` to define iteration. Third, refactors the header consumption and validation code, and wraps accessing the header in a property on the class.
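The three points above (a single config object, iteration via `__iter__()` and `yield`, and a header property) can be sketched roughly as follows. This is not the actual Koza reader: the class names `SketchCSVReaderConfig` and `SketchCSVReader`, and the `delimiter` and `columns` fields, are assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO


@dataclass
class SketchCSVReaderConfig:
    # Hypothetical options; the real CSVReaderConfig may define others.
    delimiter: str = ","
    columns: Optional[list[str]] = None  # expected header, if known


class SketchCSVReader:
    def __init__(self, io_str: TextIO, config: SketchCSVReaderConfig):
        self.config = config
        self._reader = csv.reader(io_str, delimiter=config.delimiter)
        self._header: Optional[list[str]] = None

    @property
    def header(self) -> list[str]:
        # Consume and validate the header lazily, on first access only.
        if self._header is None:
            header = next(self._reader, None)
            if header is None:
                raise ValueError("input is empty; no header row found")
            expected = self.config.columns
            if expected is not None and header != expected:
                raise ValueError(f"unexpected header: {header}")
            self._header = header
        return self._header

    def __iter__(self) -> Iterator[dict[str, str]]:
        header = self.header
        for row in self._reader:
            # Yield one dict per data row, keyed by column name.
            yield dict(zip(header, row))


source = io.StringIO("id,name\n1,a\n2,b\n")
rows = list(SketchCSVReader(source, SketchCSVReaderConfig()))
```

Because `__iter__()` is a generator, rows are produced one at a time and the reader works directly in a `for` loop.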
This adds a new class: KozaRunner, which represents a new way of running Koza transforms. It is a work in progress and still not at feature parity with existing transforms. Essentially, the KozaRunner class takes three parameters:

1. Data (the data to be transformed)
2. A function to transform that data, either all at once or row-by-row
3. A writer that will do something with the transformed output

See the documentation in src/koza/runner.py for more details.
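To make the three-parameter shape concrete, here is a minimal sketch of a runner built from data, a transform function, and a writer. It is not the KozaRunner implementation: `SketchRunner`, `ListWriter`, and the `transform` / `transform_record` parameter names are all assumptions for illustration; src/koza/runner.py remains the authoritative description.

```python
from typing import Any, Callable, Iterable, Optional

Record = dict[str, Any]


class ListWriter:
    """Hypothetical writer that simply collects output records in memory."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def write(self, records: Iterable[Record]) -> None:
        self.records.extend(records)


class SketchRunner:
    def __init__(
        self,
        data: Iterable[Record],
        writer: ListWriter,
        transform: Optional[Callable[[Iterable[Record]], Iterable[Record]]] = None,
        transform_record: Optional[Callable[[Record], Iterable[Record]]] = None,
    ) -> None:
        # Exactly one style of transform must be supplied.
        if (transform is None) == (transform_record is None):
            raise ValueError("provide exactly one of transform or transform_record")
        self.data = data
        self.writer = writer
        self.transform = transform
        self.transform_record = transform_record

    def run(self) -> None:
        if self.transform is not None:
            # All-at-once: the function receives the whole data iterable.
            self.writer.write(self.transform(self.data))
        else:
            # Row-by-row: the function is called once per record.
            for row in self.data:
                self.writer.write(self.transform_record(row))


def uppercase_name(row: Record) -> list[Record]:
    return [{**row, "name": row["name"].upper()}]


writer = ListWriter()
SketchRunner(
    data=[{"name": "a"}, {"name": "b"}],
    writer=writer,
    transform_record=uppercase_name,
).run()
```

Keeping the writer behind a small `write()` interface means the same runner can be tested with an in-memory writer and run with a file-backed one.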
This commit makes multiple changes to koza.io.utils.open_resource:

- Adds support for opening tar files.
- Handles archives (zip and tar) in the same way that the old `file_archive` source configuration did: it assumes all files in an archive are of the same format (CSV, JSONL, etc.). Allowing only certain files in an archive to be handled will likely be future work.
- Checks for gzip compression more robustly than by looking for a `.gz` extension.
- open_resource() now returns one or more SizedResource objects, which report the size of the resource being opened and provide a `.tell()` method that gives the current read position in that resource. This will be necessary to add some sort of progress bar in the future.
- Resources downloaded from the Web now use the same logic as local files to check for compression/archives.
- Importantly, the resources returned by `open_resource` *are not automatically closed*. This was inconsistent in the previous version. It is up to the consumer of the function to explicitly close resources.
- Adds more tests for compressed and archival formats.
- Small typing changes for other koza.io.utils functions, adding Optional where appropriate.
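Two of the points above, content-based gzip detection and a resource that reports its size and read position, can be sketched as follows. The commit does not say how detection is done; checking the two-byte gzip magic number is one common approach and is an assumption here, as are the names `is_gzipped` and `SketchSizedResource` and the constructor signature.

```python
import gzip
import io
from typing import BinaryIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(fh: BinaryIO) -> bool:
    # Peek at the first two bytes, then restore the read position.
    pos = fh.tell()
    magic = fh.read(2)
    fh.seek(pos)
    return magic == GZIP_MAGIC


class SketchSizedResource:
    """Hypothetical stand-in for SizedResource: a text reader plus size info."""

    def __init__(self, name: str, raw: BinaryIO, size: int) -> None:
        self.name = name
        self.size = size  # size of the underlying (possibly compressed) bytes
        self._raw = raw
        stream = gzip.GzipFile(fileobj=raw) if is_gzipped(raw) else raw
        self.reader = io.TextIOWrapper(stream, encoding="utf-8")

    def tell(self) -> int:
        # Position within the underlying bytes, comparable to self.size.
        return self._raw.tell()

    def close(self) -> None:
        # Not closed automatically: the consumer must call this explicitly.
        self.reader.close()
        self._raw.close()


payload = gzip.compress(b"id,name\n1,a\n")
resource = SketchSizedResource("example.csv.gz", io.BytesIO(payload), len(payload))
text = resource.reader.read()
position = resource.tell()
resource.close()
```

Comparing `tell()` against `size` on the compressed bytes is what would let a progress bar advance without knowing the decompressed length in advance.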
This was not working correctly with the discriminated union field
I realized at some point that creating a map from a reader file is just a type of transform. This change in the configuration makes achieving that possible. A map transform is just a transform that relies on two additional configuration keys: `key` and `values`. To make passing those values in a YAML config possible, this commit makes it so that any extra fields in the configuration are parsed into an `extra_fields` field in a transform.
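The idea of collecting undeclared configuration keys into `extra_fields` can be sketched with a plain dataclass. Only `extra_fields`, `key`, and `values` come from the commit message; `SketchTransformConfig`, its `code` and `module` fields, and the `from_dict` helper are hypothetical, and the real configuration classes may implement this differently.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class SketchTransformConfig:
    # Hypothetical declared fields.
    code: Optional[str] = None
    module: Optional[str] = None
    # Anything not declared above ends up here.
    extra_fields: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "SketchTransformConfig":
        known = {f.name for f in fields(cls)} - {"extra_fields"}
        declared = {k: v for k, v in raw.items() if k in known}
        extra = {k: v for k, v in raw.items() if k not in known}
        return cls(**declared, extra_fields=extra)


# As it might arrive after parsing a YAML config for a map transform.
parsed = {"code": "transform.py", "key": "id", "values": ["column_b"]}
transform_config = SketchTransformConfig.from_dict(parsed)
```

A map transform could then read `transform_config.extra_fields["key"]` and `transform_config.extra_fields["values"]` without the base configuration needing to know about them.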
This makes config creation more lenient. Note that this means it's possible to have an empty transform. The lack of a transform would be detected when a KozaRunner is run.
Also remove unnecessary `files=[]` calls, since that is the default as of eaff691.
This allows a transform to be defined as a module (resolvable from the Python import path), e.g. `mypackage.transforms.example_transform`, rather than having to define it as a file (`/home/user/code/mypackage/transforms/example_transform.py`). This allows the possibility of creating generic transforms that can be packaged, installed, and re-used, without having to track down the filename of the Python file where the transform code is located.
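The difference between the two ways of locating transform code can be sketched with the standard library's `importlib`. This is only an illustration of the general mechanism, not Koza's loader: the function name `load_transform_module` and the rule of treating anything ending in `.py` as a file path are assumptions.

```python
import importlib
import importlib.util
from types import ModuleType


def load_transform_module(ref: str) -> ModuleType:
    if ref.endswith(".py"):
        # File path: build a module from the file's location.
        spec = importlib.util.spec_from_file_location("sketch_transform", ref)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load transform from {ref}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    # Dotted name: resolve through the normal import system (sys.path).
    return importlib.import_module(ref)


# A standard-library module stands in for an installed transform package.
module = load_transform_module("json")
```

With the dotted form, an installed package works from any working directory, since resolution no longer depends on where the source file happens to live.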
This commit builds on the changes in a60c607, bfa87d3, and eaff691. It fully implements the mapping functionality that was present in the previous method of writing transforms, although with a new API. Instead of being given a large dict-of-dicts with mappings defined for terms, a transform uses a lookup method on the KozaTransform object it receives, where a map lookup is done like so:

    def transform(koza: KozaTransform):
        term = "example"
        mapped_term = koza.lookup(term, "column_b")

...where the map was loaded from a CSV file that might look like this:

    id,column_a,column_b
    example,alias1,alias2

...resulting in mapped_term evaluating to `"alias2"`.
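The lookup described above can be reproduced outside Koza with a small standalone sketch. It is not the `koza.lookup` implementation: `build_map` and `lookup` are hypothetical helpers, the CSV is read from a string instead of a file, and the behaviour for a missing term (a `KeyError` here) is an assumption, since the commit message does not specify it.

```python
import csv
import io

MAP_CSV = "id,column_a,column_b\nexample,alias1,alias2\n"


def build_map(csv_text: str, key: str, values: list[str]) -> dict[str, dict[str, str]]:
    # Index each row by its key column, keeping only the requested value columns.
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: {col: row[col] for col in values} for row in reader}


def lookup(mapping: dict[str, dict[str, str]], term: str, column: str) -> str:
    # Assumption: an unknown term or column raises KeyError.
    return mapping[term][column]


mapping = build_map(MAP_CSV, key="id", values=["column_a", "column_b"])
mapped_term = lookup(mapping, "example", "column_b")
```

Here `key` and `values` play the same role as the two extra configuration keys of a map transform: one names the column to index by, the other the columns available for lookup.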
[WIP description]