
MNT: Refactoring changes to CSV adapter + CSVArrayAdapter #803

Open · wants to merge 19 commits into base: main
Conversation

genematx (Contributor) commented Oct 29, 2024

  • Allow CSVAdapter to accept kwargs for pd.read_csv (e.g. the separator)
  • Remove the dataframe_adapter property from CSVAdapter
  • Allow simple instantiation with either a single uri or a list of uris
  • Specify the multipart/related;type=text/csv mimetype
  • Experimental: Add a new CSVArrayAdapter backed by an ArrayAdapter (instead of a TableAdapter). It can be used to load homogeneous numerical arrays stored as csv files. The two adapters are intended to be distinguished by mimetype: "text/csv;header=present" for tables, and "text/csv;header=absent" for arrays.
  • Unify the classmethod interfaces across the internal adapters (with from_assets and from_uris being the two primary methods). A usage sketch follows this list.
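
A hypothetical usage sketch of the refactored adapters (the module path, file URIs, and exact keyword names are assumptions for illustration, not taken from this PR):

from tiled.adapters.csv import CSVAdapter, CSVArrayAdapter

# Tabular data ("text/csv;header=present"): a single URI or a list of URIs,
# with extra keyword arguments forwarded to pd.read_csv (e.g. a non-default separator).
table = CSVAdapter.from_uris("file:///data/measurements.csv", sep=";")

# Homogeneous numerical data ("text/csv;header=absent"), backed by an ArrayAdapter
# rather than a TableAdapter.
frames = CSVArrayAdapter.from_uris(
    ["file:///data/frame_000.csv", "file:///data/frame_001.csv"]
)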

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

@genematx genematx marked this pull request as ready for review October 29, 2024 18:46
@genematx genematx changed the title MNT: small refactoring changes to CSV adapter MNT: Refactoring changes to CSV adapter + CSVArrayAdapter Jan 14, 2025
danielballan (Member) commented Jan 16, 2025

I see two structural changes here:

  1. In main, we have registries that map mimetypes to any callable that returns an Adapter. In this PR, the registries must map to an Adapter class. These Adapter classes adhere to an interface with several methods, used for different purposes:
    i. "I want to create a new, empty dataset of this format, to write into it." (init_storage)
    ii. "I found some files in this format that I want to introspect, to register them in my catalog database." (from_uris)
    iii. "I previously introspected or wrote some files, and now I want to access them based on this information from my catalog database." (from_assets)
    As before, __init__, the "single privileged entrypoint", expects exactly what it needs and nothing more, and its signature varies from class to class. (A sketch of this interface follows the list below.)

  2. There is an additional change involved in (iii), from_assets. In this PR, it expects a flat list of Asset objects, which may point to:
    i. Files of mixed types and roles, such as TIFFs in an ordered sequence mixed with "sidecar" JSON or YAML metadata files
    ii. Files that must be tracked by the database for accounting purposes but are not actually used directly by the Adapter (e.g. HDF5 "data" files backing a virtual dataset)
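
For concreteness, a minimal sketch of the interface described in (1). The three method names come from this PR; the class name and signatures here are illustrative only:

class ExampleAdapter:
    def __init__(self, data_uri, structure=None, metadata=None, specs=None):
        # The "single privileged entrypoint": takes exactly what this class needs,
        # so the parameters vary from one Adapter class to another.
        ...

    @classmethod
    def init_storage(cls, data_uri, structure):
        # (i) Create a new, empty dataset of this format, to write into it.
        ...

    @classmethod
    def from_uris(cls, *data_uris):
        # (ii) Introspect files found in this format, to register them in a catalog database.
        ...

    @classmethod
    def from_assets(cls, assets, structure, metadata=None, specs=None):
        # (iii) Reconstruct an Adapter from information previously recorded in the
        # catalog database (the Assets belonging to a DataSource).
        ...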

Some time ago, we encountered examples like these (and others) in the wild. We did not like the look of the resulting Adapters:
(a) Their signatures did not explain to the caller what was expected: just "list of URIs".
(b) The Adapter code had to assign roles, distinguish scalars from lists, and potentially sort lists. That seemed wasteful and error-prone. When we introspect the files (ii), we already know this information; we should write it down at that time and use it!

Thus, we added columns to the Asset--DataSource many-to-many relation table that encode the role and order of each Asset. To be specific, with each row in the many-to-many relation table we store, in addition to asset_id and data_source_id, a parameter and a num:

| parameter | num   | What to do                                                           |
|-----------|-------|----------------------------------------------------------------------|
| <name>    | NULL  | Adapter.from_assets(<name>=asset.data_uri)                           |
| <name>    | <int> | Adapter.from_assets(<name>=[asset.data_uri, ...])                    |
| NULL      | NULL  | Adapter.from_assets(...)  # do not include this asset in parameters  |
| NULL      | <int> | Disallowed by database triggers                                      |
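
For example (the file names and parameter names here are made up), a DataSource whose relation rows are (parameter="data_uris", num=0), (parameter="data_uris", num=1), (parameter="metadata_uri", num=NULL), and (parameter=NULL, num=NULL) would translate to:

Adapter.from_assets(
    data_uris=["file:///scan/frame_000.tif", "file:///scan/frame_001.tif"],
    metadata_uri="file:///scan/metadata.json",
    # The fourth asset (e.g. a backing HDF5 data file) is tracked in the database
    # but not passed to the Adapter at all.
)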

I really like having standard, separate paths for introspection (ii) and construction from DataSource/Asset (iii). That's a huge improvement.

I see in the implementation of from_assets a regression that (a) puts us back to homogeneous, vague signatures (assets: list[Asset]) which mask diverse, specific requirements and (b) leaves the Adapters redoing work (and perhaps making or missing errors) that was already done and recorded when the data was registered or written, as the case may be.


Now on shakier ground, just brainstorming...

Can we get the best of both worlds? As a starting point, a convenience function like this might work:

from collections import defaultdict

def from_node_and_data_source(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_node_and_data_source(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.parameter is None:
            # This asset is tracked for accounting only; do not pass it to the Adapter.
            continue
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)

(A further convenience could wrap this and figure out the adapter_cls automatically by looking up data_source.mimetype in an adapter_by_mimetype registry, i.e. f(node, data_source) -> Adapter.)
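
A sketch of that further convenience, assuming a dict-like adapter_by_mimetype registry and the helper above (the function name is made up):

def from_node_and_data_source_auto(node: Node, data_source: DataSource) -> Adapter:
    # Resolve the Adapter class from the mimetype registry, then delegate.
    adapter_cls = adapter_by_mimetype[data_source.mimetype]
    return from_node_and_data_source(adapter_cls, node, data_source)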

Thus, we would drop the from_assets constructor in favor of using the explicit cls.__init__, which declares in its signature precisely what it expects: whether it be one URI, a list of URIs, or multiple different URIs/lists with different formats and/or roles to play.

We would retain two great aspects of this PR:

  • an easy way to construct Adapters from data_sources and their assets
  • a clear separation between the introspection/registration path and the "from assets" path

There may be better implementations of these goals available, but I hope this is a promising example.
