Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(datafile): use len(obj) rather than obj.get_nrecords() #2215

Merged
merged 1 commit into from
Jun 11, 2024

Conversation

mwtoews
Copy link
Contributor

@mwtoews mwtoews commented Jun 7, 2024

This PR has a few aims related to data files, including FormattedHeadFile, HeadFile and CellBudgetFile.

These files have a "number of records" property that was implemented with get_nrecords(). This "length of the object" measure is more naturally done with __len__, i.e. len(headsfile).

It is advised to prefer the len() approach, so instances of get_nrecords() for these files show a DeprecationWarning.

The CellBudgetFile also has a .nrecords property. It is also advised to show a DeprecationWarning with this property.

This PR also fixes a bug with get_nrecords() shown here:

def get_nrecords(self):
if isinstance(self.recordarray, np.recarray):
return self.recordarray.shape[0]
return 0

the bug is that recordarray is a structured array (np.ndarray), not a record array, so this always silently returns 0. This bug does not apply to CellBudgetFile, which worked fine and matches obj.nrecords.

Fixing this bug caused this test to fail, since the for-loop was never activated. A "todo" note is added since the reversed header is not the same as the original header.

Another aim of this PR is to re-organize a few CellBudgetFile tests from test_binaryfile.py to test_cellbudgetfile.py. Most of this is copied with perhaps minor simplifications.

Note that none of these changes apply to flopy.utils.swroutputfile.SwrBudget.get_nrecords(), which returns a tuple.


There is also still room for discussion if get_nrecords() or a dynamic .nrecords property should be preferred over len(obj). This PR can be adjusted accordingly. Opinions welcome!

@langevin-usgs
Copy link
Contributor

I'm in favor of len(obj). Sadly, len doesn't correspond to the number of times the data is available, but rather the number of records, which are written by layer with DIS/DISV. Whatever we can do to make our objects as pythonic as possible seems to be a step in the right direction. The recordarray misnomer is something that has been propagated for years due to a misunderstanding on our (my) part regarding the differences between structured arrays and record arrays -- something to keep in mind with a needed refactoring.

@wpbonelli
Copy link
Member

Sadly, len doesn't correspond to the number of times the data is available, but rather the number of records, which are written by layer with DIS/DISV.

Agreed len() suggests to me the number of rows in tabular data, as in numpy and pandas. A couple thoughts on this (hopefully not too off-base — maybe I have misunderstood something)

  • Flopy's approach to model outputs is to provide classes wrapping MODFLOW/etc output files, which may contain several tables of different shapes. This could be seen as a leaky abstraction. Does flopy really need to reproduce MODFLOW output file structure in user-facing APIs?
  • Flopy uses "record" to mean "table of some kind of data", where it commonly means a single row of tabular data, so "number of records"/"length of the object" really means "number of separate tables in this output file"

Maybe a future flopy could consider static routines to read model outputs e.g.

flopy.read_binary_file("path/to/file.cbc", data="FLOW RIGHT FACE")

File classes could live on under the hood, e.g. for indexing/caching, precision detection, etc, but it seems ideal to try to hide this and return one table at a time? The user would still need to know which sort of data can be found in which output file, but even that seems like it could be hidden with suitable shortcuts on model.output.

In the meantime I like len(obj) as sugar for get_nrecords() though I wonder if there is some way to emphasize this is not a row (i.e. time) count

@mwtoews
Copy link
Contributor Author

mwtoews commented Jun 11, 2024

Reply to a few points of discussion...

Agreed len() suggests to me the number of rows in tabular data, as in numpy and pandas.

Xref to #2221 which adds a .headers data frame with this number of headers.

Maybe a future flopy could consider static routines to read model outputs e.g.

flopy.read_binary_file("path/to/file.cbc", data="FLOW RIGHT FACE")

With CellBudget files, a somewhat expensive component is to build an index of headers, which is ideally done once, then re-used to get data from more than one component. This process wouldn't work well with static routines.

In the meantime I like len(obj) as sugar for get_nrecords() though I wonder if there is some way to emphasize this is not a row (i.e. time) count

I was considering keeping/adding obj.nrecords to be the same as len(obj), but the docstrings for __len__ should be sufficient without adding another property.

@wpbonelli
Copy link
Member

@mwtoews can you resolve the datafile.py conflict? then we can bring this in

@mwtoews mwtoews force-pushed the fix-nrecords-to-len branch from 5025802 to 2e97b8e Compare June 11, 2024 21:31
@wpbonelli wpbonelli merged commit e2d16df into modflowpy:develop Jun 11, 2024
24 checks passed
@mwtoews mwtoews deleted the fix-nrecords-to-len branch June 11, 2024 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants