handling of data where each value has a unit label #267

smason · 2025-01-02T13:49:30Z

Some background: I've been trying to work with a data file whose values are labelled with units. For example, one column contains "5.2 t" and "72 kg". Other values are more awkward: they are ranges rather than a single well-known value, such as "72.5–89 cm" and "7–9 m".

I've had a few issues; in the simple case of non-ranged values, any missing data gets assigned a "dimensionless" unit and then fails to convert to the correct type. For example:

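import io
import pandas as pd
import pint_pandas  # importing registers the "pint[...]" dtypes with pandas

# the second row has no mass value
data = """mass
1,1lb
2,
"""
pd.read_csv(io.StringIO(data), dtype=dict(mass="pint[kg]"))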

fails on the second row with DimensionalityError: Cannot convert from 'dimensionless' (dimensionless) to 'pound' ([mass]).

As a minor issue I've noticed that the resulting unit comes from the first row rather than the requested unit. In the above code, removing the second row results in a dtype of pint[pound] rather than pint[kilogram].

When trying to deal with the ranged values, I've been using a regex to split these values apart and then get them back into lower and upper columns. Getting them out with a simple regex is easy, but I then struggle to reproduce the nice unit parsing behavior given by read_csv. For example:

pd.Series(["100kg", "1 t"], dtype="pint[t]")

doesn't fail, but the values remain as strings, so subsequent numerical operations fail. A kind Stack Overflow user suggested applying the scalar constructor first:

pd.Series(["100kg", "1 t"]).apply(ureg.Quantity).astype("pint[t]")

which works here, but feels sub-optimal. I should note that this also fails for missing values similarly to the read_csv example above.
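For reference, the regex split of ranged values mentioned earlier might look roughly like the following. This is a sketch under assumptions, not the code actually used: `RANGE_RE` and `split_range` are hypothetical names, only unsigned decimal numbers are handled, and a single value is treated as a range whose lower and upper bounds coincide.

```python
import re

# a number, an optional "–" or "-" followed by a second number, then the unit text
RANGE_RE = re.compile(
    r"^\s*(?P<lower>\d+(?:\.\d+)?)"
    r"(?:\s*[–-]\s*(?P<upper>\d+(?:\.\d+)?))?"
    r"\s*(?P<unit>[^\d\s].*?)\s*$"
)


def split_range(text):
    """Hypothetical helper: return (lower, upper, unit) for e.g. "72.5–89 cm"."""
    match = RANGE_RE.match(text)
    if match is None:
        raise ValueError(f"cannot parse {text!r}")
    lower = float(match["lower"])
    # assumption: a non-ranged value is a degenerate range
    upper = float(match["upper"]) if match["upper"] is not None else lower
    return lower, upper, match["unit"]


for raw in ["72.5–89 cm", "7–9 m", "5.2 t"]:
    print(raw, "->", split_range(raw))
```

The unit is returned as a plain string, so the two bounds can be written to separate columns and the unit parsing left to pint afterwards.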

Not sure if this issue is the right place to report all this feedback, but I personally like having everything in one place. If you think these are worth fixing, I could have a go at pull requests for them (I'm counting at least three separate issues here).

andrewgsavage · 2025-01-02T14:17:25Z

yea go for it

I didn't realise you could specify the dtype like that in read_csv, that'd be worth adding to the docs too

smason · 2025-01-02T15:33:38Z

> yea go for it

I was kind of expecting the response to be that I'm using the library incorrectly. Interesting that's not obviously the case!

Will put tests into testsuite/test_issues.py as I go, let me know if you have other preferences!

> I didn't realise you could specify the dtype like that in read_csv, that'd be worth adding to the docs too

I like to minimize example code as much as possible

smason added a commit to smason/pint-pandas that referenced this issue Jan 2, 2025

allow missing values in columns

1244b39

partial fix for hgrecco#267 previously the code would fail with: DimensionalityError: Cannot convert from 'dimensionless' (dimensionless) to 'someunit' ([mass]).

smason mentioned this issue Jan 2, 2025

allow missing values in columns #268

Merged

5 tasks
