handling of data where each value has a unit label #267

Open
smason opened this issue Jan 2, 2025 · 2 comments


@smason
Contributor

smason commented Jan 2, 2025

Some background: I've been trying to work with a data file that has values labelled with units. For example, one column contains "5.2 t" and "72 kg". Other values are more awkward because they are ranges rather than single well-defined values, such as "72.5–89 cm" and "7–9 m".

I've had a few issues; in the simple case of non-ranged values, any missing data gets assigned a "dimensionless" unit and then fails to convert to the correct type. For example:

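import io
import pandas as pd
import pint_pandas  # importing registers the "pint[...]" dtype with pandas

data = """mass
1,1lb
2,
"""
pd.read_csv(io.StringIO(data), dtype=dict(mass="pint[kg]"))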

fails on the second row with DimensionalityError: Cannot convert from 'dimensionless' (dimensionless) to 'pound' ([mass]).

As a minor issue I've noticed that the resulting unit comes from the first row rather than the requested unit. In the above code, removing the second row results in a dtype of pint[pound] rather than pint[kilogram].
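Until missing values are handled during parsing, one possible workaround is to read the column as plain strings and only run the unit parser on cells that actually contain something. The sketch below is not part of pint-pandas; `parse_optional` and `fake_parse` are hypothetical names, and `fake_parse` is a stand-in so the snippet runs without pint installed. In real use the `parse` argument would presumably be a scalar constructor such as `ureg.Quantity`.

```python
import math


def parse_optional(value, parse):
    """Apply `parse` to a cell, passing missing values through as None.

    `parse` is any callable taking a string (assumption: in real use this
    would be something like `ureg.Quantity` from a pint UnitRegistry).
    """
    if value is None:
        return None
    # pandas represents missing cells as float NaN by default
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    if not text:
        return None
    return parse(text)


# Stand-in parser for demonstration only; it just tags the cleaned string.
def fake_parse(text):
    return ("parsed", text)


cells = ["1lb", "", None, float("nan"), " 2 kg "]
parsed = [parse_optional(cell, fake_parse) for cell in cells]
```

Missing cells come back as `None` rather than as a dimensionless quantity, so they never reach the unit conversion that raises the `DimensionalityError`.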

When trying to deal with the ranged values, I've been using a regex to split these values apart and then get them back into lower and upper columns. I can get them out with a simple regex, but then struggle to reproduce the nice unit parsing behavior given by read_csv. For example:

pd.Series(["100kg", "1 t"], dtype="pint[t]")

doesn't fail, but the values remain as strings, so subsequent numerical operations fail. A kind Stack Overflow user suggested applying the scalar constructor first:

pd.Series(["100kg", "1 t"]).apply(ureg.Quantity).astype("pint[t]")

which works here, but feels sub-optimal. I should note that this also fails for missing values similarly to the read_csv example above.
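For reference, the range-splitting step mentioned above could look roughly like the following. This is a minimal sketch using only the standard library, not the exact regex from my own code: `RANGE_RE` and `split_range` are illustrative names, negative numbers and exponents are not handled, and a single value is treated as a range whose lower and upper bounds are equal (a design choice, not something the data dictates).

```python
import re

# A number, optionally followed by an en dash or hyphen and a second number,
# then a unit label that starts with a non-digit, non-space character.
RANGE_RE = re.compile(
    r"""^\s*
    (?P<lower>\d+(?:\.\d+)?)
    (?:\s*[–-]\s*(?P<upper>\d+(?:\.\d+)?))?
    \s*
    (?P<unit>[^\d\s]\S*)
    \s*$""",
    re.VERBOSE,
)


def split_range(text):
    """Split '72.5–89 cm' into (72.5, 89.0, 'cm'); '5.2 t' into (5.2, 5.2, 't')."""
    match = RANGE_RE.match(text)
    if match is None:
        raise ValueError(f"cannot parse {text!r} as a value or range with a unit")
    lower = float(match["lower"])
    # No second number means a single value: reuse the lower bound.
    upper = float(match["upper"]) if match["upper"] is not None else lower
    return lower, upper, match["unit"]


rows = [split_range(s) for s in ["72.5–89 cm", "7–9 m", "5.2 t", "100kg"]]
```

Each tuple can then be written into separate lower and upper columns, with the unit string kept alongside so it can be handed to the unit parser afterwards.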

Not sure if this issue is the right place to report all this feedback; I personally like having everything in one place. If you think these are worth fixing, I could have a go at pull requests for them (I'm counting at least three separate issues here).

@andrewgsavage
Collaborator

yea go for it

I didn't realise you could specify the dtype like that in read_csv, that'd be worth adding to the docs too

@smason
Contributor Author

smason commented Jan 2, 2025

> yea go for it

I was kind of expecting the response to be that I'm using the library incorrectly; interesting that that's not obviously the case!

Will put tests into testsuite/test_issues.py as I go, let me know if you have other preferences!

> I didn't realise you could specify the dtype like that in read_csv, that'd be worth adding to the docs too

I like to minimize example code as much as possible

smason added a commit to smason/pint-pandas that referenced this issue Jan 2, 2025
partial fix for hgrecco#267

previously the code would fail with:

  DimensionalityError: Cannot convert from 'dimensionless' (dimensionless) to 'someunit' ([mass]).