Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle two warnings that pollute the output of sssom-py CLI #561
Handle two warnings that pollute the output of sssom-py CLI #561
Changes from 3 commits
68917a0
21019c7
a13e224
c269ece
d0051bc
e0cbe73
634b1d9
d2b244e
fbee866
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather put that somewhere when the library is initialized, so that we can be sure the option is enabled everywhere at all times and not only when the code path goes through this particular function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it to sssom/init.py, not sure this is the right thing to do (can this effect other libraries that might not want that option to be set for some reason?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally be fine having it in
sssom/init.py
. This is forcing the behaviour to apply globally, which I believe is a good thing.That being said, it is true that this could have cascading effects for people who are using
sssom
as a library rather than as a command-line tool (people who importsssom
in their own Python projects, rather than callingsssom-py
from the command line), if they also happen to be usingpandas
in the rest of their code (no problem if their only use ofpandas
is throughsssom
).If you’re not comfortable forcing the “no downcasting” behaviour to all potential users of the
sssom
Python module¹, I can think of two options:(A) Setting the option somewhere at the beginning of the
main
function insssom/cli.py
. This means the option will be enabled globally in all ofsssom
when it is used as a CLI tool, but not when it is used as an imported module in another Python project.It will then be up to the people who imports
sssom
to decide whether they themselves want to enable the “no downcasting” behaviour, by setting (or not) the option.(B) Enabling the option within
sssom
, but only where we need it and only for the duration we need it. That is, we enable it before callingreplace
here, and disable it afterwards.A context manager would come in handy here:
It could then be used whenever we have to perform an operation and need to locally make sure the option (if it is available and if it is not already set) is enabled:
This is equivalent to:
except that it is more “pythonic” and it deals gracefully with the possible absence of that option or the possibility that the option has already been enabled elsewhere.
¹ But it should be noted that this behaviour will be forced upon them sooner or later anyway, since it will be the only behaviour available in Pandas 3.0… Better for those people to start dealing with it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
infer_objects
does not work “in place” (and it doesn’t have ainplace
keyword to force it to do so), so that call does nothing since you’re not collecting the results.You’d need
As for compatibility with various versions of Pandas: the
copy
parameter was introduced in Pandas 2.0. You could leave it out:Without that
copy=False
, the default behaviour ofinfer_objects
would be to make a copy of all values in the frame, even if they are not casted to another type. But this is not a concern here since you are not keeping the original frame anyway.As for whether the downcasting is necessary to begin with, well, that would depend on the rest of the code downstream from here. For now (before this PR), the data frame is downcasted, as a side-effect of
replace
. The question is, is there some code downstream that is depending on the dataframe columns having more precise types thanobject
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it back with a pandas version check, can you see what you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but as I said you could also have just left the
copy
parameter out.For users of Pandas 1.x, this does not change anything. A copy of the non-type-casted objects will be made no matter what, that’s the only available behaviour.
For users of Pandas 3.x, this also does not change anything, since copy-on-write means that even if a copy of the non-type-casted objects is made, the actual copy would never actually take place (since we’re re-assigning the inferred data frame to the original variable).
For users of Pandas 2.x, no
copy=False
parameter means that the non-type-casted objects would be copied for nothing, which strictly speaking is a waste of time and memory. But I strongly suspect the effects on performances would be negligible.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Side note: I love the “should GitHub still exist when you read this”. :D )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I just want to do the thing that changes least about the existing behaviour - which I thought I understand form your explanations would be
copy=False
for Pandas 2.x?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do
then
copy=False
makes absolutely no difference to the result. All it does is making the operation slightly faster (which is why I included it in my example on Slack, but I now regret having done that if it led you to believe it was absolutely necessary), but the data frame you get at the end will be exactly the same as if you omitted thecopy
parameter.What does the
copy
parameter does exactly?Consider the following data frame:
Let’s create
df1
where we replace the empty cell (B2) withnp.nan
:Now let’s downcast, first with
copy=True
(default behaviour, same as nocopy
parameter at all):Then with
copy=False
:Observe that
df2_copy_true
anddf2_copy_false
are the same. So what’s the difference?The difference is that the A column in
df2_copy_false
share its data with the A column in the originaldf1
frame: A copy was not made, since there was no downcasting needed for that column.This means that if you modify in place a value in the A column of
df2_copy_false
, this will also modify the originaldf1
:By contrast,
df2_copy_true
is completely distinct from the originaldf1
. Even if the A column needed no downcast,infer_objects(copy=True)
made a complete copy of it.From the above, you should understand that you do not need to worry about getting different results depending on the presence of
copy=False
if all you do is re-assigning the output ofinfer_objects()
to the original variable – after the assignment the original data frame is no longer there anyway.So when you do this:
the only impact of the absence of
copy=False
is that you will needlessly make a temporary copy of columns that didn’t need to be copied, so it’s slightly less efficient thanWhether you should be concerned about this inefficiency is up to you. I am personally not concerned enough to think that it warrants making the code more complicated by adding this
_safe_infer_objects
intermediate, but it’s your call.In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would in fact be keen to enable copy-on-write globally (
pd.set_option('mode.copy_on_write', True)
) right now, both for the performance improvement and to start making sure that we are ready to deal with it since it will become mandatory with Pandas 3.0.But the concerns stated above for the people who use
sssom
as a Python library would obviously also apply here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, we are in fact already enabling copy-on-write:
So there’s definitely no need for
copy=False
when we callinfer_objects()
here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for taking the time to explaining this! Really much appreciated!