Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle two warnings that pollute the output of sssom-py CLI #561

Merged
merged 9 commits into from
Nov 9, 2024
2 changes: 1 addition & 1 deletion src/sssom/parsers.py
Original file line number Diff line number Diff line change
Expand Up @@ -425,7 +425,7 @@ def from_sssom_dataframe(
# This is to address: A value is trying to be set on a copy of a slice from a DataFrame
if CONFIDENCE in df.columns:
df2 = df.copy()
df2[CONFIDENCE].replace(r"^\s*$", np.nan, regex=True, inplace=True)
df2[CONFIDENCE] = df2[CONFIDENCE].replace(r"^\s*$", np.nan, regex=True)
matentzn marked this conversation as resolved.
Show resolved Hide resolved
df = df2

mapping_set = _get_mapping_set_from_df(df=df, meta=meta)
Expand Down
8 changes: 8 additions & 0 deletions src/sssom/util.py
Original file line number Diff line number Diff line change
Expand Up @@ -165,7 +165,15 @@ def from_mapping_set_document(cls, doc: MappingSetDocument) -> "MappingSetDataFr
# For pandas < 2.0.0, call 'infer_objects()' without any parameters
df = df.infer_objects()
# remove columns where all values are blank.

# Context https://github.com/pandas-dev/pandas/issues/57734
try:
pd.set_option("future.no_silent_downcasting", True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather put that somewhere when the library is initialized, so that we can be sure the option is enabled everywhere at all times and not only when the code path goes through this particular function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it to sssom/init.py, not sure this is the right thing to do (can this effect other libraries that might not want that option to be set for some reason?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally be fine having it in sssom/init.py. This is forcing the behaviour to apply globally, which I believe is a good thing.

That being said, it is true that this could have cascading effects for people who are using sssom as a library rather than as a command-line tool (people who import sssom in their own Python projects, rather than calling sssom-py from the command line), if they also happen to be using pandas in the rest of their code (no problem if their only use of pandas is through sssom).

If you’re not comfortable forcing the “no downcasting” behaviour to all potential users of the sssom Python module¹, I can think of two options:

(A) Setting the option somewhere at the beginning of the main function in sssom/cli.py. This means the option will be enabled globally in all of sssom when it is used as a CLI tool, but not when it is used as an imported module in another Python project.

It will then be up to the people who imports sssom to decide whether they themselves want to enable the “no downcasting” behaviour, by setting (or not) the option.

(B) Enabling the option within sssom, but only where we need it and only for the duration we need it. That is, we enable it before calling replace here, and disable it afterwards.

A context manager would come in handy here:

class pandas_no_silent_downcasting():

    def __init__(self):
        try:
            self._already_set = pd.get_option('future.no_silent_downcasting')
            self._supported = True
        except pd.errors.OptionError:
            self._supported = False

    def __enter__(self):
        if self._supported and not self._already_set:
            # Entering the context, set the option
            pd.set_option('future.no_silent_downcasting', True)

    def __exit__(self, type, value, tb):
        if self._supported and not self._already_set:
            # Leaving the context, unset the option
            pd.set_option('future.no_silent_downcasting', False)

It could then be used whenever we have to perform an operation and need to locally make sure the option (if it is available and if it is not already set) is enabled:

with pandas_no_silent_downcasting():
    df.replace(...)

This is equivalent to:

pd.set_option('future.no_silent_downcasting', True)
df.replace(...)
pd.set_option('future.no_silent_downcasting', False)

except that it is more “pythonic” and it deals gracefully with the possible absence of that option or the possibility that the option has already been enabled elsewhere.


¹ But it should be noted that this behaviour will be forced upon them sooner or later anyway, since it will be the only behaviour available in Pandas 3.0… Better for those people to start dealing with it now.

except KeyError:
# Option does not exist in this version of pandas
pass
df.replace("", np.nan, inplace=True)
# df.infer_objects(copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

infer_objects does not work “in place” (and it doesn’t have a inplace keyword to force it to do so), so that call does nothing since you’re not collecting the results.

You’d need

df = df.infer_objects(copy=False)

As for compatibility with various versions of Pandas: the copy parameter was introduced in Pandas 2.0. You could leave it out:

df = df.infer_objects()

Without that copy=False, the default behaviour of infer_objects would be to make a copy of all values in the frame, even if they are not casted to another type. But this is not a concern here since you are not keeping the original frame anyway.

As for whether the downcasting is necessary to begin with, well, that would depend on the rest of the code downstream from here. For now (before this PR), the data frame is downcasted, as a side-effect of replace. The question is, is there some code downstream that is depending on the dataframe columns having more precise types than object?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it back with a pandas version check, can you see what you think?

Copy link
Contributor

@gouttegd gouttegd Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but as I said you could also have just left the copy parameter out.

For users of Pandas 1.x, this does not change anything. A copy of the non-type-casted objects will be made no matter what, that’s the only available behaviour.

For users of Pandas 3.x, this also does not change anything, since copy-on-write means that even if a copy of the non-type-casted objects is made, the actual copy would never actually take place (since we’re re-assigning the inferred data frame to the original variable).

For users of Pandas 2.x, no copy=False parameter means that the non-type-casted objects would be copied for nothing, which strictly speaking is a waste of time and memory. But I strongly suspect the effects on performances would be negligible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Side note: I love the “should GitHub still exist when you read this”. :D )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I just want to do the thing that changes least about the existing behaviour - which I thought I understand form your explanations would be copy=False for Pandas 2.x?

Copy link
Contributor

@gouttegd gouttegd Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do

df = df.infer_objects(copy=False)

then copy=False makes absolutely no difference to the result. All it does is making the operation slightly faster (which is why I included it in my example on Slack, but I now regret having done that if it led you to believe it was absolutely necessary), but the data frame you get at the end will be exactly the same as if you omitted the copy parameter.

What does the copy parameter does exactly?

Consider the following data frame:

>>> df
         A   B
0  alice     1
1  bob       2
2  charlie     

Let’s create df1 where we replace the empty cell (B2) with np.nan:

>>> pd.set_option('future.no_silent_downcasting', True)
>>> df1 = df.replace("", np.nan)
>>> df1
         A    B
0  alice      1
1  bob        2
2  charlie  NaN

Now let’s downcast, first with copy=True (default behaviour, same as no copy parameter at all):

>>> df2_copy_true = df1.infer_objects(copy=True)
>>> df2_copy_true
         A     B
0  alice     1.0
1  bob       2.0
2  charlie   NaN

Then with copy=False:

>>> df2_copy_false = df1.infer_objects(copy=False)
>>> df2_copy_false
         A     B
0  alice     1.0
1  bob       2.0
2  charlie   NaN

Observe that df2_copy_true and df2_copy_false are the same. So what’s the difference?

The difference is that the A column in df2_copy_false share its data with the A column in the original df1 frame: A copy was not made, since there was no downcasting needed for that column.

This means that if you modify in place a value in the A column of df2_copy_false, this will also modify the original df1:

>>> df2_copy_false.replace("alice", "arthur", inplace=True)
>>> df2_copy_false
         A     B
0  arthur    1.0
1  bob       2.0
2  charlie   NaN
>>> df1
         A     B
0  arthur      1
1  bob         2
2  charlie   NaN

By contrast, df2_copy_true is completely distinct from the original df1. Even if the A column needed no downcast, infer_objects(copy=True) made a complete copy of it.

From the above, you should understand that you do not need to worry about getting different results depending on the presence of copy=False if all you do is re-assigning the output of infer_objects() to the original variable – after the assignment the original data frame is no longer there anyway.

So when you do this:

df = df.infer_objects()

the only impact of the absence of copy=False is that you will needlessly make a temporary copy of columns that didn’t need to be copied, so it’s slightly less efficient than

df = df.infer_objects(copy=False)

Whether you should be concerned about this inefficiency is up to you. I am personally not concerned enough to think that it warrants making the code more complicated by adding this _safe_infer_objects intermediate, but it’s your call.

In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.

Copy link
Contributor

@gouttegd gouttegd Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.

I would in fact be keen to enable copy-on-write globally (pd.set_option('mode.copy_on_write', True)) right now, both for the performance improvement and to start making sure that we are ready to deal with it since it will become mandatory with Pandas 3.0.

But the concerns stated above for the people who use sssom as a Python library would obviously also apply here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would in fact be keen to enable copy-on-write globally

Wait, we are in fact already enabling copy-on-write:

pd.options.mode.copy_on_write = True

So there’s definitely no need for copy=False when we call infer_objects() here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for taking the time to explaining this! Really much appreciated!

df.dropna(axis=1, how="all", inplace=True) # remove columns with all row = 'None'-s.

slots = _get_sssom_schema_object().dict["slots"]
Expand Down
Loading