OneHotEncoder(sparse=True) #230

Open
zachmayer opened this issue Jan 7, 2020 · 11 comments

Comments

@zachmayer

zachmayer commented Jan 7, 2020

sklearn.preprocessing.OneHotEncoder has the option sparse=True, to return the output in a scipy.sparse matrix. This can be really useful if you have categories with high cardinality.

Would it be possible to add a sparse=True option to category_encoders.one_hot.OneHotEncoder?
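For context, this is roughly what the requested behaviour looks like on the sklearn side (a minimal sketch with made-up data; note that in sklearn >= 1.2 the parameter was renamed to sparse_output):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up example; sparse=True returns a scipy.sparse matrix instead of a dense array,
# which matters when a column has many distinct levels.
df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF"]})
enc = OneHotEncoder(sparse=True)
X = enc.fit_transform(df)
print(type(X), X.shape)  # scipy.sparse CSR matrix, shape (4, 3)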

@janmotl
Collaborator

janmotl commented Jan 7, 2020

This is a valid feature request. Commits are welcome.

@zachmayer
Author

Awesome, thanks. I'll take a look at the code, and if I can easily implement it on my own, I'll take a stab at it.

@PaulWestenthanner
Collaborator

May I ask why the project wants to re-implement encoders that are already part of sklearn? I thought it was complementing sklearn, in a way, by only adding encoders that are not yet available there?

@janmotl
Collaborator

janmotl commented May 3, 2020

I see two reasons why it may be desirable:

  1. For being able to quickly compare multiple encoders. By having them all in a single package, you may reasonably expect them to use the same interface everywhere, making it easy to compare one method with another (see the sketch after this list).
  2. Support for pandas DataFrames.
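For example, a minimal sketch of point 1 with made-up data (any encoder can be swapped in with the same fit/transform calls):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
y = [1, 0, 1, 0]

# Same interface for every encoder; y is only used by the supervised ones.
for Encoder in (ce.OneHotEncoder, ce.OrdinalEncoder, ce.TargetEncoder):
    out = Encoder(cols=["color"]).fit_transform(df, y)
    print(Encoder.__name__, list(out.columns))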

@PaulWestenthanner
Collaborator

I totally agree with point 1.
For point 2 I think sklearn also supports pandas DataFrames:

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> df = pd.DataFrame([("a",), ("b", )], columns=["foo"])
>>> enc.fit(df)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=True)
>>> enc.transform(df)
<2x2 sparse matrix of type '<class 'numpy.float64'>'
        with 2 stored elements in Compressed Sparse Row format>

This is for sklearn version 0.21.2

Concerning the issue: what do you think about just wrapping sklearn's OneHotEncoder? That way we would have all of its features available, and we'd also benefit from future updates/enhancements in sklearn.
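Something along these lines, maybe (just a rough sketch; names and defaults are placeholders, not a worked-out design):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder as SklearnOneHotEncoder

class OneHotEncoder:
    """Hypothetical thin wrapper: delegate the encoding to sklearn but keep
    the category_encoders convention of DataFrame in / DataFrame out."""

    def __init__(self, cols=None, return_df=True, **sklearn_kwargs):
        self.cols = cols
        self.return_df = return_df
        # Dense output so we can hand back a DataFrame; sparse support
        # would be exactly the feature requested in this issue.
        self._enc = SklearnOneHotEncoder(sparse=False, **sklearn_kwargs)

    def fit(self, X, y=None):
        if self.cols is None:
            self.cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
        self._enc.fit(X[self.cols])
        return self

    def transform(self, X):
        encoded = self._enc.transform(X[self.cols])
        if not self.return_df:
            return encoded
        # get_feature_names was renamed to get_feature_names_out in newer sklearn
        new_cols = self._enc.get_feature_names(self.cols)
        out = pd.DataFrame(encoded, columns=new_cols, index=X.index)
        return pd.concat([X.drop(columns=self.cols), out], axis=1)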

@zachmayer
Author

category_encoders.one_hot.OneHotEncoder has 2 additional features I often use that are not in sklearn.preprocessing.OneHotEncoder (sketched below):

  1. drop_invariant=True to drop columns with zero variance (e.g. a categorical feature that is all one level).
  2. handle_missing=True to encode NaNs as their own level (rather than erroring).
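A quick sketch of both options with made-up data (the exact accepted values for handle_missing differ between category_encoders versions):

import numpy as np
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "constant": ["x", "x", "x"],        # all one level -> zero variance after encoding
    "color": ["red", np.nan, "blue"],   # contains a missing value
})

# drop_invariant=True drops the constant dummy column;
# handle_missing="indicator" gives NaN its own column instead of raising.
enc = ce.OneHotEncoder(drop_invariant=True, handle_missing="indicator")
print(enc.fit_transform(df))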

Honestly, I think it might make sense to open a PR to sklearn to port these 2 features from category_encoders.

What do you all think?

@zachmayer
Author

Also, category_encoders has category_encoders.ordinal.OrdinalEncoder while sklearn has sklearn.preprocessing.OrdinalEncoder. In this case, the category_encoders OrdinalEncoder has 3 features missing from sklearn's (sketched below):

  1. drop_invariant=True.
  2. handle_missing=True.
  3. handle_unknown=True to handle encoding for new categories.
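For example (a minimal sketch; the sentinel values used for unseen and missing levels depend on the category_encoders version):

import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "purple", None]})  # unseen level + missing value

# handle_unknown / handle_missing keep transform() from failing on the test set.
enc = ce.OrdinalEncoder(handle_unknown="value", handle_missing="value")
enc.fit(train)
print(enc.transform(test))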

What's interesting is that the sklearn OneHotEncoder has a handle_unknown option while the sklearn OrdinalEncoder does not.

One thing I really like about category_encoders is that every encoder (except hashing, which doesn't need it) has a handle_missing and a handle_unknown option. It'd be really useful to have both of these options in the sklearn encoders too.

@janmotl
Collaborator

janmotl commented May 3, 2020

Good to see the improving support for pandas DataFrames in sklearn.

> What do you think about just wrapping sklearn's OneHotEncoder? That way we would have all of its features available, and we'd also benefit from future updates/enhancements in sklearn.

I am leaving the decision up to @wdm0006. It would be necessary to:

  1. Wrap OneHotEncoder and OrdinalEncoder.
  2. Get the wrappers to pass the tests, or change the unit tests and all the remaining encoders to behave more like sklearn encoders. Possible difficulties: different/missing arguments like mapping, handle_missing, or handle_unknown, and the handling of "ordered" Categoricals from pandas (see the sketch below).
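On the last point, a small illustration with made-up data of what an "ordered" Categorical carries; an ordinal encoding that respects the declared order would need to use these categories rather than the order of appearance:

import pandas as pd

# The declared order ("low" < "medium" < "high") is metadata a wrapper would
# need to preserve, independent of the order in which the values appear.
grades = pd.Categorical(["medium", "low", "high"],
                        categories=["low", "medium", "high"], ordered=True)
df = pd.DataFrame({"grade": grades})
print(df["grade"].cat.codes.tolist())  # [1, 0, 2] -- codes follow the declared order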

The best possible outcome I can think of is adding the missing functionality into sklearn and porting category_encoders to use sklearn encoders.

Edited: @zachmayer was faster. But good to see the similarity of the ideas.

@zachmayer
Author

Should I make a feature request on sklearn to add handle_missing and handle_unknown to their cat encoders?

@zachmayer
Author

Here's a request to add handle_missing to the OneHotEncoder in sklearn: scikit-learn/scikit-learn#11996

@zachmayer
Author

I also opened an issue for OrdinalEncoder: scikit-learn/scikit-learn#17123
