OneHotEncoder(sparse=True) #230

Open
zachmayer opened this issue Jan 7, 2020 · 11 comments

Comments

@zachmayer

zachmayer commented Jan 7, 2020

sklearn.preprocessing.OneHotEncoder has the option sparse=True, to return the output in a scipy.sparse matrix. This can be really useful if you have categories with high cardinality.

Would it be possible to add a sparse=True option to category_encoders.one_hot.OneHotEncoder?
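For context, this is roughly what the requested behaviour looks like on the sklearn side (a minimal sketch with made-up data; note that in sklearn >= 1.2 the parameter was renamed to sparse_output):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up example; sparse=True returns a scipy.sparse matrix instead of a dense array,
# which matters when a column has many distinct levels.
df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF"]})
enc = OneHotEncoder(sparse=True)
X = enc.fit_transform(df)
print(type(X), X.shape)  # scipy.sparse CSR matrix, shape (4, 3)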

@janmotl
Collaborator

janmotl commented Jan 7, 2020

This is a valid feature request. Commits are welcome.

@zachmayer
Author

Awesome, thanks. I'll take a look at the code, and if I can easily implement it on my own, I'll take a stab at it.

@PaulWestenthanner
Collaborator

May I ask why the project wants to re-implement encoders that are already part of sklearn? I thought it was complementing sklearn, in a way, by only adding encoders that are not yet available there?

@janmotl
Collaborator

janmotl commented May 3, 2020

I see two reasons why it may be desirable:

  1. For being able to quickly compare multiple encoders. By having them all in a single package, you may reasonably expect them to use the same interface everywhere, making it easy to compare one method with another (see the sketch after this list).
  2. Support for pandas DataFrames.
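For example, a minimal sketch of point 1 with made-up data (any encoder can be swapped in with the same fit/transform calls):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
y = [1, 0, 1, 0]

# Same interface for every encoder; y is only used by the supervised ones.
for Encoder in (ce.OneHotEncoder, ce.OrdinalEncoder, ce.TargetEncoder):
    out = Encoder(cols=["color"]).fit_transform(df, y)
    print(Encoder.__name__, list(out.columns))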

@PaulWestenthanner
Collaborator

I totally agree with point 1.
For point 2 I think sklearn also supports pandas DataFrames:

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> df = pd.DataFrame([("a",), ("b", )], columns=["foo"])
>>> enc.fit(df)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=True)
>>> enc.transform(df)
<2x2 sparse matrix of type '<class 'numpy.float64'>'
        with 2 stored elements in Compressed Sparse Row format>

This is for sklearn version 0.21.2

Concerning the issue: what do you think about just wrapping sklearn's OneHotEncoder? That way we would have all of its features available, and we'd also benefit from future updates/enhancements in sklearn.
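Something along these lines, maybe (just a rough sketch; names and defaults are placeholders, not a worked-out design):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder as SklearnOneHotEncoder

class OneHotEncoder:
    """Hypothetical thin wrapper: delegate the encoding to sklearn but keep
    the category_encoders convention of DataFrame in / DataFrame out."""

    def __init__(self, cols=None, return_df=True, **sklearn_kwargs):
        self.cols = cols
        self.return_df = return_df
        # Dense output so we can hand back a DataFrame; sparse support
        # would be exactly the feature requested in this issue.
        self._enc = SklearnOneHotEncoder(sparse=False, **sklearn_kwargs)

    def fit(self, X, y=None):
        if self.cols is None:
            self.cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
        self._enc.fit(X[self.cols])
        return self

    def transform(self, X):
        encoded = self._enc.transform(X[self.cols])
        if not self.return_df:
            return encoded
        # get_feature_names was renamed to get_feature_names_out in newer sklearn
        new_cols = self._enc.get_feature_names(self.cols)
        out = pd.DataFrame(encoded, columns=new_cols, index=X.index)
        return pd.concat([X.drop(columns=self.cols), out], axis=1)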

@zachmayer
Author

category_encoders.one_hot.OneHotEncoder has 2 additional features I often use that are not in sklearn.preprocessing.OneHotEncoder (sketched below):

  1. drop_invariant=True to drop columns with zero variance (e.g. a categorical feature that is all one level).
  2. handle_missing=True to encode NaNs as their own level (rather than erroring).
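A quick sketch of both options with made-up data (the exact accepted values for handle_missing differ between category_encoders versions):

import numpy as np
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "constant": ["x", "x", "x"],        # all one level -> zero variance after encoding
    "color": ["red", np.nan, "blue"],   # contains a missing value
})

# drop_invariant=True drops the constant dummy column;
# handle_missing="indicator" gives NaN its own column instead of raising.
enc = ce.OneHotEncoder(drop_invariant=True, handle_missing="indicator")
print(enc.fit_transform(df))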

Honestly, I think it might make sense to open a PR to sklearn to port these 2 features from category_encoders.

What do you all think?

@zachmayer
Author

Also, category_encoders has category_encoders.ordinal.OrdinalEncoder while sklearn has sklearn.preprocessing.OrdinalEncoder. In this case, the category_encoders OrdinalEncoder has 3 features missing from sklearn's (sketched below):

  1. drop_invariant=True.
  2. handle_missing=True.
  3. handle_unknown=True to handle encoding for new categories.
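For example (a minimal sketch; the sentinel values used for unseen and missing levels depend on the category_encoders version):

import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "purple", None]})  # unseen level + missing value

# handle_unknown / handle_missing keep transform() from failing on the test set.
enc = ce.OrdinalEncoder(handle_unknown="value", handle_missing="value")
enc.fit(train)
print(enc.transform(test))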

What's interesting is that the sklearn OneHotEncoder has a handle_unknown option while the sklearn OrdinalEncoder does not.

One thing I really like about category_encoders is that every encoder (except hashing, which doesn't need it) has a handle_missing and a handle_unknown option. It'd be really useful to have both of these options in the sklearn encoders too.

@janmotl
Collaborator

janmotl commented May 3, 2020

Good to see the improving support for pandas DataFrames in sklearn.

> What do you think about just wrapping sklearn's OneHotEncoder? That way we would have all of its features available, and we'd also benefit from future updates/enhancements in sklearn.

I am leaving the decision up to @wdm0006. It would be necessary to:

  1. Wrap OneHotEncoder and OrdinalEncoder.
  2. Get the wrappers to pass the tests, or change the unit tests and all the remaining encoders to behave more like sklearn encoders. Possible difficulties: different/missing arguments like mapping, handle_missing, or handle_unknown, and the handling of "ordered" Categoricals from pandas (see the sketch below).
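On the last point, a small illustration with made-up data of what an "ordered" Categorical carries; an ordinal encoding that respects the declared order would need to use these categories rather than the order of appearance:

import pandas as pd

# The declared order ("low" < "medium" < "high") is metadata a wrapper would
# need to preserve, independent of the order in which the values appear.
grades = pd.Categorical(["medium", "low", "high"],
                        categories=["low", "medium", "high"], ordered=True)
df = pd.DataFrame({"grade": grades})
print(df["grade"].cat.codes.tolist())  # [1, 0, 2] -- codes follow the declared order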

The best possible outcome I can think of is adding the missing functionality into sklearn and porting category_encoders to use sklearn encoders.

Edited: @zachmayer was faster. But good to see the similarity of the ideas.

@zachmayer
Author

Should I make a feature request on sklearn to add handle_missing and handle_unknown to their cat encoders?

@zachmayer
Author

Here's a request to add handle_missing to the OneHotEncoder in sklearn: scikit-learn/scikit-learn#11996

@zachmayer
Author

I also opened an issue for OrdinalEncoder: scikit-learn/scikit-learn#17123
