OneHotEncoder(sparse=True) #230
Comments
This is a valid feature request. Commits are welcomed. |
Awesome, thanks. I'll take a look at the code, and if I can easily implement it on my own, I'll take a stab at it. |
May I ask why the project wants to re-implement encoders that are already part of sklearn? I thought it was complementing sklearn, in a way, by only adding encoders that are not available there yet? |
I see two reasons why it may be desirable:
|
I totally agree with point 1.
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> df = pd.DataFrame([("a",), ("b",)], columns=["foo"])
>>> enc.fit(df)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=True)
>>> enc.transform(df)
<2x2 sparse matrix of type '<class 'numpy.float64'>'
        with 2 stored elements in Compressed Sparse Row format>
This is for sklearn version 0.21.2.
Concerning the issue: what do you think about just wrapping sklearn's OneHotEncoder? That way we would have all of its features available. We'd also benefit from future updates and enhancements in sklearn. |
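For illustration, the wrapping idea could look roughly like the sketch below. The class name WrappedOneHotEncoder and its cols parameter are hypothetical and are not part of category_encoders; the sketch also assumes a recent sklearn, where the sparse keyword was renamed to sparse_output (version 1.2), so it checks which keyword the installed version accepts.

```python
import inspect

import pandas as pd
from sklearn.preprocessing import OneHotEncoder as SkOneHotEncoder


class WrappedOneHotEncoder:
    """Hypothetical thin wrapper around sklearn's OneHotEncoder.

    This is a sketch of the idea discussed above, not the
    category_encoders API.
    """

    def __init__(self, cols=None, sparse=True, handle_unknown="error"):
        self.cols = cols
        self.sparse = sparse
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        # Encode every column when none are named (an assumption of this sketch).
        self.cols_ = list(X.columns) if self.cols is None else list(self.cols)
        # sklearn 1.2 renamed `sparse` to `sparse_output`; use whichever exists.
        params = inspect.signature(SkOneHotEncoder).parameters
        sparse_kw = "sparse_output" if "sparse_output" in params else "sparse"
        self.encoder_ = SkOneHotEncoder(
            handle_unknown=self.handle_unknown, **{sparse_kw: self.sparse}
        )
        self.encoder_.fit(X[self.cols_])
        return self

    def transform(self, X):
        # Returns a scipy sparse matrix when sparse=True, else a NumPy array.
        return self.encoder_.transform(X[self.cols_])


df = pd.DataFrame({"foo": ["a", "b"]})
encoded = WrappedOneHotEncoder(cols=["foo"], sparse=True).fit(df).transform(df)
```

A wrapper like this inherits whatever sklearn offers, but anything sklearn lacks would still have to be layered on top.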
Honestly, I think it might make sense to open a PR to sklearn to port these 2 features from category_encoders. What do you all think? |
Also, category_encoders has
What's interesting is that the sklearn OneHotEncoder has an One thing I really like about category_encoders is that every encoder (except hashing, which doesn't need it) has an |
Good to see improving support for pandas DataFrames in sklearn.
I am leaving the decision up to @wdm0006. It would be necessary:
The best possible outcome I can think of is adding the missing functionality into sklearn and porting category_encoders to use sklearn encoders. Edited: @zachmayer was faster. But good to see the similarity of the ideas. |
Should I make a feature request on sklearn to add |
Here's a request to add handle_missing to the OneHotEncoder in sklearn: scikit-learn/scikit-learn#11996 |
I also opened an issue for OrdinalEncoder: scikit-learn/scikit-learn#17123 |
sklearn.preprocessing.OneHotEncoder has the option sparse=True, to return the output in a scipy.sparse matrix. This can be really useful if you have categories with high cardinality.
Would it be possible to add a sparse=True option to category_encoders.one_hot.OneHotEncoder?
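To make the high-cardinality argument concrete, here is a minimal pure-Python sketch of why a sparse representation helps. It is not how sklearn or scipy store the data internally; the helper names one_hot_sparse and to_dense are made up for this example. A one-hot matrix has exactly one nonzero per row, so storing only the (row, column) positions needs one entry per row instead of one cell per row and category.

```python
def one_hot_sparse(values):
    """Return the (row, col) positions of the 1s in the one-hot matrix,
    plus the sorted category list that defines the columns."""
    categories = sorted(set(values))
    index = {cat: j for j, cat in enumerate(categories)}
    rows = list(range(len(values)))
    cols = [index[v] for v in values]
    return rows, cols, categories


def to_dense(rows, cols, n_rows, n_cols):
    """Expand the stored positions into a full list-of-lists matrix."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j in zip(rows, cols):
        dense[i][j] = 1
    return dense


# Small example: three rows, two categories.
small_rows, small_cols, small_cats = one_hot_sparse(["a", "b", "a"])
small_dense = to_dense(small_rows, small_cols, 3, len(small_cats))

# High-cardinality example: 1,000 rows, each with its own category.
values = [f"user_{i}" for i in range(1000)]
rows, cols, categories = one_hot_sparse(values)
dense_cells = len(values) * len(categories)  # cells a dense matrix would hold
sparse_entries = len(rows)                   # positions the sparse form stores
```

In this example the dense matrix would hold 1,000 × 1,000 cells, while the sparse form stores only 1,000 positions.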