Unique levels, smoothing, and QuantileEncoder #327
That's a good question. I've read through the target-encoder paper, and there is no hard rule saying single-value categories should be assigned the prior. This was also raised by another user in #275.
Hi @bmreiniger! Long time no see. Whether unique levels of a category are a problem depends on the regularization. There are several types of regularization: leave-one-out, exponential smoothing, Gaussian noise... Quantile Encoder uses the m-estimate (also known as additive smoothing).

Eventually, in my imaginary world, the user should be able to choose the desired regularizer as a parameter.
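For reference, a minimal sketch of the m-estimate (additive smoothing) idea mentioned above; the function name is made up for illustration, and the per-level statistic is shown as a mean even though QuantileEncoder uses a quantile:

```python
import numpy as np

def m_estimate(level_values, prior, m=1.0):
    """Shrink a level's statistic towards the global prior (m-estimate smoothing)."""
    n = len(level_values)
    level_stat = np.mean(level_values)  # QuantileEncoder would use np.quantile here
    return (n * level_stat + m * prior) / (n + m)

# A level observed only once is pulled halfway towards the prior when m=1:
print(m_estimate([1.0], prior=0.3, m=1.0))  # 0.65
```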
To me, the line between a categorical level with just one observation and one with two is not so big; I wouldn't much trust the two-observation level either. So I would prefer not to have such a discontinuous treatment, especially hard-coded and out of reach of users. As a way of letting users control it, maybe a parameter for minimum level size (what I originally thought)?

I'd happily be in favor of higher regularization values by default, with a future warning for a few releases if that's required.

I like Carlos's idea of generalizing regularization methods across the supervised encoders, for the long term. I worry about exploding the number of parameters, though; maybe something like a separate regularizer class that gets passed as a parameter? That would basically leave just target, WOE, and quantile as the encoders, with m-estimate, minimum level size, CatBoost, LOO, James-Stein, and GLMM as regularizers for them?
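To make the "separate regularizer class" idea a bit more concrete, here is a minimal sketch; every name in it (Regularizer, MEstimateRegularizer, MinSamplesRegularizer, the regularizer parameter) is hypothetical and not part of the category_encoders API:

```python
from abc import ABC, abstractmethod

class Regularizer(ABC):
    """Hypothetical interface: blend a level's statistic with the global prior."""

    @abstractmethod
    def shrink(self, level_stat: float, n: int, prior: float) -> float:
        ...

class MEstimateRegularizer(Regularizer):
    def __init__(self, m: float = 1.0):
        self.m = m

    def shrink(self, level_stat, n, prior):
        return (n * level_stat + self.m * prior) / (n + self.m)

class MinSamplesRegularizer(Regularizer):
    """Fall back to the prior for levels with fewer than min_samples observations."""

    def __init__(self, min_samples: int = 20):
        self.min_samples = min_samples

    def shrink(self, level_stat, n, prior):
        return level_stat if n >= self.min_samples else prior

# An encoder could then take the strategy as a parameter, e.g.
#   TargetEncoder(regularizer=MEstimateRegularizer(m=10))
# and call regularizer.shrink(level_mean, level_count, global_mean) for each level.
```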
To the best of my knowledge, benchmarking regularization in target encodings has not been done; it's actually a non-trivial experiment. I don't have a good answer on how to handle levels with just one observation besides common regularization. Also, from an error and practical perspective: in a healthy dataset, if a level has just one instance in train, there is a low chance that it also appears in test. How much does this affect the error? Or do we care about how the algorithm is constructed? Open question.

For default hyperparameters, I am doing some experiments for a paper; let me see if I get some results that we can use. Still, further experimentation is needed (it might lead to a research paper).

For unifying supervised encoders, it might be good to sketch some diagrams to visualize the merge. A release like that might affect many users, and a fraction of users won't understand the merge. @PaulWestenthanner, what do you think?
@bmreiniger, in a recent paper (https://arxiv.org/pdf/2201.11358.pdf) we studied the impact of regularization in target encoding (see Figure 2). I was surprised to see that, for this particular case, regularization yields no improvement in model performance, or only a very minimal one (other datasets will behave differently). This is an example where high default regularization won't help the user.
From a technical point of view, I agree with @bmreiniger that making a hard cut between once-observed labels and twice (or more often) observed labels does not make too much sense. So we should increase the default values for regularisation. In terms of releasing that, I'd first add a future warning to all encoders whose default parameters will change, and only actually change them in the release after. We should also introduce a changelog / release overview / what's new page.

About unifying supervised encoders: I agree we'd need a sketch first and only then decide. I think from a user's point of view it is more explicit to have all target encoders separate rather than just one with a lot of functionality. If there are many keyword arguments specific to a particular encoder, it makes more sense to keep them separate. From a coding point of view it's obviously nice to unify things, but that change would be rather big. Maybe for some version 3.x?
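A minimal sketch of that release strategy, assuming the warning only fires when users rely on the current defaults; the class name and message wording are illustrative rather than actual category_encoders code, and the new values (20 and 10) are the ones proposed later in this thread:

```python
import warnings

class TargetEncoderLike:
    """Illustrative only: warn when the to-be-changed defaults are used implicitly."""

    def __init__(self, min_samples_leaf=None, smoothing=None):
        if min_samples_leaf is None:
            warnings.warn(
                "The default of `min_samples_leaf` will change from 1 to 20 in a "
                "future release; set it explicitly to keep the current behaviour.",
                FutureWarning,
            )
            min_samples_leaf = 1
        if smoothing is None:
            warnings.warn(
                "The default of `smoothing` will change from 1.0 to 10.0 in a "
                "future release; set it explicitly to keep the current behaviour.",
                FutureWarning,
            )
            smoothing = 1.0
        self.min_samples_leaf = min_samples_leaf
        self.smoothing = smoothing
```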
Hi guys, I'm planning to do a release shortly and add the FutureWarnings for those who currently use the default parameters.

As for new default parameters, I just did some analysis. This plot shows the value of lambda as given in the paper, plotted for n_samples from 1 to 100 for different k (= min_samples_leaf) and f (= smoothing). Based on this broad parameter range I've plotted one combination in greater detail: k=20, f=10. I would suggest k=20 and f=10 as the new default parameters.

Obviously these are hyper-parameters that need to be tuned according to the specific problem at hand, but I guess the values I give above are a good default. For completeness' sake, I'll add the code:

```python
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np

def lambda_coef(n, min_samples_leaf, smoothing):
    # shrinkage factor lambda as defined in the target-encoding paper
    return 1 / (1 + np.exp(-(n - min_samples_leaf) / smoothing))

min_samples = [1, 3, 5, 8, 13, 21, 34, 55]        # candidate k (min_samples_leaf) values
smoothing_params = [2 ** x for x in range(0, 7)]  # candidate f (smoothing) values

fig, axes = plt.subplots(len(min_samples), len(smoothing_params), figsize=(25, 25))
x_axis = range(1, 100)
for idx1, min_s in enumerate(min_samples):
    for idx2, smoothing in enumerate(smoothing_params):
        y_axis = [lambda_coef(n, min_s, smoothing) for n in x_axis]
        axes[idx1, idx2].plot(x_axis, y_axis)
        axes[idx1, idx2].set_title(f"k={min_s};f={smoothing}")
```
Very interesting :) Shouldn't this depend on the estimator used? What happens if you apply this approach to the other regularizers, m-estimate and Gaussian noise? These default hyperparameters won't apply to all encoding methods, will they?
Hi, yes, the best values will depend on the estimator used. The point of this issue is that the current default values for smoothing and min_samples_leaf are just chosen very poorly.
If the regularization hyperparameter depends on both the estimator and the dataset used, why is this analysis supposed to improve the default hyperparameters for Target Encoder?
That's because the new defaults might be bad for some encoders, but the current ones are bad for pretty much all encoders.
I understand the need for "optimal" default hyperparameters. But how do you choose an optimal default hyperparameter when it will depend on both the data and the estimator? |
For example "I would suggest k=20 and f=10 as new default parameters" why not k=10 and f=20? |
I chose it because I think the resulting values of lambda make sense. This is basically just my gut feeling saying that 50 data points are enough to trust the label.

The k=10, f=20 variant is much more likely to overfit, since a unique level is weighted with about 40% of its specific mean; and on the other end of the spectrum, if a level occurs 50 times I'd be quite happy to give it more than 90%, which k=10, f=20 does not. So that's why I like k=20, f=10 better than k=10, f=20. For reference, the current defaults k=1, f=1 give the following statistics:
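The tables with those statistics are not reproduced here, but they can be recomputed with the lambda_coef function from the snippet above; for example:

```python
# Recompute the shrinkage factor lambda for the (k, f) pairs discussed above,
# reusing lambda_coef from the earlier snippet.
for k, f in [(20, 10), (10, 20), (1, 1)]:
    for n in [1, 2, 5, 10, 20, 50]:
        print(f"k={k:2d} f={f:2d} n={n:2d} -> lambda={lambda_coef(n, k, f):.2f}")

# Roughly: k=20, f=10 gives lambda ~0.13 for a unique level and ~0.95 at n=50;
# k=10, f=20 gives ~0.39 and ~0.88; the current defaults k=1, f=1 give 0.5 for a
# unique level and exceed 0.95 after only about 4 observations.
```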
I hope my argumentation makes sense to you, even though it is not based on strict science; it is rather a heuristic on what seems reasonable and a good starting point for hyperparameter optimisation.
I see. Thanks :) I have been wondering for a while if there is a more methodological way to estimate default hyperparameters. The problem is also relevant in the scikit-learn library.
The test `./tests/test_encoders.py::TestEncoders::test_unique_column_is_not_predictive` fails for `QuantileEncoder`; that new supervised encoder wasn't added to this test.

My cursory understanding is that the other supervised encoders smooth things so that unique levels just get encoded with the prior. Is that really desired in general? It seems like a clunky exception, but if it is desired, can/should `QuantileEncoder` be adapted to do the same (@cmougan)?

Discovered while trying to refactor the test to use a supervised-encoder tagging, #326.
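As a rough illustration of the behaviour this test checks (a sketch, not the test's actual code), assuming the category_encoders TargetEncoder: when every level of a column is unique, the smoothed encoder falls back to the prior, so the encoded column is constant and carries no signal.

```python
import numpy as np
import pandas as pd
import category_encoders as ce

# A column where every level appears exactly once, like an ID column.
X = pd.DataFrame({"id_like_col": [f"level_{i}" for i in range(10)]})
y = pd.Series(np.random.RandomState(0).binomial(1, 0.3, size=10))

encoded = ce.TargetEncoder(cols=["id_like_col"]).fit_transform(X, y)

# Expected: every row is encoded with the prior, i.e. the global target mean.
print(encoded["id_like_col"].unique())
print(y.mean())
```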