Issue 481 - implement support for missing values with XGBoost #482
Tagging @styrmis @wrigleyDan for awareness.
Thank you for this! I like the approach—I've added just one comment. It was the explain output that in the end clarified the behaviour of the model inference in the plugin for me, so I think it would be worth maintaining the explicit reporting of default values used in that output if possible.
@@ -259,7 +259,7 @@ public Explanation explain(LeafReaderContext context, int doc) throws IOExceptio
 }
 featureString += ":";
 if (!explain.isMatch()) {
-    subs.add(Explanation.noMatch(featureString + " [no match, default value 0.0 used]"));
+    subs.add(Explanation.noMatch(featureString + " [no match, default value used]"));
As the feature vector implementations can return their default value (which may be NaN), could we report this here?
Without this information, determining the default would require careful inspection of the source, whereas reporting it in the explain output would be more user-friendly.
Good call! I've made changes to add what the default value is to the explanation.
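For illustration only, here is a minimal sketch of how the no-match message could include the default value. The class name, helper name, and feature label are hypothetical and are not taken from the plugin's code; only the message wording follows the diff above.

```java
/** Hypothetical helper; the plugin's real explain() code may differ. */
final class ExplainMessageSketch {

    /** Builds the no-match text, appending whatever default the feature vector uses. */
    static String noMatchMessage(String featureString, float defaultScore) {
        // String concatenation renders 0.0f as "0.0" and Float.NaN as "NaN".
        return featureString + " [no match, default value " + defaultScore + " used]";
    }

    public static void main(String[] args) {
        // "Feature 0(title_match):" is an assumed label for demonstration.
        System.out.println(noMatchMessage("Feature 0(title_match):", 0.0f));
        System.out.println(noMatchMessage("Feature 0(title_match):", Float.NaN));
    }
}
```

With a message of this shape, the explain output itself shows whether a dense (0.0) or sparse (NaN) default was applied.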
Looks good to me, thanks for that contribution!
This PR attempts to implement support for missing values with XGBoost (#481).
The main logic change concerns the initialization of the FeatureVector. Currently, the DenseFeatureVector has a default value of 0.0, and feature scores are filled in with actual values only when they are present. Effectively, missing features are given a value of 0.0. This happens "upstream," so by the time the XGBoost model (NaiveAdditiveDecisionTree) is invoked, no feature value is missing any longer.
The implementation here adds a new class, SparseFeatureVector, that gives features a default value of Float.NaN, enabling the XGBoost model to actually follow the branches for missing features.
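The difference between the two defaults can be sketched in a few lines. This is a simplified illustration, not the plugin's actual classes: `MissingValueSketch`, `buildVector`, and `route` are hypothetical names, and the single split below stands in for a node of the tree model.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the feature vectors and a single tree split. */
final class MissingValueSketch {

    /** Pre-fills every slot with the default, then overwrites the features that matched. */
    static float[] buildVector(int size, float defaultScore, int[] matchedOrdinals, float[] matchedScores) {
        float[] scores = new float[size];
        Arrays.fill(scores, defaultScore);
        for (int i = 0; i < matchedOrdinals.length; i++) {
            scores[matchedOrdinals[i]] = matchedScores[i];
        }
        return scores;
    }

    /** One split: NaN takes the configured missing branch, otherwise compare to the threshold. */
    static String route(float value, float threshold, boolean missingGoesLeft) {
        if (Float.isNaN(value)) {
            return missingGoesLeft ? "left" : "right";
        }
        return value < threshold ? "left" : "right";
    }

    public static void main(String[] args) {
        // Feature 0 matched with score 1.5; feature 1 did not match.
        float[] dense = buildVector(2, 0.0f, new int[] {0}, new float[] {1.5f});
        float[] sparse = buildVector(2, Float.NaN, new int[] {0}, new float[] {1.5f});

        // Split on feature 1 at threshold 0.5, with the missing branch pointing right.
        System.out.println(route(dense[1], 0.5f, false));  // prints "left": 0.0 is treated as a real value
        System.out.println(route(sparse[1], 0.5f, false)); // prints "right": NaN follows the missing branch
    }
}
```

With the 0.0 default, the unmatched feature is indistinguishable from a genuine score of 0.0, so the missing branch can never be taken; with NaN, the split can tell the two cases apart.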