Issue 481 - implement support for missing values with XGBoost #482

patrick-le-shopify · 2024-01-24T20:56:07Z

This PR attempts to implement support for missing values with XGBoost (#481).

The main logic change pertains to the initialization of the FeatureVector. Currently, the DenseFeatureVector has a default value of 0.0, and feature scores are overwritten with actual values only when they are present, so missing features are effectively given a value of 0.0. This happens "upstream," so by the time the XGBoost model (NaiveAdditiveDecisionTree) is invoked, no feature value is missing any longer.

The implementation here adds a new class called SparseFeatureVector that gives features a default value of Float.NaN, enabling the XGBoost model to actually follow branches where the feature is missing.
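To illustrate the difference, here is a minimal, self-contained sketch. It does not use the plugin's real classes: `MissingValueSketch`, `FeatureVector`, and `evalSplit` are hypothetical stand-ins, and the single split with a `missingGoesLeft` flag is an assumed simplification of how a tree might route a missing value.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-ins for illustration; not the plugin's classes.
final class MissingValueSketch {

    /** Fixed-size vector whose unset slots hold a configurable default score. */
    static final class FeatureVector {
        private final float[] scores;

        FeatureVector(int size, float defaultScore) {
            scores = new float[size];
            Arrays.fill(scores, defaultScore); // 0.0f mimics dense, NaN mimics sparse
        }

        void set(int featureId, float score) { scores[featureId] = score; }

        float get(int featureId) { return scores[featureId]; }
    }

    /**
     * Evaluates one split: values below the threshold go left, others go right.
     * A NaN value is treated as missing and follows the missingGoesLeft flag.
     */
    static float evalSplit(FeatureVector v, int featureId, float threshold,
                           float leftLeaf, float rightLeaf, boolean missingGoesLeft) {
        float value = v.get(featureId);
        if (Float.isNaN(value)) {
            return missingGoesLeft ? leftLeaf : rightLeaf;
        }
        return value < threshold ? leftLeaf : rightLeaf;
    }

    public static void main(String[] args) {
        // Feature 0 is never set, so each vector returns its default.
        FeatureVector dense = new FeatureVector(1, 0.0f);
        FeatureVector sparse = new FeatureVector(1, Float.NaN);

        // Default 0.0 is compared against the threshold like a real value.
        System.out.println(evalSplit(dense, 0, 0.5f, -1.0f, 2.0f, false));  // prints -1.0
        // Default NaN is routed down the branch reserved for missing values.
        System.out.println(evalSplit(sparse, 0, 0.5f, -1.0f, 2.0f, false)); // prints 2.0
    }
}
```

With a default of 0.0 the unset feature is indistinguishable from a genuine score of 0.0, so the missing branch can never be taken. With a NaN default it can.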

patrick-le-shopify · 2024-01-24T20:58:28Z

Tagging @styrmis @wrigleyDan for awareness.

styrmis

Thank you for this! I like the approach—I've added just one comment. It was the explain output that in the end clarified the behaviour of the model inference in the plugin for me, so I think it would be worth maintaining the explicit reporting of default values used in that output if possible.

styrmis · 2024-01-25T10:20:52Z

src/main/java/com/o19s/es/ltr/query/RankerQuery.java

@@ -259,7 +259,7 @@ public Explanation explain(LeafReaderContext context, int doc) throws IOExceptio
                }
                featureString += ":";
                if (!explain.isMatch()) {
-                    subs.add(Explanation.noMatch(featureString + " [no match, default value 0.0 used]"));
+                    subs.add(Explanation.noMatch(featureString + " [no match, default value used]"));


As the feature vector implementations can return their default value (which may be NaN), could we report this here?

Without this information, determining the default would require careful inspection of the source, whereas reporting it in the explain output would be more user-friendly.

patrick-le-shopify · 2024-01-25

Good call! I've made changes to add the default value to the explanation.
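As a rough sketch of what reporting the default could look like, the snippet below builds the no-match description from the vector's default score. The exact wording used in the merged change is not shown in this thread. `ExplainDefaultSketch`, `noMatchDescription`, and the sample feature string are hypothetical, and plain strings are used in place of Lucene's `Explanation` to keep the example self-contained.

```java
// Hypothetical helper for illustration; not the code merged in this PR.
final class ExplainDefaultSketch {

    /** Builds a no-match description that names the default score that was used. */
    static String noMatchDescription(String featureString, float defaultScore) {
        // Float-to-String conversion renders Float.NaN as "NaN" and 0.0f as "0.0".
        return featureString + " [no match, default value " + defaultScore + " used]";
    }

    public static void main(String[] args) {
        // "Feature 0(title_match):" is an assumed example, not taken from the plugin.
        System.out.println(noMatchDescription("Feature 0(title_match):", 0.0f));
        System.out.println(noMatchDescription("Feature 0(title_match):", Float.NaN));
    }
}
```

Passing the default in, rather than hard-coding `0.0`, keeps the message accurate for whichever feature vector implementation produced the score.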

wrigleyDan

Looks good to me, thanks for that contribution!

patrick-le-shopify added 4 commits January 24, 2024 10:57

sparse feature vector, naive decision tree

da938c1

add integration test

c891cf7

turn back on testLog tests

7cadf2d

minor edits

03a86b2

patrick-le-shopify changed the title ~~Issue 481~~ Issue 481 - implement support for missing values with XGBoost Jan 24, 2024

styrmis reviewed Jan 25, 2024

patrick-le-shopify added 3 commits January 25, 2024 10:42

add default value to explanation

57d9b5f

lint

6e2559c

add tests

5a166fb

wrigleyDan reviewed Jan 31, 2024

wrigleyDan merged commit d2bbe8b into o19s:main Jan 31, 2024
1 check passed
