[python-package][PySpark] Expose Training and Validation Metrics #11133

Merged
trivialfis merged 5 commits into dmlc:master on Jan 13, 2025

Conversation

ayoub317
Contributor

Closes #11132

ayoub317 force-pushed the expose-metrics branch 9 times, most recently from 3d7161a to 99fc349, on December 29, 2024 20:53


@dataclass
class _XGBoostTrainingSummary:
Contributor

Could this be XGBoostTrainingSummary (without the leading underscore)?

Contributor Author

DONE
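
For readers following the naming discussion, here is a minimal sketch of what such a summary dataclass might look like. Only the field name validation_objective_history appears in this thread (in the tests). The class name TrainingSummarySketch, the field train_objective_history, the from_metrics helper, and the shape of the metrics dict are all assumptions made for illustration, not the merged implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrainingSummarySketch:
    """Hypothetical container for per-round metric values (not the PR's class)."""

    # Each dict maps a metric name (e.g. "rmse") to one value per boosting round.
    train_objective_history: Dict[str, List[float]] = field(default_factory=dict)
    validation_objective_history: Dict[str, List[float]] = field(default_factory=dict)

    @staticmethod
    def from_metrics(
        metrics: Dict[str, Dict[str, List[float]]]
    ) -> "TrainingSummarySketch":
        # Assumption: `metrics` is keyed by the evaluation-set names
        # "training" and "validation" used in this PR's eval list.
        return TrainingSummarySketch(
            train_objective_history=metrics.get("training", {}),
            validation_objective_history=metrics.get("validation", {}),
        )


# Training-only run: the validation history stays empty.
summary = TrainingSummarySketch.from_metrics({"training": {"rmse": [0.9, 0.7, 0.6]}})
```

With a layout like this, a test can check that the validation history is empty when no validation set was supplied, which is what the assertion quoted later in this thread does.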

@@ -0,0 +1,43 @@
"""Xgboost training summary integration submodule."""
Contributor

How about renaming xgboost_training_summary.py to summary.py?

Contributor Author

DONE

Contributor

wbo4958 commented Jan 3, 2025

Hi @ayoub317, could you add some unit tests for this feature?

Contributor Author

ayoub317 commented Jan 3, 2025

Hi @wbo4958,

Yes, with pleasure! Could you please give me some pointers on where I should define these tests?

I had a quick look at the test codebase, and I would assume the following:

The tests for the regressor and classifier could be placed under:
xgboost -> tests -> test_distributed -> test_with_spark -> test_spark_local.py -> in the class TestPySparkLocal

The test for the ranker could be placed under:
xgboost -> tests -> test_distributed -> test_with_spark -> test_spark_local.py -> in the class TestPySparkLocalLETOR

Does that sound like a good choice?

Thanks!

Contributor

wbo4958 commented Jan 3, 2025

The paths you pasted should be fine for adding the new tests.

ayoub317 force-pushed the expose-metrics branch 8 times, most recently from 5828a0c to da80def, on January 4, 2025 23:30
Contributor Author

ayoub317 commented Jan 5, 2025

Thanks @wbo4958! I pushed another commit adding the tests. Any feedback is welcome.

@@ -1148,7 +1151,7 @@ def _train_booster(
         if dvalid is not None:
             dval = [(dtrain, "training"), (dvalid, "validation")]
         else:
-            dval = None
+            dval = [(dtrain, "training")]
Contributor

@trivialfis, could you check whether it is OK to enable this by default?

Contributor

Hi @trivialfis, I wonder if this default eval dataset is necessary?

Member

Will look into this.

Member

Looks good to me, it's unlikely someone will train a model without any evaluation.
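
The behaviour change under review can be sketched as a small standalone function: the training set is always part of the evaluation list, and the validation set is appended only when one exists. The function name build_eval_list is hypothetical, and plain strings stand in for the DMatrix objects so the sketch runs without xgboost installed.

```python
from typing import Any, List, Optional, Tuple


def build_eval_list(dtrain: Any, dvalid: Optional[Any]) -> List[Tuple[Any, str]]:
    """Return the (data, name) pairs evaluated on every boosting round."""
    # After this PR the training set is evaluated by default ...
    evals = [(dtrain, "training")]
    # ... and the validation set is added only if the user supplied one.
    if dvalid is not None:
        evals.append((dvalid, "validation"))
    return evals


# Placeholder strings stand in for real DMatrix objects (assumption).
train_only = build_eval_list("train-data", None)
train_and_valid = build_eval_list("train-data", "valid-data")
```

The trade-off raised in the thread is that evaluating the training set on every round adds some work even for users who never read the metrics; the reviewers judged that acceptable.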


from .test_spark_local import spark as spark_local

logging.getLogger("py4j").setLevel(logging.INFO)
Contributor

Is this for debugging?

Contributor Author

Yes, and since it was also set in test_spark_local.py, I kept it. Would you prefer that we remove it?

@@ -0,0 +1,233 @@
import logging
Contributor

I'm wondering if we could move the tests from this file into the existing test_spark_local.py and reuse the existing test data.

Contributor Author

Yes, I can move them there without much effort. Let me know if you'd like me to proceed with that.
However, in my humble opinion, it's better to keep them in this separate file, and here's the rationale:
The test_spark_local.py file already exceeds 1800 lines of code, which makes it increasingly difficult to read, maintain, and navigate. As new features are added to PySpark XGBoost, this file will only continue to grow, compounding the problem.
I think refactoring the tests to organize them by key feature, rather than bundling everything under the TestPySparkLocal class, would be a better long-term approach.

If we decide to keep the tests in this file, I can either leave the examples here as they are or, as you suggested, import them from test_spark_local.py for better modularity and data reuse. Another option is to store all shared data in a separate file, allowing both test_spark_local.py and test_xgboost_summary.py to import what they need from it.

Let me know what you think; I have no strong opinion on this.

Contributor

Yeah, that's a good point. Originally, I wanted to separate the tests per estimator (XGBoostClassifier/Regressor/Ranker) instead of per feature, so that the same dataset can be shared across different features.

Contributor Author

Perfect! I like that approach more than what I proposed. I might work on it sometime soon if no one else does, especially since I have some other features I would like us to introduce.

assert not xgb_model.training_summary.validation_objective_history

@staticmethod
def assert_non_empty_training_objective_history(
Contributor

I'm wondering if we could get the evaluation results from xgboost itself and the training summary from xgboost-pyspark on the same dataset, and then check that they are equal. Some tests in test_spark_local.py already do the same kind of comparison.

Contributor Author

ayoub317, Jan 6, 2025

Yes, absolutely. Thank you for pointing this out! I tested this on a simple DataFrame locally, and the results matched perfectly. We should definitely add such tests; I'll take care of that!
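
The comparison proposed above can be sketched as a small helper that checks two metric histories against each other with a floating-point tolerance. The helper name histories_match, the dict layout (metric name mapped to per-round values), and the sample numbers are assumptions for illustration; in a real test, one side would come from training with plain xgboost and the other from the PySpark model's training summary.

```python
import math
from typing import Dict, List


def histories_match(
    expected: Dict[str, List[float]],
    actual: Dict[str, List[float]],
    rel_tol: float = 1e-6,
) -> bool:
    """Return True if both histories track the same metrics with close values."""
    if expected.keys() != actual.keys():
        return False
    for metric, expected_values in expected.items():
        actual_values = actual[metric]
        # The number of boosting rounds recorded must agree.
        if len(expected_values) != len(actual_values):
            return False
        # Compare round by round, allowing for tiny floating-point drift.
        if not all(
            math.isclose(e, a, rel_tol=rel_tol)
            for e, a in zip(expected_values, actual_values)
        ):
            return False
    return True


# Made-up numbers standing in for the two sources being compared.
reference = {"rmse": [0.90, 0.70, 0.60]}
from_summary = {"rmse": [0.90, 0.70, 0.6000000001]}
same = histories_match(reference, from_summary)
```

A tolerance-based check is used here rather than strict equality because distributed and single-node training may not produce bit-identical floats; whether the merged tests use a tolerance is not stated in this thread.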

ayoub317 force-pushed the expose-metrics branch 2 times, most recently from 1e98e82 to f36603a, on January 10, 2025 21:42
@ayoub317
Contributor Author

Thanks for the review @wbo4958. I added a new commit to address what we discussed.
It seems the failures in the CI are not related to this PR.

Contributor

wbo4958 left a comment

Hi @ayoub317, thanks for bringing the summary feature to xgboost-pyspark.

LGTM.

Member

trivialfis left a comment

Looks good to me.

@trivialfis
Member

Let me fix the CI first (#11162).

@trivialfis trivialfis merged commit 461d27c into dmlc:master Jan 13, 2025
58 checks passed