
Merge branch 'scikit-learn:main' into submodulev3
adam2392 authored Oct 31, 2023
2 parents f5c68b0 + a5fed0d commit 5211846
Showing 19 changed files with 722 additions and 143 deletions.
2 changes: 1 addition & 1 deletion doc/metadata_routing.rst
@@ -252,6 +252,7 @@ Meta-estimators and functions supporting metadata routing:

- :class:`sklearn.calibration.CalibratedClassifierCV`
- :class:`sklearn.compose.ColumnTransformer`
- :class:`sklearn.feature_selection.SelectFromModel`
- :class:`sklearn.linear_model.ElasticNetCV`
- :class:`sklearn.linear_model.LarsCV`
- :class:`sklearn.linear_model.LassoCV`
@@ -290,7 +291,6 @@ Meta-estimators and tools not supporting metadata routing yet:
- :class:`sklearn.ensemble.VotingRegressor`
- :class:`sklearn.feature_selection.RFE`
- :class:`sklearn.feature_selection.RFECV`
- :class:`sklearn.feature_selection.SelectFromModel`
- :class:`sklearn.feature_selection.SequentialFeatureSelector`
- :class:`sklearn.impute.IterativeImputer`
- :class:`sklearn.linear_model.RANSACRegressor`
8 changes: 7 additions & 1 deletion doc/modules/feature_selection.rst
@@ -130,7 +130,13 @@ repeated on the pruned set until the desired number of features to select is
eventually reached.

:class:`RFECV` performs RFE in a cross-validation loop to find the optimal
number of features.
number of features. In more detail, the number of features selected is tuned
automatically by fitting an :class:`RFE` selector on the different
cross-validation splits (provided by the `cv` parameter). The performance
of the :class:`RFE` selector is evaluated using `scorer` for different numbers
of selected features and aggregated together. Finally, the scores are averaged
across folds and the number of features selected is set to the number of
features that maximizes the cross-validation score.
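
For illustration, a minimal sketch of this procedure (using a synthetic
dataset and a logistic regression base estimator, both chosen purely for the
example) could look like::

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    # Synthetic data and base estimator chosen purely for illustration.
    X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                               random_state=0)

    # An RFE selector is fitted on each of the 5 cross-validation splits; the
    # accuracy scores are averaged across folds for each candidate number of
    # features, and the number of features maximizing the score is kept.
    selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                     scoring="accuracy")
    selector.fit(X, y)
    print(selector.n_features_)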

.. topic:: Examples:

21 changes: 18 additions & 3 deletions doc/whats_new/v1.4.rst
@@ -35,6 +35,9 @@ random sampling procedures.
solvers (when fit on the same data again). The amount of change depends on the
specified `tol`, for small values you will get more precise results.

- |Fix| Fixes a memory leak seen in PyPy for estimators using the Cython loss functions.
:pr:`27670` by :user:`Guillaume Lemaitre <glemaitre>`.

Changes impacting all modules
-----------------------------

@@ -107,6 +110,10 @@ more details.
``**score_params`` which are passed to the underlying scorer.
:pr:`26525` by :user:`Omar Salman <OmarManzoor>`.

- |Feature| :class:`feature_selection.SelectFromModel` now supports metadata
routing in `fit` and `partial_fit`.
:pr:`27490` by :user:`Stefanie Senger <StefanieSenger>`.
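
  For instance, a minimal sketch of such routing (assuming metadata routing is
  enabled globally, and using a `LogisticRegression` sub-estimator with
  `sample_weight` purely for illustration) could look like::

    import numpy as np
    import sklearn
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    sklearn.set_config(enable_metadata_routing=True)

    # Illustrative data; any estimator consuming `sample_weight` would work.
    rng = np.random.RandomState(0)
    X = rng.normal(size=(20, 5))
    y = rng.randint(0, 2, size=20)
    sample_weight = np.ones(20)

    # The sub-estimator declares that it consumes `sample_weight`;
    # SelectFromModel then routes it from its own `fit` to the sub-estimator.
    base = LogisticRegression().set_fit_request(sample_weight=True)
    selector = SelectFromModel(base).fit(X, y, sample_weight=sample_weight)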

- |Feature| :class:`linear_model.OrthogonalMatchingPursuitCV` now supports
metadata routing. Its `fit` now accepts ``**fit_params``, which are passed to
the underlying splitter. :pr:`27500` by :user:`Stefanie Senger
@@ -130,8 +137,8 @@ and classes are impacted:

**Functions:**

- :func:`cluster.compute_optics_graph` in :pr:`27250` by
:user:`Yao Xiao <Charlie-XIAO>`;
- :func:`cluster.compute_optics_graph` in :pr:`27104` by
:user:`Maren Westermann <marenwestermann>` and in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :func:`cluster.kmeans_plusplus` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :func:`decomposition.non_negative_factorization` in :pr:`27100` by
:user:`Isaac Virshup <ivirshup>`;
@@ -153,7 +160,8 @@ and classes are impacted:
- :class:`cluster.HDBSCAN` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`cluster.KMeans` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :class:`cluster.MiniBatchKMeans` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :class:`cluster.OPTICS` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`cluster.OPTICS` in :pr:`27104` by
:user:`Maren Westermann <marenwestermann>` and in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`decomposition.NMF` in :pr:`27100` by :user:`Isaac Virshup <ivirshup>`;
- :class:`decomposition.MiniBatchNMF` in :pr:`27100` by
:user:`Isaac Virshup <ivirshup>`;
@@ -467,6 +475,13 @@ Changelog
which can be used to check whether a given set of parameters would be consumed.
:pr:`26831` by `Adrin Jalali`_.

- |Enhancement| Make :func:`sklearn.utils.check_array` attempt to output
`int32`-indexed CSR and COO arrays when converting from DIA arrays if the number of
non-zero entries is small enough. This ensures that estimators implemented in Cython
that do not accept `int64`-indexed sparse data structures now consistently
accept the same sparse input formats for SciPy sparse matrices and arrays.
:pr:`27372` by :user:`Guillaume Lemaitre <glemaitre>`.
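
  As a rough illustration of this behavior (using a tiny identity matrix purely
  as an example)::

    import numpy as np
    from scipy import sparse

    from sklearn.utils import check_array

    # Tiny DIA matrix used purely for illustration.
    X_dia = sparse.dia_matrix(np.eye(4))

    # DIA is not in `accept_sparse`, so the input is converted to CSR; with
    # this few non-zero entries the resulting indices can stay `int32`.
    X_csr = check_array(X_dia, accept_sparse=["csr"])
    print(X_csr.format, X_csr.indices.dtype)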

- |Fix| :func:`sklearn.utils.check_array` now accepts both matrices and arrays from
the sparse SciPy module. The previous implementation would fail when `copy=True`
because it called NumPy's `np.may_share_memory`, which does not work with SciPy sparse
6 changes: 6 additions & 0 deletions examples/multiclass/README.txt
@@ -0,0 +1,6 @@
.. _multiclass_examples:

Multiclass methods
------------------

Examples concerning the :mod:`sklearn.multiclass` module.
201 changes: 201 additions & 0 deletions examples/multiclass/plot_multiclass_overview.py
@@ -0,0 +1,201 @@
"""
===============================================
Overview of multiclass training meta-estimators
===============================================
In this example, we discuss the problem of classification when the target
variable is composed of more than two classes. This is called multiclass
classification.
In scikit-learn, all estimators support multiclass classification out of the
box: the most sensible strategy was implemented for the end-user. The
:mod:`sklearn.multiclass` module implements various strategies that one can use
for experimenting or developing third-party estimators that only support binary
classification.
:mod:`sklearn.multiclass` includes OvO/OvR strategies used to train a
multiclass classifier by fitting a set of binary classifiers (the
:class:`~sklearn.multiclass.OneVsOneClassifier` and
:class:`~sklearn.multiclass.OneVsRestClassifier` meta-estimators). This example
will review them.
"""

# %%
# The Yeast UCI dataset
# ---------------------
#
# In this example, we use a UCI dataset [1]_, generally referred to as the Yeast
# dataset. We use the :func:`sklearn.datasets.fetch_openml` function to load
# the dataset from OpenML.
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=181, as_frame=True, return_X_y=True, parser="pandas")

# %%
# To know the type of data science problem we are dealing with, we can check
# the target for which we want to build a predictive model.
y.value_counts().sort_index()

# %%
# We see that the target is discrete and composed of 10 classes. We therefore
# deal with a multiclass classification problem.
#
# Strategies comparison
# ---------------------
#
# In the following experiment, we use a
# :class:`~sklearn.tree.DecisionTreeClassifier` and a
# :class:`~sklearn.model_selection.RepeatedStratifiedKFold` cross-validation
# with 3 splits and 5 repetitions.
#
# We compare the following strategies:
#
# * :class:`~sklearn.tree.DecisionTreeClassifier` can handle multiclass
# classification without needing any special adjustments. It works by breaking
# down the training data into smaller subsets and focusing on the most common
# class in each subset. By repeating this process, the model can accurately
# classify input data into multiple different classes.
# * :class:`~sklearn.multiclass.OneVsOneClassifier` trains a set of binary
# classifiers where each classifier is trained to distinguish between
# two classes.
# * :class:`~sklearn.multiclass.OneVsRestClassifier`: trains a set of binary
# classifiers where each classifier is trained to distinguish between
# one class and the rest of the classes.
# * :class:`~sklearn.multiclass.OutputCodeClassifier`: trains a set of binary
# classifiers where each classifier is trained to distinguish between
# a set of classes from the rest of the classes. The set of classes is
# defined by a codebook, which is randomly generated in scikit-learn. This
# method exposes a parameter `code_size` to control the size of the codebook.
# We set it above one since we are not interested in compressing the class
# representation.
import pandas as pd

from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.multiclass import (
    OneVsOneClassifier,
    OneVsRestClassifier,
    OutputCodeClassifier,
)
from sklearn.tree import DecisionTreeClassifier

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
ovo_tree = OneVsOneClassifier(tree)
ovr_tree = OneVsRestClassifier(tree)
ecoc = OutputCodeClassifier(tree, code_size=2)

cv_results_tree = cross_validate(tree, X, y, cv=cv, n_jobs=2)
cv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2)
cv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2)
cv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2)

# %%
# We can now compare the statistical performance of the different strategies
# by plotting the distribution of their cross-validated test scores.
from matplotlib import pyplot as plt

scores = pd.DataFrame(
    {
        "DecisionTreeClassifier": cv_results_tree["test_score"],
        "OneVsOneClassifier": cv_results_ovo["test_score"],
        "OneVsRestClassifier": cv_results_ovr["test_score"],
        "OutputCodeClassifier": cv_results_ecoc["test_score"],
    }
)
ax = scores.plot.kde(legend=True)
ax.set_xlabel("Accuracy score")
ax.set_xlim([0, 0.7])
_ = ax.set_title(
    "Density of the accuracy scores for the different multiclass strategies"
)

# %%
# At first glance, we can see that the built-in strategy of the decision
# tree classifier works quite well. The one-vs-one and the error-correcting
# output code strategies work even better. However, the
# one-vs-rest strategy does not work as well as the other strategies.
#
# Indeed, these results reproduce findings reported in the literature,
# such as in [2]_. However, the story is not as simple as it seems.
#
# The importance of hyperparameters search
# ----------------------------------------
#
# It was later shown in [3]_ that the multiclass strategies show similar
# scores if the hyperparameters of the base classifiers are first optimized.
#
# Here, we try to reproduce such a result by at least optimizing the depth of
# the base decision tree.
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [3, 5, 8]}
tree_optimized = GridSearchCV(tree, param_grid=param_grid, cv=3)
ovo_tree = OneVsOneClassifier(tree_optimized)
ovr_tree = OneVsRestClassifier(tree_optimized)
ecoc = OutputCodeClassifier(tree_optimized, code_size=2)

cv_results_tree = cross_validate(tree_optimized, X, y, cv=cv, n_jobs=2)
cv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2)
cv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2)
cv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2)

scores = pd.DataFrame(
    {
        "DecisionTreeClassifier": cv_results_tree["test_score"],
        "OneVsOneClassifier": cv_results_ovo["test_score"],
        "OneVsRestClassifier": cv_results_ovr["test_score"],
        "OutputCodeClassifier": cv_results_ecoc["test_score"],
    }
)
ax = scores.plot.kde(legend=True)
ax.set_xlabel("Accuracy score")
ax.set_xlim([0, 0.7])
_ = ax.set_title(
    "Density of the accuracy scores for the different multiclass strategies"
)

plt.show()

# %%
# We can see that once the hyperparameters are optimized, all multiclass
# strategies have similar performance as discussed in [3]_.
#
# Conclusion
# ----------
#
# We can get some intuition about these results.
#
# First, the reason why one-vs-one and error-correcting output code
# outperform the tree when the hyperparameters are not optimized is that they
# ensemble a larger number of classifiers, and the ensembling improves the
# generalization performance. This is somewhat similar to why a bagging
# classifier generally performs better than a single decision tree if no care
# is taken to optimize the hyperparameters.
#
# Then, we see the importance of optimizing the hyperparameters. Indeed, they
# should be regularly explored when developing predictive models, even if
# techniques such as ensembling help reduce this impact.
#
# Finally, it is important to recall that the estimators in scikit-learn
# are developed with a specific strategy to handle multiclass classification
# out of the box, so for these estimators there is no need to use different
# strategies. The strategies are mainly useful for third-party estimators
# supporting only binary classification. In all cases, we also showed that
# the hyperparameters should be optimized.
#
# References
# ----------
#
# .. [1] https://archive.ics.uci.edu/ml/datasets/Yeast
#
# .. [2] `"Reducing multiclass to binary: A unifying approach for margin classifiers."
# Allwein, Erin L., Robert E. Schapire, and Yoram Singer.
# Journal of machine learning research 1
# Dec (2000): 113-141.
# <https://www.jmlr.org/papers/volume1/allwein00a/allwein00a.pdf>`_.
#
# .. [3] `"In defense of one-vs-all classification."
# Journal of Machine Learning Research 5
# Jan (2004): 101-141.
# <https://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf>`_.
