
Merge branch 'scikit-learn:main' into submodulev3
adam2392 authored Oct 31, 2023
2 parents f5c68b0 + a5fed0d commit 5211846
Showing 19 changed files with 722 additions and 143 deletions.
2 changes: 1 addition & 1 deletion doc/metadata_routing.rst
@@ -252,6 +252,7 @@ Meta-estimators and functions supporting metadata routing:

- :class:`sklearn.calibration.CalibratedClassifierCV`
- :class:`sklearn.compose.ColumnTransformer`
- :class:`sklearn.feature_selection.SelectFromModel`
- :class:`sklearn.linear_model.ElasticNetCV`
- :class:`sklearn.linear_model.LarsCV`
- :class:`sklearn.linear_model.LassoCV`
@@ -290,7 +291,6 @@ Meta-estimators and tools not supporting metadata routing yet:
- :class:`sklearn.ensemble.VotingRegressor`
- :class:`sklearn.feature_selection.RFE`
- :class:`sklearn.feature_selection.RFECV`
- :class:`sklearn.feature_selection.SelectFromModel`
- :class:`sklearn.feature_selection.SequentialFeatureSelector`
- :class:`sklearn.impute.IterativeImputer`
- :class:`sklearn.linear_model.RANSACRegressor`
8 changes: 7 additions & 1 deletion doc/modules/feature_selection.rst
@@ -130,7 +130,13 @@ repeated on the pruned set until the desired number of features to select is
eventually reached.

:class:`RFECV` performs RFE in a cross-validation loop to find the optimal
number of features.
number of features. In more detail, the number of features selected is tuned
automatically by fitting an :class:`RFE` selector on the different
cross-validation splits (provided by the `cv` parameter). The performance
of the :class:`RFE` selector is evaluated using `scorer` for different numbers
of selected features and aggregated together. Finally, the scores are averaged
across folds and the number of features selected is set to the number of
features that maximizes the cross-validation score.
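
For illustration, a minimal sketch of this procedure (using a synthetic
dataset and a logistic regression base estimator, both chosen purely for the
example) could look like::

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    # Synthetic data and base estimator chosen purely for illustration.
    X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                               random_state=0)

    # An RFE selector is fitted on each of the 5 cross-validation splits; the
    # accuracy scores are averaged across folds for each candidate number of
    # features, and the number of features maximizing the score is kept.
    selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                     scoring="accuracy")
    selector.fit(X, y)
    print(selector.n_features_)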

.. topic:: Examples:

21 changes: 18 additions & 3 deletions doc/whats_new/v1.4.rst
@@ -35,6 +35,9 @@ random sampling procedures.
solvers (when fit on the same data again). The amount of change depends on the
specified `tol`, for small values you will get more precise results.

- |Fix| Fixes a memory leak seen in PyPy for estimators using the Cython loss functions.
:pr:`27670` by :user:`Guillaume Lemaitre <glemaitre>`.

Changes impacting all modules
-----------------------------

@@ -107,6 +110,10 @@ more details.
``**score_params`` which are passed to the underlying scorer.
:pr:`26525` by :user:`Omar Salman <OmarManzoor>`.

- |Feature| :class:`feature_selection.SelectFromModel` now supports metadata
routing in `fit` and `partial_fit`.
:pr:`27490` by :user:`Stefanie Senger <StefanieSenger>`.
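
  For instance, a minimal sketch of such routing (assuming metadata routing is
  enabled globally, and using a `LogisticRegression` sub-estimator with
  `sample_weight` purely for illustration) could look like::

    import numpy as np
    import sklearn
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    sklearn.set_config(enable_metadata_routing=True)

    # Illustrative data; any estimator consuming `sample_weight` would work.
    rng = np.random.RandomState(0)
    X = rng.normal(size=(20, 5))
    y = rng.randint(0, 2, size=20)
    sample_weight = np.ones(20)

    # The sub-estimator declares that it consumes `sample_weight`;
    # SelectFromModel then routes it from its own `fit` to the sub-estimator.
    base = LogisticRegression().set_fit_request(sample_weight=True)
    selector = SelectFromModel(base).fit(X, y, sample_weight=sample_weight)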

- |Feature| :class:`linear_model.OrthogonalMatchingPursuitCV` now supports
metadata routing. Its `fit` now accepts ``**fit_params``, which are passed to
the underlying splitter. :pr:`27500` by :user:`Stefanie Senger
@@ -130,8 +137,8 @@ and classes are impacted:

**Functions:**

- :func:`cluster.compute_optics_graph` in :pr:`27250` by
:user:`Yao Xiao <Charlie-XIAO>`;
- :func:`cluster.compute_optics_graph` in :pr:`27104` by
:user:`Maren Westermann <marenwestermann>` and in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :func:`cluster.kmeans_plusplus` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :func:`decomposition.non_negative_factorization` in :pr:`27100` by
:user:`Isaac Virshup <ivirshup>`;
@@ -153,7 +160,8 @@ and classes are impacted:
- :class:`cluster.HDBSCAN` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`cluster.KMeans` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :class:`cluster.MiniBatchKMeans` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :class:`cluster.OPTICS` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`cluster.OPTICS` in :pr:`27104` by
:user:`Maren Westermann <marenwestermann>` and in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`decomposition.NMF` in :pr:`27100` by :user:`Isaac Virshup <ivirshup>`;
- :class:`decomposition.MiniBatchNMF` in :pr:`27100` by
:user:`Isaac Virshup <ivirshup>`;
@@ -467,6 +475,13 @@ Changelog
which can be used to check whether a given set of parameters would be consumed.
:pr:`26831` by `Adrin Jalali`_.

- |Enhancement| Make :func:`sklearn.utils.check_array` attempt to output
`int32`-indexed CSR and COO arrays when converting from DIA arrays if the number of
non-zero entries is small enough. This ensures that estimators implemented in Cython
that do not accept `int64`-indexed sparse data structures now consistently
accept the same sparse input formats for SciPy sparse matrices and arrays.
:pr:`27372` by :user:`Guillaume Lemaitre <glemaitre>`.
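
  As a rough illustration of this behavior (using a tiny identity matrix purely
  as an example)::

    import numpy as np
    from scipy import sparse

    from sklearn.utils import check_array

    # Tiny DIA matrix used purely for illustration.
    X_dia = sparse.dia_matrix(np.eye(4))

    # DIA is not in `accept_sparse`, so the input is converted to CSR; with
    # this few non-zero entries the resulting indices can stay `int32`.
    X_csr = check_array(X_dia, accept_sparse=["csr"])
    print(X_csr.format, X_csr.indices.dtype)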

- |Fix| :func:`sklearn.utils.check_array` now accepts both matrices and arrays from
the sparse SciPy module. The previous implementation would fail when `copy=True`
because it called NumPy's `np.may_share_memory`, which does not work with SciPy sparse
6 changes: 6 additions & 0 deletions examples/multiclass/README.txt
@@ -0,0 +1,6 @@
.. _multiclass_examples:

Multiclass methods
------------------

Examples concerning the :mod:`sklearn.multiclass` module.
201 changes: 201 additions & 0 deletions examples/multiclass/plot_multiclass_overview.py
@@ -0,0 +1,201 @@
"""
===============================================
Overview of multiclass training meta-estimators
===============================================
In this example, we discuss the problem of classification when the target
variable is composed of more than two classes. This is called multiclass
classification.
In scikit-learn, all estimators support multiclass classification out of the
box: the most sensible strategy was implemented for the end-user. The
:mod:`sklearn.multiclass` module implements various strategies that one can use
for experimenting or developing third-party estimators that only support binary
classification.
:mod:`sklearn.multiclass` includes OvO/OvR strategies used to train a
multiclass classifier by fitting a set of binary classifiers (the
:class:`~sklearn.multiclass.OneVsOneClassifier` and
:class:`~sklearn.multiclass.OneVsRestClassifier` meta-estimators). This example
will review them.
"""

# %%
# The Yeast UCI dataset
# ---------------------
#
# In this example, we use a UCI dataset [1]_, generally referred to as the Yeast
# dataset. We use the :func:`sklearn.datasets.fetch_openml` function to load
# the dataset from OpenML.
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=181, as_frame=True, return_X_y=True, parser="pandas")

# %%
# To know the type of data science problem we are dealing with, we can check
# the target for which we want to build a predictive model.
y.value_counts().sort_index()

# %%
# We see that the target is discrete and composed of 10 classes. We therefore
# deal with a multiclass classification problem.
#
# Strategies comparison
# ---------------------
#
# In the following experiment, we use a
# :class:`~sklearn.tree.DecisionTreeClassifier` and a
# :class:`~sklearn.model_selection.RepeatedStratifiedKFold` cross-validation
# with 3 splits and 5 repetitions.
#
# We compare the following strategies:
#
# * :class:`~sklearn.tree.DecisionTreeClassifier` can handle multiclass
# classification without needing any special adjustments. It works by breaking
# down the training data into smaller subsets and focusing on the most common
# class in each subset. By repeating this process, the model can accurately
# classify input data into multiple different classes.
# * :class:`~sklearn.multiclass.OneVsOneClassifier` trains a set of binary
# classifiers where each classifier is trained to distinguish between
# two classes.
# * :class:`~sklearn.multiclass.OneVsRestClassifier`: trains a set of binary
# classifiers where each classifier is trained to distinguish between
# one class and the rest of the classes.
# * :class:`~sklearn.multiclass.OutputCodeClassifier`: trains a set of binary
# classifiers where each classifier is trained to distinguish between
# a set of classes from the rest of the classes. The set of classes is
# defined by a codebook, which is randomly generated in scikit-learn. This
# method exposes a parameter `code_size` to control the size of the codebook.
# We set it above one since we are not interested in compressing the class
# representation.
import pandas as pd

from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.multiclass import (
    OneVsOneClassifier,
    OneVsRestClassifier,
    OutputCodeClassifier,
)
from sklearn.tree import DecisionTreeClassifier

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
ovo_tree = OneVsOneClassifier(tree)
ovr_tree = OneVsRestClassifier(tree)
ecoc = OutputCodeClassifier(tree, code_size=2)

cv_results_tree = cross_validate(tree, X, y, cv=cv, n_jobs=2)
cv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2)
cv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2)
cv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2)

# %%
# We can now compare the statistical performance of the different strategies
# by plotting the distribution of their cross-validated test scores.
from matplotlib import pyplot as plt

scores = pd.DataFrame(
    {
        "DecisionTreeClassifier": cv_results_tree["test_score"],
        "OneVsOneClassifier": cv_results_ovo["test_score"],
        "OneVsRestClassifier": cv_results_ovr["test_score"],
        "OutputCodeClassifier": cv_results_ecoc["test_score"],
    }
)
ax = scores.plot.kde(legend=True)
ax.set_xlabel("Accuracy score")
ax.set_xlim([0, 0.7])
_ = ax.set_title(
    "Density of the accuracy scores for the different multiclass strategies"
)

# %%
# At first glance, we can see that the built-in strategy of the decision
# tree classifier works quite well. The one-vs-one and the error-correcting
# output code strategies work even better. However, the
# one-vs-rest strategy does not work as well as the other strategies.
#
# Indeed, these results reproduce findings reported in the literature,
# such as in [2]_. However, the story is not as simple as it seems.
#
# The importance of hyperparameters search
# ----------------------------------------
#
# It was later shown in [3]_ that the multiclass strategies show similar
# scores if the hyperparameters of the base classifiers are first optimized.
#
# Here, we try to reproduce such a result by at least optimizing the depth of
# the base decision tree.
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [3, 5, 8]}
tree_optimized = GridSearchCV(tree, param_grid=param_grid, cv=3)
ovo_tree = OneVsOneClassifier(tree_optimized)
ovr_tree = OneVsRestClassifier(tree_optimized)
ecoc = OutputCodeClassifier(tree_optimized, code_size=2)

cv_results_tree = cross_validate(tree_optimized, X, y, cv=cv, n_jobs=2)
cv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2)
cv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2)
cv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2)

scores = pd.DataFrame(
    {
        "DecisionTreeClassifier": cv_results_tree["test_score"],
        "OneVsOneClassifier": cv_results_ovo["test_score"],
        "OneVsRestClassifier": cv_results_ovr["test_score"],
        "OutputCodeClassifier": cv_results_ecoc["test_score"],
    }
)
ax = scores.plot.kde(legend=True)
ax.set_xlabel("Accuracy score")
ax.set_xlim([0, 0.7])
_ = ax.set_title(
    "Density of the accuracy scores for the different multiclass strategies"
)

plt.show()

# %%
# We can see that once the hyperparameters are optimized, all multiclass
# strategies have similar performance as discussed in [3]_.
#
# Conclusion
# ----------
#
# We can get some intuition about these results.
#
# First, the reason why one-vs-one and error-correcting output code
# outperform the tree when the hyperparameters are not optimized is that they
# ensemble a larger number of classifiers, and the ensembling improves the
# generalization performance. This is somewhat similar to why a bagging
# classifier generally performs better than a single decision tree if no care
# is taken to optimize the hyperparameters.
#
# Then, we see the importance of optimizing the hyperparameters. Indeed, they
# should be regularly explored when developing predictive models, even if
# techniques such as ensembling help reduce this impact.
#
# Finally, it is important to recall that the estimators in scikit-learn
# are developed with a specific strategy to handle multiclass classification
# out of the box, so for these estimators there is no need to use different
# strategies. The strategies are mainly useful for third-party estimators
# supporting only binary classification. In all cases, we also showed that
# the hyperparameters should be optimized.
#
# References
# ----------
#
# .. [1] https://archive.ics.uci.edu/ml/datasets/Yeast
#
# .. [2] `"Reducing multiclass to binary: A unifying approach for margin classifiers."
# Allwein, Erin L., Robert E. Schapire, and Yoram Singer.
# Journal of machine learning research 1
# Dec (2000): 113-141.
# <https://www.jmlr.org/papers/volume1/allwein00a/allwein00a.pdf>`_.
#
# .. [3] `"In defense of one-vs-all classification."
# Journal of Machine Learning Research 5
# Jan (2004): 101-141.
# <https://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf>`_.
