Skip to content

Commit

Permalink
FIX Make decision tree pickles deterministic (scikit-learn#27580)
Browse files Browse the repository at this point in the history
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
  • Loading branch information
lesteve and ogrisel authored Oct 17, 2023
1 parent ebe4c7e commit caeb09e
Show file tree
Hide file tree
Showing 3 changed files with 33 additions and 1 deletion.
17 changes: 17 additions & 0 deletions doc/whats_new/v1.3.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,23 @@

.. currentmodule:: sklearn

.. _changes_1_3_2:

Version 1.3.2
=============

**October 2023**

Changelog
---------

:mod:`sklearn.tree`
...................

- |Fix| Do not leak data via non-initialized memory in decision tree pickle files and make
the generation of those files deterministic. :pr:`27580` by :user:`Loïc Estève <lesteve>`.


.. _changes_1_3_1:

Version 1.3.1
Expand Down
4 changes: 3 additions & 1 deletion sklearn/tree/_tree.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -908,11 +908,13 @@ cdef class Tree:
safe_realloc(&self.nodes, capacity)
safe_realloc(&self.value, capacity * self.value_stride)

# value memory is initialised to 0 to enable classifier argmax
if capacity > self.capacity:
# value memory is initialised to 0 to enable classifier argmax
memset(<void*>(self.value + self.capacity * self.value_stride), 0,
(capacity - self.capacity) * self.value_stride *
sizeof(float64_t))
# node memory is initialised to 0 to ensure deterministic pickle (padding in Node struct)
memset(<void*>(self.nodes + self.capacity), 0, (capacity - self.capacity) * sizeof(Node))

# if capacity smaller than node_count, adjust the counter
if capacity < self.node_count:
Expand Down
13 changes: 13 additions & 0 deletions sklearn/tree/tests/test_tree.py
Original file line number Diff line number Diff line change
Expand Up @@ -2601,3 +2601,16 @@ def test_sample_weight_non_uniform(make_data, Tree):
tree_samples_removed.fit(X[1::2, :], y[1::2])

assert_allclose(tree_samples_removed.predict(X), tree_with_sw.predict(X))


def test_deterministic_pickle():
# Non-regression test for:
# https://github.com/scikit-learn/scikit-learn/issues/27268
# Uninitialised memory would lead to the two pickle strings being different.
tree1 = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
tree2 = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

pickle1 = pickle.dumps(tree1)
pickle2 = pickle.dumps(tree2)

assert pickle1 == pickle2

0 comments on commit caeb09e

Please sign in to comment.