ENH Splitter Injection and Refactoring of DepthFirstTreeBuilder's building mechanism #67

Open
wants to merge 38 commits into
base: submodulev3
8c09f7f
init split condition injection
SamuelCarliles3 Feb 16, 2024
ecfc9b1
wip
SamuelCarliles3 Feb 16, 2024
0c3d5c0
wip
SamuelCarliles3 Feb 16, 2024
5fd12a2
wip
SamuelCarliles3 Feb 20, 2024
b593ee0
injection progress
SamuelCarliles3 Feb 27, 2024
180fac3
injection progress
SamuelCarliles3 Feb 27, 2024
c207c3e
split injection refactoring
SamuelCarliles3 Feb 27, 2024
7cc71c1
added condition parameter passthrough prototype
SamuelCarliles3 Feb 29, 2024
2470d49
some tidying
SamuelCarliles3 Feb 29, 2024
ee3399f
more tidying
SamuelCarliles3 Feb 29, 2024
a079e4f
splitter injection refactoring
SamuelCarliles3 Mar 10, 2024
5397b66
cython injection due diligence, converted min_sample and monotonic_cs…
SamuelCarliles3 Mar 15, 2024
44f1d57
tree tests pass huzzah!
SamuelCarliles3 Mar 18, 2024
4f19d53
added some splitconditions to header
SamuelCarliles3 Mar 18, 2024
cb71be0
commented out some sample code that was substantially increasing peak…
SamuelCarliles3 Mar 21, 2024
e34be5c
added vector resize
SamuelCarliles3 Apr 9, 2024
aac802e
wip
SamuelCarliles3 Apr 10, 2024
c12f2fd
Merge branch 'submodulev3' into scarliles/splitter-injection-redux
SamuelCarliles3 Apr 15, 2024
a7f5e92
settling injection memory management for now
SamuelCarliles3 Apr 15, 2024
7a70a0b
added regression forest benchmark
SamuelCarliles3 Apr 22, 2024
d9ad68a
Merge pull request #2 from ssec-jhu/scarliles/regression-benchmark
SamuelCarliles3 Apr 22, 2024
893d588
ran black for linting check
SamuelCarliles3 Apr 23, 2024
548493c
Merge branch 'submodulev3' of github.com:ssec-jhu/scikit-learn into s…
SamuelCarliles3 Apr 23, 2024
e4b53ff
Merge branch 'submodulev3' into scarliles/regression-benchmark
SamuelCarliles3 Apr 23, 2024
089d901
Merge branch 'neurodata:submodulev3' into submodulev3
SamuelCarliles3 Apr 24, 2024
3ba5f74
Merge branch 'submodulev3' of github.com:ssec-jhu/scikit-learn into s…
SamuelCarliles3 Apr 24, 2024
cf285c1
Merge branch 'scarliles/splitter-injection-redux' into scarliles/regr…
SamuelCarliles3 Apr 24, 2024
ffc6328
Merge pull request #3 from ssec-jhu/scarliles/regression-benchmark
SamuelCarliles3 Apr 24, 2024
87c90fd
initial pass at refactoring DepthFirstTreeBuilder.build
SamuelCarliles3 May 23, 2024
51da586
some renaming to make closure pattern more obvious
SamuelCarliles3 May 28, 2024
6c117a2
added SplitRecordFactory
SamuelCarliles3 May 28, 2024
c7b675b
Merge branch 'scarliles/update-node-refactor2' into scarliles/update-…
SamuelCarliles3 May 28, 2024
9e7b131
SplitRecordFactory progress
SamuelCarliles3 May 28, 2024
a017669
build loop refactor
SamuelCarliles3 May 29, 2024
4325b0a
add_or_update tweak
SamuelCarliles3 May 29, 2024
78c3a1b
reverted to back out build body refactor
SamuelCarliles3 May 30, 2024
b8cc636
refactor baby step
SamuelCarliles3 May 30, 2024
f225658
update node refactor more baby steps
SamuelCarliles3 May 30, 2024
45 changes: 44 additions & 1 deletion asv_benchmarks/benchmarks/ensemble.py
@@ -2,15 +2,58 @@
GradientBoostingClassifier,
HistGradientBoostingClassifier,
RandomForestClassifier,
RandomForestRegressor,
)

from .common import Benchmark, Estimator, Predictor
from .datasets import (
_20newsgroups_highdim_dataset,
_20newsgroups_lowdim_dataset,
_synth_classification_dataset,
_synth_regression_dataset,
_synth_regression_sparse_dataset,
)
from .utils import make_gen_classif_scorers
from .utils import make_gen_classif_scorers, make_gen_reg_scorers


class RandomForestRegressorBenchmark(Predictor, Estimator, Benchmark):
    """
    Benchmarks for RandomForestRegressor.
    """

    param_names = ["representation", "n_jobs"]
    params = (["dense", "sparse"], Benchmark.n_jobs_vals)

    def setup_cache(self):
        super().setup_cache()

    def make_data(self, params):
        representation, n_jobs = params

        if representation == "sparse":
            data = _synth_regression_sparse_dataset()
        else:
            data = _synth_regression_dataset()

        return data

    def make_estimator(self, params):
        representation, n_jobs = params

        n_estimators = 500 if Benchmark.data_size == "large" else 100

        estimator = RandomForestRegressor(
            n_estimators=n_estimators,
            min_samples_split=10,
            max_features="log2",
            n_jobs=n_jobs,
            random_state=0,
        )

        return estimator

    def make_scorers(self):
        make_gen_reg_scorers(self)


class RandomForestClassifierBenchmark(Predictor, Estimator, Benchmark):
58 changes: 58 additions & 0 deletions sklearn/tree/_splitter.pxd
@@ -6,6 +6,7 @@
# Jacob Schreiber <jmschreiber91@gmail.com>
# Adam Li <adam2392@gmail.com>
# Jong Shin <jshinm@gmail.com>
# Samuel Carliles <scarlil1@jhu.edu>
#
# License: BSD 3 clause

@@ -14,9 +15,49 @@ from libcpp.vector cimport vector

from ._criterion cimport BaseCriterion, Criterion
from ._tree cimport ParentInfo

from ..utils._typedefs cimport float32_t, float64_t, intp_t, int8_t, int32_t, uint32_t


# NICE IDEAS THAT DON'T APPEAR POSSIBLE
# - accessing elements of a memory view of cython extension types in a nogil block/function
# - storing cython extension types in cpp vectors
Comment on lines +22 to +24
Collaborator

Suggested change
# NICE IDEAS THAT DON'T APPEAR POSSIBLE
# - accessing elements of a memory view of cython extension types in a nogil block/function
# - storing cython extension types in cpp vectors
# NICE IDEAS THAT DON'T APPEAR POSSIBLE (Samuel)
# 1. accessing elements of a memory view of cython extension types in a nogil block/function
# 2. storing cython extension types in cpp vectors

Collaborator

It would be great to also comment on what these nice ideas are trying to accomplish. I.e. what's the problem for a new developer coming in and reading this?

Collaborator Author

Here we're simply trying to add a way of injecting functionality whose implementation details are TBD. We just want a way of saying "here's a candidate split, check it against whatever validity constraints you may want to impose, including ones that haven't been thought of yet". So we want to accept a list, memoryview, array, vector, or whatever of instantiated split constraints. Ideally the interface is a simple Python one-liner, so at runtime I can just define an inline Python list of constraints. But that list of constraints then needs to execute efficiently in a Cython nogil block.

Collaborator

I understand. Just hoping to document all these thoughts cleanly, so we don't lose this train of thought when new developers come through.
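
For readers new to this thread, here is a minimal plain-Python sketch of the injection idea described above. It is not the PR's Cython code: `Split`, `Closure`, `max_imbalance`, the dict-based environments, and `is_valid` are hypothetical names chosen for illustration, and the real `SplitRecord` carries different fields. The sketch only shows the shape of the interface: constraints are declared as an inline list, flattened once into (function, environment) pairs, and then checked in a tight loop.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Split:
    """Simplified stand-in for SplitRecord (fields are assumptions)."""
    n_left: int
    n_right: int


@dataclass
class Closure:
    """A function plus the environment it reads its parameters from."""
    f: Callable[[Split, Any], bool]
    e: Any


class SplitCondition:
    """Python-level wrapper owning one closure, mirroring the `c` attribute."""

    def __init__(self, f, e):
        self.c = Closure(f, e)


def min_samples_leaf(split, env):
    # Reject splits that leave either child with too few samples.
    return min(split.n_left, split.n_right) >= env["min_samples"]


def max_imbalance(split, env):
    # Hypothetical extra constraint, added without touching the checker.
    return abs(split.n_left - split.n_right) <= env["max_diff"]


# The "simple one-liner" interface: an inline list of instantiated constraints.
conditions = [
    SplitCondition(min_samples_leaf, {"min_samples": 5}),
    SplitCondition(max_imbalance, {"max_diff": 20}),
]

# Flatten once, outside the hot loop (the analogue of filling a vector of structs).
closures = [cond.c for cond in conditions]


def is_valid(split, closures):
    # A candidate split is valid only if every injected condition accepts it.
    return all(c.f(split, c.e) for c in closures)
```

In the Cython version the flattened list would be a `vector[SplitConditionClosure]`, so the loop body needs no Python objects and can run without the GIL.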

#
# despite the fact that we can access scalar extension type properties in such a context,
# as for instance node_split_best does with Criterion and Partition,
# and we can access the elements of a memory view of primitive types in such a context
Comment on lines +26 to +28
Collaborator

I can't follow what you mean here. Is this related to the "nice ideas" listed above?

#
# SO WHERE DOES THAT LEAVE US
# - we can transform these into cpp vectors of structs
# and with some minor casting irritations everything else works ok
ctypedef void* SplitConditionEnv
ctypedef bint (*SplitConditionFunction)(
Collaborator

Out of curiosity, why does this function have to be ctypedef?

Collaborator Author

I've added a mechanism for injecting functions with a specific signature, and creating a type alias for this specific function pointer type improves readability wherever the type gets used.

Collaborator

I'm dumb so I don't understand fully I think. For testing purposes, perhaps worth adding a unit test to demonstrate what this means? :p

Collaborator Author

Easiest way to see it is in the definition of the closure type:

https://github.com/ssec-jhu/scikit-learn/blob/f2256580d2482e607f40a938f3569f20cec95e95/sklearn/tree/_splitter.pxd#L44

With the ctypedef the closure definition is clear: there's a function pointer and a pointer to struct whose specific definition is TBD. Without the ctypedef, the closure struct definition would be very confusing.
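
The readability argument can be illustrated with Python's standard `ctypes` module, which models C function pointers and structs directly. This is only an analogy to the Cython declarations, under stated assumptions: the function signature is reduced to two arguments, and `MinSamplesLeafEnv`, its `min_samples` field, and the `call` helper are hypothetical. With the two named aliases in place, the closure struct reads as "a function pointer and an opaque environment pointer".

```python
import ctypes


class MinSamplesLeafEnv(ctypes.Structure):
    """Hypothetical environment struct; the field name is an assumption."""
    _fields_ = [("min_samples", ctypes.c_ssize_t)]


# Named aliases, analogous to the two ctypedefs in the .pxd file.
SplitConditionEnv = ctypes.c_void_p
SplitConditionFunction = ctypes.CFUNCTYPE(
    ctypes.c_int,        # return type (Cython's bint is a C int)
    ctypes.c_ssize_t,    # simplified argument: samples in the smaller child
    SplitConditionEnv,   # opaque pointer to the environment
)


class SplitConditionClosure(ctypes.Structure):
    # Thanks to the aliases, the two fields are self-describing.
    _fields_ = [("f", SplitConditionFunction), ("e", SplitConditionEnv)]


def _min_samples_leaf(n_samples, env_address):
    # Cast the opaque pointer back to the concrete environment type.
    env = ctypes.cast(env_address, ctypes.POINTER(MinSamplesLeafEnv)).contents
    return int(n_samples >= env.min_samples)


# Keep `env` and `callback` referenced for as long as the closure is in use.
env = MinSamplesLeafEnv(min_samples=5)
callback = SplitConditionFunction(_min_samples_leaf)
closure = SplitConditionClosure(f=callback, e=ctypes.addressof(env))


def call(closure, n_samples):
    # The caller never needs to know what the environment contains.
    return bool(closure.f(n_samples, closure.e))
```

Without the aliases, the `f` field would have to spell out the whole `CFUNCTYPE(...)` expression inline, which is the confusion the `ctypedef` avoids.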

    Splitter splitter,
    SplitRecord* current_split,
    intp_t n_missing,
    bint missing_go_to_left,
    float64_t lower_bound,
    float64_t upper_bound,
Comment on lines +37 to +40
Collaborator

These parameters are within the SplitRecord

    SplitConditionEnv split_condition_env
) noexcept nogil

cdef struct SplitConditionClosure:
    SplitConditionFunction f
    SplitConditionEnv e
Comment on lines +45 to +46
Collaborator

What are e and f?


cdef class SplitCondition:
    cdef SplitConditionClosure c
Collaborator

What is c here?


cdef class MinSamplesLeafCondition(SplitCondition):
    pass

cdef class MinWeightLeafCondition(SplitCondition):
    pass

cdef class MonotonicConstraintCondition(SplitCondition):
    pass
Comment on lines +44 to +58
Collaborator

Can we comment on what these things are?



cdef struct SplitRecord:
    # Data to track sample split
    intp_t feature # Which feature to split on.
@@ -30,6 +71,13 @@ cdef struct SplitRecord:
    unsigned char missing_go_to_left # Controls if missing values go to the left node.
    intp_t n_missing # Number of missing values for the feature being split on

ctypedef void* SplitRecordFactoryEnv
ctypedef SplitRecord* (*SplitRecordFactory)(SplitRecordFactoryEnv env) except NULL nogil

cdef struct SplitRecordFactoryClosure:
    SplitRecordFactory f
    SplitRecordFactoryEnv e
Comment on lines +74 to +79
Collaborator

Can we also comment here on what these things are meant to do?


cdef class BaseSplitter:
    """Abstract interface for splitter."""

@@ -59,6 +107,8 @@ cdef class BaseSplitter:

    cdef const float64_t[:] sample_weight

    cdef SplitRecordFactoryClosure split_record_factory
Collaborator

Why is it named Closure?

Collaborator Author

Because it is a Cython implementation of a closure. C doesn't support closures as a language-level feature, but a struct that pairs a function pointer with a pointer to a struct of captured variable values behaves the same way.
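
The equivalence can be seen by writing the same check twice in plain Python: once as a language-level closure, and once as an explicit pair of a function and its environment. This is an illustrative sketch, not code from the PR; `make_threshold_check`, `ThresholdEnv`, `ExplicitClosure`, and `call` are hypothetical names.

```python
from typing import Any, Callable, NamedTuple


def make_threshold_check(min_samples):
    """Language-level closure: `check` captures `min_samples` implicitly."""
    def check(n_samples):
        return n_samples >= min_samples
    return check


class ThresholdEnv(NamedTuple):
    """The captured variables, made explicit (what `e` points to in C)."""
    min_samples: int


def check_with_env(n_samples, env):
    """Plain function: everything it needs arrives through `env`."""
    return n_samples >= env.min_samples


class ExplicitClosure(NamedTuple):
    """Function pointer `f` bound to environment `e`, as in the struct."""
    f: Callable[[int, Any], bool]
    e: Any


def call(closure, n_samples):
    # Calling the closure means passing its own environment back to it.
    return closure.f(n_samples, closure.e)


native = make_threshold_check(5)
explicit = ExplicitClosure(f=check_with_env, e=ThresholdEnv(min_samples=5))
```

Both forms give the same answers for the same inputs. The explicit form is the one that can be expressed in C, since a function pointer alone has nowhere to store `min_samples`.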


    # The samples vector `samples` is maintained by the Splitter object such
    # that the samples contained in a node are contiguous. With this setting,
    # `node_split` reorganizes the node samples `samples[start:end]` in two
@@ -90,6 +140,7 @@ cdef class BaseSplitter:
    cdef void node_value(self, float64_t* dest) noexcept nogil
    cdef float64_t node_impurity(self) noexcept nogil
    cdef intp_t pointer_size(self) noexcept nogil
    cdef SplitRecord* create_split_record(self) except NULL nogil

cdef class Splitter(BaseSplitter):
    """Base class for supervised splitters."""
@@ -105,6 +156,13 @@ cdef class Splitter(BaseSplitter):
    cdef const int8_t[:] monotonic_cst
    cdef bint with_monotonic_cst

    cdef SplitCondition min_samples_leaf_condition
    cdef SplitCondition min_weight_leaf_condition
    cdef SplitCondition monotonic_constraint_condition

    cdef vector[SplitConditionClosure] presplit_conditions
    cdef vector[SplitConditionClosure] postsplit_conditions

    cdef int init(
        self,
        object X,