-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Splitter Injection and Refactoring of DepthFirstTreeBuilder's building mechanism #67
base: submodulev3
Are you sure you want to change the base?
Changes from all commits
8c09f7f
ecfc9b1
0c3d5c0
5fd12a2
b593ee0
180fac3
c207c3e
7cc71c1
2470d49
ee3399f
a079e4f
5397b66
44f1d57
4f19d53
cb71be0
e34be5c
aac802e
c12f2fd
a7f5e92
7a70a0b
d9ad68a
893d588
548493c
e4b53ff
089d901
3ba5f74
cf285c1
ffc6328
87c90fd
51da586
6c117a2
c7b675b
9e7b131
a017669
4325b0a
78c3a1b
b8cc636
f225658
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ | |
# Jacob Schreiber <jmschreiber91@gmail.com> | ||
# Adam Li <adam2392@gmail.com> | ||
# Jong Shin <jshinm@gmail.com> | ||
# Samuel Carliles <scarlil1@jhu.edu> | ||
# | ||
# License: BSD 3 clause | ||
|
||
|
@@ -14,9 +15,49 @@ from libcpp.vector cimport vector | |
|
||
from ._criterion cimport BaseCriterion, Criterion | ||
from ._tree cimport ParentInfo | ||
|
||
from ..utils._typedefs cimport float32_t, float64_t, intp_t, int8_t, int32_t, uint32_t | ||
|
||
|
||
# NICE IDEAS THAT DON'T APPEAR POSSIBLE | ||
# - accessing elements of a memory view of cython extension types in a nogil block/function | ||
# - storing cython extension types in cpp vectors | ||
# | ||
# despite the fact that we can access scalar extension type properties in such a context, | ||
# as for instance node_split_best does with Criterion and Partition, | ||
# and we can access the elements of a memory view of primitive types in such a context | ||
Comment on lines
+26
to
+28
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can't follow what you mean here. Is this related to the "nice ideas" listed above? |
||
# | ||
# SO WHERE DOES THAT LEAVE US | ||
# - we can transform these into cpp vectors of structs | ||
# and with some minor casting irritations everything else works ok | ||
ctypedef void* SplitConditionEnv | ||
ctypedef bint (*SplitConditionFunction)( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Out of curiosity, why does this function have to be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've added a mechanism for accepting for injection functions with a specific signature, and creating a macro for this specific type of function pointer improves readability where the type gets used. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm dumb so I don't understand fully I think. For testing purposes, perhaps worth adding a unit test to demonstrate what this means? :p There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Easiest way to see it is in the definition of the closure type: With the ctypedef the closure definition is clear: there's a function pointer and a pointer to struct whose specific definition is TBD. Without the ctypedef, the closure struct definition would be very confusing. |
||
Splitter splitter, | ||
SplitRecord* current_split, | ||
intp_t n_missing, | ||
bint missing_go_to_left, | ||
float64_t lower_bound, | ||
float64_t upper_bound, | ||
Comment on lines
+37
to
+40
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These parameters are within the |
||
SplitConditionEnv split_condition_env | ||
) noexcept nogil | ||
|
||
cdef struct SplitConditionClosure: | ||
SplitConditionFunction f | ||
SplitConditionEnv e | ||
Comment on lines
+45
to
+46
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is |
||
|
||
cdef class SplitCondition: | ||
cdef SplitConditionClosure c | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is |
||
|
||
cdef class MinSamplesLeafCondition(SplitCondition): | ||
pass | ||
|
||
cdef class MinWeightLeafCondition(SplitCondition): | ||
pass | ||
|
||
cdef class MonotonicConstraintCondition(SplitCondition): | ||
pass | ||
Comment on lines
+44
to
+58
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we comment on what these things are? |
||
|
||
|
||
cdef struct SplitRecord: | ||
# Data to track sample split | ||
intp_t feature # Which feature to split on. | ||
|
@@ -30,6 +71,13 @@ cdef struct SplitRecord: | |
unsigned char missing_go_to_left # Controls if missing values go to the left node. | ||
intp_t n_missing # Number of missing values for the feature being split on | ||
|
||
ctypedef void* SplitRecordFactoryEnv | ||
ctypedef SplitRecord* (*SplitRecordFactory)(SplitRecordFactoryEnv env) except NULL nogil | ||
|
||
cdef struct SplitRecordFactoryClosure: | ||
SplitRecordFactory f | ||
SplitRecordFactoryEnv e | ||
Comment on lines
+74
to
+79
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we also comment here on what these things are meant to do? |
||
|
||
cdef class BaseSplitter: | ||
"""Abstract interface for splitter.""" | ||
|
||
|
@@ -59,6 +107,8 @@ cdef class BaseSplitter: | |
|
||
cdef const float64_t[:] sample_weight | ||
|
||
cdef SplitRecordFactoryClosure split_record_factory | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is it named Closure? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because it is a cython implementation of a closure. C doesn't support closures as a language level feature, but a struct of a function pointer bound with a struct of variable values functions the same. |
||
|
||
# The samples vector `samples` is maintained by the Splitter object such | ||
# that the samples contained in a node are contiguous. With this setting, | ||
# `node_split` reorganizes the node samples `samples[start:end]` in two | ||
|
@@ -90,6 +140,7 @@ cdef class BaseSplitter: | |
cdef void node_value(self, float64_t* dest) noexcept nogil | ||
cdef float64_t node_impurity(self) noexcept nogil | ||
cdef intp_t pointer_size(self) noexcept nogil | ||
cdef SplitRecord* create_split_record(self) except NULL nogil | ||
|
||
cdef class Splitter(BaseSplitter): | ||
"""Base class for supervised splitters.""" | ||
|
@@ -105,6 +156,13 @@ cdef class Splitter(BaseSplitter): | |
cdef const int8_t[:] monotonic_cst | ||
cdef bint with_monotonic_cst | ||
|
||
cdef SplitCondition min_samples_leaf_condition | ||
cdef SplitCondition min_weight_leaf_condition | ||
cdef SplitCondition monotonic_constraint_condition | ||
|
||
cdef vector[SplitConditionClosure] presplit_conditions | ||
cdef vector[SplitConditionClosure] postsplit_conditions | ||
|
||
cdef int init( | ||
self, | ||
object X, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to also comment on what these nice ideas are trying to accomplish. I.e. what's the problem for a new developer coming in and reading this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we're simply trying to add a way of injecting functionality whose implementation details are TBD. We just want a way of saying "here's a candidate split, let me check it against any arbitrary validity constraints you may want to impose at some future date as of the time of this writing". So we want to accept a list, a memoryview, array, vector, whatever, of instantiated split constraints. Ideally the interface is a simple python one-liner, so at runtime I can just define an inline python list of constraints. But that list of constraints then needs to be executable performantly in a cython nogil block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand. Just hoping to document all the thoughts in a clean manner, so we don't lose this trains of thoughts when new developers come thru.