# Design and implementation of random forest nearest neighbors (RF-NN) #24
Thanks for an awesome write-up, @grovduck! This is a huge help to get me up to speed on the topic. I suspect this will be a long discussion and I'll have more questions as I get a better understanding of the details, but a few things occur to me off the bat. To begin with, am I right in assuming that we want to be able to accurately reproduce the results of RF-NN as implemented in `yaImpute`?
Is the conversion from continuous to categorical targets up to the user? Otherwise, that seems challenging to handle flexibly with an automated approach. Should we consider offering a …?
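For a concrete picture, something like scikit-learn's `KBinsDiscretizer` could handle that conversion (an illustration only, not a settled API):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustration only: bin one continuous y attribute into 5 classes so it
# can be passed to a RandomForestClassifier, roughly analogous to
# yaImpute's equal-interval ("uniform") or quantile classification modes.
rng = np.random.default_rng(0)
y_continuous = rng.normal(loc=50, scale=10, size=(200, 1))

binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_classes = binner.fit_transform(y_continuous).ravel().astype(int)
```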
My first thought is that …
Makes sense to me!
This is a very interesting idea. Given that there is a forest for each target, it does seem like there's a strong potential for poorly performing forests to add a lot of noise to predictions. Would it make sense to weight forests based on an accuracy metric, such that targets that aren't predicted well will contribute less to the neighbor selection? Would this be handled through an optional argument on …?
Note: remove the "in development" disclaimer from the README once …
@aazuspan, yes, back from the dead! I think I mentioned to you that I've been doing a bit of thinking on this one locally with no real code to show yet. The very first step was to figure out whether we could replicate the output of `yaImpute`.
In terms of ensuring "correctness" for porting, I think we have a couple of options:
Honestly, option 3 seems like a lot of work for something that we're worried about in the short term (correctly porting …).
@aazuspan, following up on some long-neglected questions that you asked:
Yes, with the huge caveat I've raised above!
This one took a bit of digging, but I think I understand it a bit better now. First off, it's possible to build forests using regression if the version of … If …

Yep, I like …

Let's pursue this one in a separate issue/PR. It's obviously taken me long enough to get any momentum for this first part! Thanks for your great feedback.
@grovduck, thanks for the detailed rundown! I agree with your hesitation around tying our implementation to an R dependency - that seems like a recipe for maintenance and installation headaches. I can see how that does complicate the prospect of validating our implementation though. I think you covered all the options well, so I don't have anything very useful to add, but in my opinion option 2 sounds like a good compromise that builds some confidence in our implementation without requiring a lot of extra, throwaway effort. For the "correlation check", were you picturing that as a one-time exercise for our peace of mind, or something that would end up in the port tests? If we could get away with just the former, that would be great, but maybe it would require implementing regression tests in #42 first to ensure we don't break it down the line?
Totally agreed. We could put a caveat in our documentation to explain that there are potential differences from yaImpute due to the different random forest implementations.
Great sleuthing! I like the idea of starting with the simplest viable approach, even if it doesn't match the yaImpute default behavior. With that said, the hybrid approach sounds like it would have some complexity as well if we need to identify continuous vs. categorical targets... Any thoughts on how best to accomplish that? I could see using …
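One possibility, purely illustrative (assumes pandas inputs, which we haven't committed to):

```python
import pandas as pd

def infer_target_kind(s: pd.Series) -> str:
    # Hypothetical heuristic: explicit categorical/object dtypes are treated
    # as categorical targets; numeric dtypes are assumed continuous and
    # would need binning before fitting a classification forest.
    if isinstance(s.dtype, pd.CategoricalDtype) or s.dtype == object:
        return "categorical"
    return "continuous"

y = pd.DataFrame({
    "veg_class": pd.Categorical(["conifer", "hardwood", "conifer"]),
    "basal_area": [12.3, 4.5, 8.9],
})
# {'veg_class': 'categorical', 'basal_area': 'continuous'}
kinds = {col: infer_target_kind(y[col]) for col in y.columns}
```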
Sounds reasonable!
@aazuspan, as I dug into this a bit more (and recognizing my own propensity for being absolutely certain!), I realized that the straight port using …
We started to discuss the design and implementation of RF-NN in #22, but thought it would be better to track it in a separate issue. Because RF-NN is quite different from all the other estimators in this package in terms of how distances are calculated, we'll likely need to use different `sklearn` base classes or derive our own to implement it. The goal of this issue is to lay out the design of RF-NN and decide how best to tackle it. As RF-NN was first introduced in the R `yaImpute` package, we rely heavily on its implementation details.

## Fitting RF-NN models
One or more `y` attributes (along with multiple `X` covariates) are used to fit different random forests, notably one forest per `y` attribute. In `yaImpute`, each forest is actually a classification forest (i.e. a `RandomForestClassifier`): the `y` attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification mode (equal interval, quantile, natural breaks, etc.).

Somewhat non-intuitively, in RF-NN the actual values of the terminal nodes (i.e. the class values) in each random forest don't matter; we only care about the IDs of the terminal nodes where the references land in the forests' trees. For example, if `q` `y` attributes are passed, `q` different forests of `n` trees each will be created. For each reference observation passed to `fit`, we want to capture the terminal node ID that it falls into for each tree of each forest. Therefore, the needed output of the `fit` process is both the specification of the forests that were created (to run additional targets) and the "node matrix" of `p` rows (reference plots) by `q * n` columns that stores the node IDs. We'll be leaning on the `apply` method of each forest to return the leaf node IDs.
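A minimal sketch of what `fit` might produce (assuming numpy arrays and pre-binned categorical `y` attributes; all names here are provisional):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_forests(X, Y, n_trees=50, random_state=0):
    """Fit one classification forest per y attribute and build the node matrix.

    X: (p, m) covariates for the p reference observations.
    Y: (p, q) categorical (or pre-binned) y attributes.
    Returns the q fitted forests and the (p, q * n_trees) matrix of leaf
    node IDs, one block of n_trees columns per forest.
    """
    forests = [
        RandomForestClassifier(n_estimators=n_trees, random_state=random_state).fit(X, y)
        for y in np.asarray(Y).T
    ]
    # apply() returns the leaf index each sample lands in for every tree.
    node_matrix = np.hstack([forest.apply(X) for forest in forests])
    return forests, node_matrix
```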
## Predicting targets using node similarity

Functionally, the `predict` process for each new target is very similar to standard random forests, although the target will be run through all `q` forests rather than just a single forest. As with the references, we only use the IDs of the terminal nodes that the target falls into; therefore, we get a `(q * n)` vector of node IDs as a result of passing a target through all forests. At this point, we use a node-similarity metric to define the distances from the target to each candidate reference plot. Unlike all the other estimators we've introduced so far, which have a neighbor space of more than one dimension, distances in RF-NN are calculated as the inverse of the number of nodes that the reference and target share in common. Neighbors are ranked on this distance to find the `k` nearest neighbors for each target.
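For concreteness, a sketch of that distance for a single target against all references (building on the node matrix from the `fit` sketch above):

```python
import numpy as np

def node_distances(target_nodes, node_matrix):
    """Distances from one target to every reference plot.

    target_nodes: (q * n,) leaf IDs for the target across all forests.
    node_matrix: (p, q * n) leaf IDs for the p references.
    """
    # Count the nodes each reference shares with the target, then invert.
    shared = (node_matrix == target_nodes).sum(axis=1)
    with np.errstate(divide="ignore"):
        return np.where(shared > 0, 1.0 / shared, np.inf)
```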
## Estimator naming

An initial suggestion for the naming of this estimator is `RandomForestNNRegressor`. We could also go with `RFNNRegressor`, as the term RFNN is somewhat well understood by those in the imputation community. To be explicit, I propose the former name and use it in subsequent discussion.
## Design considerations

### `fit`
We can rely on `RandomForestClassifier` to create the forests for each `y` attribute passed. As of now, it seems reasonable to have `RandomForestNNRegressor` composed of `q` `RandomForestClassifier`s rather than inherit from one. `fit` would introduce two estimator attributes: 1) the forests themselves (proposed name: `forests_`); and 2) the node matrix of the data that was used to fit the model (proposed name: `node_matrix_`). We'll also likely want easy access to a few counts, e.g. the number of forests and the number of trees in a forest. These could easily be properties of the class derived from information stored in `self.forests_`.
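A possible skeleton along these lines (attribute names per the proposal above; everything else is provisional):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier

class RandomForestNNRegressor(BaseEstimator):
    """Sketch: composed of q RandomForestClassifiers, not derived from one."""

    def __init__(self, n_neighbors=5, n_estimators=50):
        self.n_neighbors = n_neighbors
        self.n_estimators = n_estimators

    def fit(self, X, Y):
        # One classification forest per y attribute.
        self.forests_ = [
            RandomForestClassifier(n_estimators=self.n_estimators).fit(X, y)
            for y in np.asarray(Y).T
        ]
        self.node_matrix_ = np.hstack([f.apply(X) for f in self.forests_])
        return self

    @property
    def n_forests_(self):
        # Counts derived from self.forests_ rather than stored separately.
        return len(self.forests_)

    @property
    def n_trees_(self):
        return self.forests_[0].n_estimators
```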
### `kneighbors`

We will likely need to introduce a new method with the same signature as estimators derived from `TransformedKNeighborsMixin`, i.e. `def kneighbors(self, X=None, n_neighbors=None, return_distance=True)`. The method would run `X` through `self.forests_`, capture a node ID matrix for `X`, and then row-wise calculate distances from `self.node_matrix_`. The return value would be the (optional) distances and row indexes of the `n_neighbors` nearest references. Note that `n_neighbors` will be set on initialization of `RandomForestNNRegressor` just like our other estimators, but `kneighbors` can override this by passing a value to `n_neighbors`.
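A functional sketch of that logic (written standalone for clarity; in practice it would be a method operating on `self.forests_` and `self.node_matrix_`):

```python
import numpy as np

def kneighbors(forests, node_matrix, X, n_neighbors=5, return_distance=True):
    """Rank references by node similarity for each target row in X."""
    # Leaf IDs for the targets across all forests: (n_targets, q * n).
    target_nodes = np.hstack([f.apply(X) for f in forests])
    # Shared-node counts between every target and every reference: (n_targets, p).
    shared = (target_nodes[:, None, :] == node_matrix[None, :, :]).sum(axis=2)
    with np.errstate(divide="ignore"):
        dist = np.where(shared > 0, 1.0 / shared, np.inf)
    # Indexes of the k smallest distances per target, plus those distances.
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]
    dist = np.take_along_axis(dist, idx, axis=1)
    return (dist, idx) if return_distance else idx
```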
### `predict`

Initially we thought we might be able to derive from `KNeighborsRegressor` if we used the callable `weights` parameter, because given weights and `y` attributes, the actual calculation of predicted attributes would be exactly the same as in all estimators that derive from `TransformedKNeighborsMixin`. However, looking at that implementation more closely, `predict` still relies on calculating the n-dimensional distances and then applying the callable to the distance array. We are probably better off introducing a new `predict` function that is simply `np.average` with a weights parameter representing the inverse distances calculated from `kneighbors`.
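A sketch of such a `predict`, taking weights as the inverse of the distances returned by `kneighbors` (the exact weighting scheme is still open):

```python
import numpy as np

def predict(Y_reference, distances, indices):
    """Distance-weighted average of each target's neighbors' y attributes.

    Y_reference: (p, q) reference y attributes.
    distances, indices: (n_targets, k) output of kneighbors.
    """
    weights = 1.0 / distances  # assumes finite, nonzero distances
    return np.array([
        np.average(Y_reference[idx], axis=0, weights=w)
        for idx, w in zip(indices, weights)
    ])
```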
## Timing

As this estimator deviates substantially from the other estimators already introduced, I propose we first merge `fb_add_estimators` back into `main` and treat this as a separate feature branch.
## Other possible enhancements

Other estimators that we've ported from `yaImpute` (e.g. `GNNRegressor` and `MSNRegressor`) apply dimensionality reduction such that the resultant axes are ordered based on the amount of variation explained. The eigenvalues from these axes are typically used to weight the axes when finding neighbors. As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forests themselves) that could provide weights when calculating distances. I've not seen anything to capture this, but I also have not looked very closely at all the diagnostics from `RandomForestClassifier`.
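Purely speculative, but accuracy-based forest weights (e.g. each forest's `oob_score_`) could enter the distance calculation something like this:

```python
import numpy as np

def weighted_node_distances(target_nodes, node_matrix, forest_weights, n_trees):
    """Node distances where matches are scaled by a per-forest weight.

    forest_weights: (q,) accuracy-based weights, e.g. oob_score_ per forest,
    so that poorly performing forests contribute less to neighbor selection.
    """
    matches = node_matrix == target_nodes            # (p, q * n_trees) booleans
    weights = np.repeat(forest_weights, n_trees)     # align weights with columns
    shared = (matches * weights).sum(axis=1)         # weighted match counts
    with np.errstate(divide="ignore"):
        return np.where(shared > 0, 1.0 / shared, np.inf)
```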