
Design and implementation of random forest nearest neighbors (RF-NN) #24

Open
grovduck opened this issue May 24, 2023 · 6 comments · May be fixed by #85
Labels
enhancement (New feature or request) · estimator (Related to one or more estimators)

Comments

@grovduck
Member

We started to discuss the design and implementation of RF-NN in #22, but thought it would be better to track it in a separate issue. Because RF-NN is quite different from all other estimators in this package in terms of how distances are calculated, we'll likely need to use different sklearn base classes or derive our own to implement this. The goal of this issue is to lay out the design of RF-NN and decide how best to tackle it. As RF-NN was first introduced in the R yaImpute package, we rely heavily on their implementation details.

Fitting RF-NN models

One or more y attributes (along with multiple X covariates) are used to fit different random forests, specifically one forest per y attribute. In yaImpute, each forest is actually a classification forest (i.e. RandomForestClassifier) - the y attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification scheme (equal interval, quantile, natural breaks, etc.).

Somewhat non-intuitively, in RF-NN, the actual values of the terminal nodes (i.e. the class values) in each random forest don't matter, as we only care about the IDs of the terminal nodes where the references land in each forest's trees. For example, if q y attributes are passed, q different forests of n trees will be created. For each reference observation passed to fit, we want to capture the terminal node ID that it falls into for each tree in each forest. Therefore, the needed output of the fit process is both the specification of the forests that were created (to run additional targets) and the "node matrix" of p rows (reference plots) by q * n columns that stores the node IDs. We'll be leaning on the apply method of each forest to return the leaf node IDs.
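A minimal sketch of that node-matrix capture, using a single y attribute (so a single forest) on the iris data; with q forests, the per-forest blocks would be stacked horizontally. Illustrative only, not the proposed implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# One forest per y attribute; a single y attribute gives one forest of n trees.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# apply() returns the leaf (terminal) node ID of every sample in every tree,
# i.e. the (p, n) block of the node matrix contributed by this forest. With
# q forests, hstack-ing the blocks yields the full (p, q * n) node matrix.
node_matrix = forest.apply(X)
print(node_matrix.shape)  # (150, 100)
```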

Predicting targets using node similarity

Functionally, the predict process for each new target is very similar to standard random forests, although the target will be run through all q forests rather than just a single forest. As with the references, we only use the IDs of the terminal nodes that the target falls into - therefore, we get a (q * n) vector of node IDs as a result of passing a target through all forests. At this point, we use a node similarity metric to define the distances from the target to each candidate reference plot. Unlike all other estimators that we've introduced so far, which find neighbors in a space of more than one dimension, distances in RF-NN are calculated as the inverse of the number of nodes that the reference and target share in common. Neighbors are ranked on this distance to find the k nearest neighbors for each target.
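A sketch of that node similarity metric for a single target, assuming distance is simply the reciprocal of the shared-node count (hypothetical helper, not yaImpute's exact code):

```python
import numpy as np

def node_distances(target_nodes, reference_node_matrix):
    """Distance from one target to every reference: the inverse of the number
    of trees (across all q forests) in which they share a terminal node.

    target_nodes: (q * n,) node IDs for the target
    reference_node_matrix: (p, q * n) node IDs for the references
    """
    shared = (reference_node_matrix == target_nodes).sum(axis=1)
    # A reference sharing no nodes with the target is infinitely distant.
    with np.errstate(divide="ignore"):
        return np.where(shared > 0, 1.0 / shared, np.inf)

# Ranking on this distance gives the k nearest neighbors, e.g.:
# neighbor_idx = np.argsort(node_distances(target_nodes, node_matrix))[:k]
```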

Estimator naming

An initial suggestion for the naming of this estimator would be RandomForestNNRegressor. We could also go with RFNNRegressor, as the term RFNN is somewhat well understood by those in the imputation community. To be explicit, I propose the former name and will use it in subsequent discussion.

Design considerations

fit

We can rely on RandomForestClassifier to create the forests for each y attribute passed. As of now, it seems reasonable to have RandomForestNNRegressor composed of q RandomForestClassifiers rather than inherit from it. fit would introduce two estimator attributes: 1) the forests themselves (proposed name: forests_); and 2) the node matrix of the data that was used to fit the model (proposed name: node_matrix_). We'll also likely want to have easy access to a few counts, e.g. number of forests and number of trees in a forest. These could easily be properties of the class derived from information stored in self.forests_.
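To make the composition concrete, here is a minimal skeleton along those lines (names beyond the proposed forests_ and node_matrix_, such as n_forests_ and n_trees_, are hypothetical, and a real estimator would add input validation and sklearn conventions):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier

class RandomForestNNRegressor(BaseEstimator):
    def __init__(self, n_neighbors=5, n_estimators=100, random_state=None):
        self.n_neighbors = n_neighbors
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        y = np.asarray(y).reshape(len(X), -1)  # (p, q)
        self.y_ = y  # stashed for predict (see sketch below)
        # One classifier forest per y attribute.
        self.forests_ = [
            RandomForestClassifier(
                n_estimators=self.n_estimators, random_state=self.random_state
            ).fit(X, y[:, j])
            for j in range(y.shape[1])
        ]
        # Leaf node IDs of the references across all forests: (p, q * n).
        self.node_matrix_ = np.hstack([f.apply(X) for f in self.forests_])
        return self

    @property
    def n_forests_(self):
        return len(self.forests_)

    @property
    def n_trees_(self):
        return self.forests_[0].n_estimators
```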

kneighbors

We will likely need to introduce a new method with the same signature as estimators derived from TransformedKNeighborsMixin, i.e. def kneighbors(self, X=None, n_neighbors=None, return_distance=True). The method would run X through self.forests_, capture a node ID matrix for X, and then row-wise calculate distances from self.node_matrix_. The return value would be the (optional) distances and row indices of the n_neighbors. Note that n_neighbors will be set on the initialization of RandomForestNNRegressor just like our other estimators, but kneighbors can override this by passing a value to n_neighbors.
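Continuing the skeleton above, a brute-force sketch of that method (a real implementation would also exclude each reference as its own neighbor when X is None, as sklearn's kneighbors does):

```python
    def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
        n_neighbors = self.n_neighbors if n_neighbors is None else n_neighbors
        nodes = (
            self.node_matrix_
            if X is None
            else np.hstack([f.apply(X) for f in self.forests_])
        )
        # Shared-node counts between every target row and every reference row;
        # broadcasting to (targets, references, q * n) is memory-hungry but
        # keeps the sketch short.
        shared = (nodes[:, None, :] == self.node_matrix_[None, :, :]).sum(axis=2)
        with np.errstate(divide="ignore"):
            dist = np.where(shared > 0, 1.0 / shared, np.inf)
        neigh_ind = np.argsort(dist, axis=1)[:, :n_neighbors]
        if return_distance:
            return np.take_along_axis(dist, neigh_ind, axis=1), neigh_ind
        return neigh_ind
```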

predict

Initially we thought we might be able to derive from KNeighborsRegressor if we used the callable weights parameter, because given weights and y attributes, the actual calculation of predicted attributes will be exactly the same as in all estimators that derive from TransformedKNeighborsMixin. However, in looking at that implementation more closely, predict still relies on calculating the n-dimensional distances and then applying the callable function to the distance array. We are probably better off introducing a new predict function that is simply np.average with a weights parameter representing the inverse distances calculated from kneighbors.
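Which might look something like this, again continuing the skeleton (self.y_ was stashed in fit above; weighting by raw inverse distance is one choice among several):

```python
    def predict(self, X):
        dist, neigh_ind = self.kneighbors(X)
        # Inverse node-similarity distances as weights; an infinite distance
        # (no shared nodes) naturally contributes zero weight.
        weights = 1.0 / dist
        return np.stack(
            [
                np.average(self.y_[idx], axis=0, weights=w)
                for idx, w in zip(neigh_ind, weights)
            ]
        )
```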

Timing

As this estimator deviates substantially from the other estimators already introduced, I propose we first merge fb_add_estimators back into main and treat this as a separate feature branch.

Other possible enhancements

  1. Rather than capturing a flat node matrix that is the combination of all forests and all trees within forests, we could capture a node matrix for each forest such that it behaves like an "axis", more akin to our other estimators. Distances could then be calculated per axis, and n-dimensional Euclidean distance could yield neighbors (see the sketch after this list). I believe this would be a new twist on RF-NN that is not implemented in yaImpute.
  2. Many of the other estimators (e.g. GNNRegressor and MSNRegressor) apply dimensionality reduction such that the resultant axes are ordered based on the amount of variation explained. The eigenvalues from these axes are typically used to weight the axes when finding neighbors. As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forests themselves) that could provide weights when calculating distances. I've not seen anything to capture this, but also have not looked very closely at all diagnostics from RandomForestClassifier.
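Regarding enhancement 1, a sketch of what per-forest "axes" might look like, with each forest contributing one node-similarity distance and the axes combined by Euclidean distance (purely speculative, per the note above that this isn't in yaImpute):

```python
import numpy as np

def per_forest_distances(target_nodes_by_forest, reference_nodes_by_forest):
    """target_nodes_by_forest: q arrays of shape (n,);
    reference_nodes_by_forest: q arrays of shape (p, n)."""
    axis_dists = []
    for t_nodes, r_nodes in zip(target_nodes_by_forest, reference_nodes_by_forest):
        shared = (r_nodes == t_nodes).sum(axis=1)
        with np.errstate(divide="ignore"):
            axis_dists.append(np.where(shared > 0, 1.0 / shared, np.inf))
    # (p, q) per-axis distances combined into one n-dimensional Euclidean
    # distance per reference.
    return np.linalg.norm(np.column_stack(axis_dists), axis=1)
```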
@grovduck grovduck added the enhancement (New feature or request) and estimator (Related to one or more estimators) labels May 24, 2023
@aazuspan
Contributor

Thanks for an awesome write-up, @grovduck! This is a huge help to get me up to speed on the topic. I suspect this will be a long discussion and I'll have more questions as I get a better understanding of the details, but a few things occur to me off the bat.

To begin with, am I right in assuming that we want to be able to accurately reproduce the results of RF-NN as implemented in yaImpute? In other words, no changes to the basic algorithm?

In yaImpute, each forest is actually a classification forest (i.e. RandomForestClassifier) - the y attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification scheme (equal interval, quantile, natural breaks, etc.).

Is the conversion from continuous to categorical targets up to the user? Otherwise, that seems challenging to handle flexibly with an automated approach. Should we consider offering a RandomForestRegressor based option that uses continuous targets? Or maybe a hybrid option that uses classifier forests for categorical and regressor forests for continuous targets? I'm not sure what the API would look like for that, but just a thought.

RandomForestNNRegressor. We could also go with RFNNRegressor, as the term RFNN is somewhat well understood by those in the imputation community. To be explicit, I propose the former name and will use it in subsequent discussion.

My first thought is that RFNNRegressor is consistent with the slightly-less explicit direction we use for GNNRegressor and MSNRegressor, but I don't mind RandomForestNNRegressor either.

As this estimator deviates substantially from the other estimators already introduced, I propose we first merge fb_add_estimators back into main and treat this as a separate feature branch.

Makes sense to me!

As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forest themselves) that may provide weights when calculating distances.

This is a very interesting idea. Given that there is a forest for each target, it does seem like there's a strong potential for poorly performing forests to add a lot of noise to predictions. Would it make sense to weight forests based on an accuracy metric, such that targets that aren't predicted well will contribute less to the neighbor selection? Would this be handled through an optional argument on predict, or an initialization parameter of the estimator, or something else?

@grovduck grovduck added this to the Complete estimators milestone Sep 27, 2023
@aazuspan
Contributor

aazuspan commented Oct 5, 2023

Note: remove the "in development" disclaimer from the README once RFNN is merged.

https://github.com/lemma-osu/scikit-learn-knn-regression/blob/cc7795796f02cda465dc110c40f83708884ac8b2/README.md?plain=1#L86

@grovduck
Member Author

@aazuspan, yes, back from the dead! I think I mentioned to you that I've been doing a bit of thinking on this one locally with no real code to show yet. The very first step was to figure out whether we could replicate the output of yaImpute for this estimator, given the randomization in building the trees in a forest. After way too much deliberation, I think the answer is a qualified "no" based on this understanding:

  1. I think it is possible to port the code exactly if we were to use rpy2 to call R's randomForest directly. I am very hesitant to go down that path given that it adds a dependency that seems to me to be notoriously difficult to work with on Windows, not to mention requiring users to have R. Additionally, I worry about performance when calling R to do the prediction from those trees - other collaborators have noted that RF-NN is very slow.
  2. We can't trust that scikit-learn's implementation of RandomForestClassifier will use the same internal algorithm as randomForest. I did a very quick test with the iris dataset using similar hyperparameters (the sklearn side is sketched after this list):
    • ntree = n_estimators = 100
    • mtry = max_features = 2
    • random seed set to 42
      The trees that came out split on different features with different thresholds, so I've convinced myself they are different enough.
  3. I had thought about fitting through rpy2/randomForest and then somehow translating the resultant trees/forests for use with DecisionTreeClassifier and then predicting from there, but ChatGPT convinced me that this path was full of pain and we don't get away from many of the issues noted in 1.
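For reference, the sklearn half of the quick test from point 2 looks roughly like this (the R side would be randomForest(ntree = 100, mtry = 2) with set.seed(42)):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

iris = load_iris()

# ntree = n_estimators = 100, mtry = max_features = 2, seed = 42
forest = RandomForestClassifier(n_estimators=100, max_features=2, random_state=42)
forest.fit(iris.data, iris.target)

# Dump the first tree's split features and thresholds for comparison with
# R's getTree(rf, k = 1, labelVar = TRUE).
print(export_text(forest.estimators_[0], feature_names=iris.feature_names))
```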

In terms of ensuring "correctness" for porting, I think we have a couple of options:

  1. Implement exclusively using scikit-learn and don't worry about matching IDs and distances
  2. Do 1., but check for high correlations between the predicted y values in our test data (from yaImpute) and sknnr's output
  3. Create a more-or-less throwaway implementation for both fit and predict that uses the rpy2 route (detailed in 1. above) and test both neighbors and distances as before. This would test that we are doing the distance and neighbor finding correctly, so we would only have a single implementation of the kneighbors method.

Honestly, 3. seems like a lot of work for something that we're only worried about in the short term (correctly porting yaImpute), but it is the most rigorous. You always have good ideas that I haven't thought of, so I'm curious to hear your perspective!
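For what option 2 might look like in practice - a one-time check correlating yaImpute's predicted attributes against sknnr's, with hypothetical file names:

```python
import numpy as np

# Hypothetical exports: rows are targets, columns are y attributes.
yai_pred = np.loadtxt("yaimpute_predictions.csv", delimiter=",")
sknnr_pred = np.loadtxt("sknnr_predictions.csv", delimiter=",")

for j in range(yai_pred.shape[1]):
    r = np.corrcoef(yai_pred[:, j], sknnr_pred[:, j])[0, 1]
    print(f"attribute {j}: r = {r:.3f}")
```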

@grovduck
Member Author

@aazuspan, following up on some long-neglected questions that you asked:

To begin with, am I right in assuming that we want to be able to accurately reproduce the results of RF-NN as implemented in yaImpute? In other words, no changes to the basic algorithm?

Yes, with the huge caveat I've raised above!

Is the conversion from continuous to categorical targets up to the user? Otherwise, that seems challenging to handle flexibly with an automated approach. Should we consider offering a RandomForestRegressor based option that uses continuous targets? Or maybe a hybrid option that uses classifier forests for categorical and regressor forests for continuous targets? I'm not sure what the API would look like for that, but just a thought.

This one took a bit of digging, but I think I understand it a bit better now. First off, it's possible to build forests using regression if the version of randomForest is 4.5-19 or later and rfMode is not set to "buildClasses" ("buildClasses" is the default, hence I was overlooking it).

If rfMode = "buildClasses", any continuous y feature is split into classes using the pretty R function after determining the number of classes using the nclass.Sturges R function. We can probably port these over to match. But I also like your idea of the hybrid option, which is how yaImpute behaves if rfMode is not set to "buildClasses". Perhaps we start with this and support the "build automatic classes" behavior at a later time?
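As a rough sketch of what porting that binning could look like - NumPy exposes the Sturges rule directly, though this does not replicate pretty's snapping of breaks to round numbers:

```python
import numpy as np

rng = np.random.default_rng(42)
y_cont = rng.lognormal(mean=2.0, sigma=0.5, size=500)  # a fake continuous y

# nclass.Sturges: ceil(log2(n)) + 1 classes; bins="sturges" applies the
# same rule (but, unlike R's pretty(), the break points aren't "nice").
edges = np.histogram_bin_edges(y_cont, bins="sturges")
y_class = np.digitize(y_cont, edges[1:-1])  # class labels 0..k-1

print(len(edges) - 1, "classes")  # 10 for n = 500
```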

My first thought is that RFNNRegressor is consistent with the slightly-less explicit direction we use for GNNRegressor and MSNRegressor, but I don't mind RandomForestNNRegressor either.

Yep, I like RFNNRegressor as well. We'll stick with that.

This is a very interesting idea. Given that there is a forest for each target, it does seem like there's a strong potential for poorly performing forests to add a lot of noise to predictions. Would it make sense to weight forests based on an accuracy metric, such that targets that aren't predicted well will contribute less to the neighbor selection? Would this be handled through an optional argument on predict, or an initialization parameter of the estimator, or something else?

Let's pursue this one in a separate issue/PR. It's obviously taken me long enough to get any momentum for this first part!

Thanks for your great feedback.

@aazuspan
Contributor

@grovduck, thanks for the detailed rundown! I agree with your hesitation around tying our implementation to an R dependency - that seems like a recipe for maintenance and installation headaches. I can see how that does complicate the prospect of validating our implementation though. I think you covered all the options well, so I don't have anything very useful to add, but in my opinion option 2 sounds like a good compromise that builds some confidence in our implementation without requiring a lot of extra, throwaway effort.

For the "correlation check", were you picturing that as a one-time exercise for our peace of mind, or something that would end up in the port tests? If we could get away with just the former, that would be great, but maybe it would require implementing regression tests in #42 first to ensure we don't break it down the line?

Honestly, 3. seems like a lot of work for something that we're only worried about in the short term (correctly porting yaImpute)

Totally agreed. We could put a caveat in our documentation to explain that there are potential differences from yaImpute due to the different random forest implementations.

If rfMode = "buildClasses", any continuous y feature is split into classes using the pretty R function after determining the number of classes using the nclass.Sturges R function. We can probably port these over to match. But I also like your idea of the hybrid option, which is how yaImpute behaves if rfMode is not set to "buildClasses". Perhaps we start with this and support the "build automatic classes" behavior at a later time?

Great sleuthing! I like the idea of starting with the simplest viable approach, even if it doesn't match the yaImpute default behavior. With that said, the hybrid approach sounds like it would have some complexity as well if we need to identify continuous vs. categorical targets... Any thoughts on how best to accomplish that? I could see using pandas.Categorical when fitting with dataframes, but I'm not sure how we'd support something similar with NumPy arrays... Does the yaImpute implementation look for factors?
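One possible heuristic (an assumption on my part, not anything yaImpute does) would be to lean on dtypes for DataFrames and fall back to numeric-vs-not for arrays:

```python
import numpy as np
import pandas as pd

def is_categorical(column):
    """Guess whether a single y column should get a classifier forest."""
    if isinstance(column, pd.Series):
        # Explicit Categorical (or string/object) columns -> categorical.
        return isinstance(column.dtype, pd.CategoricalDtype) or column.dtype == object
    # Plain NumPy arrays carry no factor information, so all we can do is
    # check whether the dtype is numeric.
    return not np.issubdtype(np.asarray(column).dtype, np.number)
```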

Let's pursue this one in a separate issue/PR.

Sounds reasonable!

@grovduck
Member Author

Honestly, 3. seems like a lot of work for something that we're only worried about in the short term (correctly porting yaImpute)

Totally agreed. We could put a caveat in our documentation to explain that there are potential differences from yaImpute due to the different random forest implementations.

@aazuspan, as I dug into this a bit more (and recognized my own propensity for being absolutely certain!), I realized that the straight port using rpy2 was actually not too much effort, and I was able to test against output from yaImpute. I'm going to open a PR here soon that details my plan further, but I think it's basically:

  1. Push a temporary commit that uses R and rpy2 and tests outputs from yaImpute. If you're able to test these without too much effort (e.g. R setup) and they work locally for you, that will satisfy me. These tests won't pass on CI as we don't have R set up there, and I don't even want to know if that's possible!
  2. Remove the rpy2 implementation and build in the RandomForestClassifier / RandomForestRegressor solution. I think I'll propose comparing correlation coefficients for the output predicted attributes and make that a one-time check and document it in the PR.
  3. Capture the neighbors/distances/predicted values using the sklearn estimators and replace those in the data files for test_port.py. They won't be the yaImpute output, but it will still make sure that I don't screw something up in the implementation while we iterate on the RFNNRegressor design.

@grovduck grovduck linked a pull request Jan 20, 2025 that will close this issue