Merge branch 'main' into update_testv2

neurodata · Oct 23, 2023 · 896b312
2 parents 030cb5d + 359ea75
Showing 13 changed files with 258 additions and 34 deletions.
diff --git a/.github/workflows/cffconvert.yml b/.github/workflows/cffconvert.yml
@@ -0,0 +1,22 @@
+name: cffconvert
+
+on:
+  push:
+    paths:
+      - CITATION.cff
+  pull_request:
+    paths:
+      - CITATION.cff        
+
+jobs:
+  validate:
+    name: "validate"
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out a copy of the repository
+        uses: actions/checkout@v4
+
+      - name: Check whether the citation metadata from CITATION.cff is valid
+        uses: citation-file-format/cffconvert-github-action@2.0.0
+        with:
+          args: "--validate"
diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,31 @@
+# YAML 1.2
+---
+# Metadata for citation of this software according to the CFF format (https://citation-file-format.github.io/)
+cff-version: 1.2.0
+title: "Scikit-tree: Modern decision-trees compatible with scikit-learn in Python."
+abstract: "scikit-tree is a scikit-learn compatible API for building state-of-the-art decision trees. These include unsupervised trees, oblique trees, uncertainty trees, quantile trees and causal trees."
+authors:
+  - given-names: Adam
+    family-names: Li
+    affiliation: "Department of Computer Science, Columbia University, New York, NY, USA"
+    orcid: "https://orcid.org/0000-0001-8421-365X"
+  - given-names: Sambit
+    family-names: Panda
+    affiliation: "Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA"
+    orcid: "https://orcid.org/0000-0001-8455-4243"
+  - given-names: Haoyin
+    family-names: Xu
+    affiliation: "Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA"
+    orcid: "https://orcid.org/0000-0001-8235-4950"
+type: software
+repository-code: "https://github.com/neurodata/scikit-tree"
+license: 'BSD-3-Clause'
+keywords:
+  - random forest
+  - oblique trees
+  - honest forests
+  - statistical learning
+  - machine learning
+message: >-
+  Please cite this software using the metadata from
+  'preferred-citation' in the CITATION.cff file.
diff --git a/DEVELOPING.md b/DEVELOPING.md
@@ -1,14 +1,103 @@
+<!-- TOC -->
+
+- [Requirements](#requirements)
+- [Setting up your development environment](#setting-up-your-development-environment)
+- [Building the project from source](#building-the-project-from-source)
+- [Development Tasks](#development-tasks)
+        - [Basic Verification](#basic-verification)
+        - [Docsite](#docsite)
+    - [Details](#details)
+        - [Coding Style](#coding-style)
+        - [Lint](#lint)
+        - [Type checking](#type-checking)
+        - [Unit tests](#unit-tests)
+- [(Advanced) Updating submodules](#advanced-updating-submodules)
+- [Cython and C++](#cython-and-c)
+- [Making a Release](#making-a-release)
+
+<!-- /TOC -->
+
 # Requirements
-* Python 3.8+
-* Poetry (`curl -sSL https://install.python-poetry.org | python - --version=1.2.2`)
 
-For the other requirements, inspect the ``pyproject.toml`` file. If you are updated the dependencies, please run `poetry update` to update the
+* Python 3.9+
+* numpy>=1.25
+* scipy>=1.11
+* scikit-learn>=1.3.1
+
+For the other requirements, inspect the ``pyproject.toml`` file.
+
+# Setting up your development environment
+
+We recommend using miniconda, as Python virtual environments may not properly set up the compilers necessary for our compiled code. For detailed information on setting up and managing conda environments, see https://conda.io/docs/test-drive.html.
+
+<!-- Setup a conda env -->
+
+    conda create -n sktree
+    conda activate sktree
+
+**Make sure you specify a Python version if your system defaults to anything less than Python 3.9.**
+
+**Any commands should ALWAYS be run after you have activated your conda environment.**
+Next, install necessary build dependencies. For more information, see https://scikit-learn.org/stable/developers/advanced_installation.html.
+
+    conda install -c conda-forge joblib threadpoolctl pytest compilers llvm-openmp
+
+Assuming these steps have worked properly and you have read and followed any necessary scikit-learn advanced installation instructions, you can then install dependencies for scikit-tree.
+
+If you are developing locally, you will need the build dependencies to compile the Cython / C++ code:
+
+    pip install -r build_requirements.txt
+
+Other requirements can be installed as such:
+
+    pip install -r requirements.txt
+    pip install -r style_requirements.txt
+    pip install -r test_requirements.txt
+    pip install -r doc_requirements.txt
+
+# Building the project from source
+
+We leverage meson to build scikit-tree from source. We utilize a CLI tool, called [spin](https://github.com/scientific-python/spin), which wraps certain meson commands to make building easier.
+
+For example, the following command will build the project completely from scratch
+
+    spin build --clean
+
+If you have part of the build already done, you can run:
+
+    spin build
+
+The following command will test the project
+
+    spin test
+
+For other commands, see
+
+    spin --help
+
+Note that at this stage, you will be unable to run Python commands directly. For example, ``pytest ./sktree`` will not work.
+
+However, after installing and building the project from source using meson, you can leverage editable installs to make testing code changes much faster. For more information on meson-python's progress supporting editable installs in a better fashion, see https://meson-python.readthedocs.io/en/latest/how-to-guides/editable-installs.html.
+
+    pip install --no-build-isolation --editable .
+
+**Note: editable installs for scikit-tree REQUIRE you to have built the project using meson already.** This will now link the meson build to your Python runtime. Now if you run
+
+    pytest ./sktree
+
+the unit-tests should run.
 
 # Development Tasks
-There are a series of top-level tasks available through Poetry. These can each be run via
+There are a series of top-level tasks available through Poetry. If you have updated the dependencies, please run `poetry update` to update the lock file. These can each be run via
 
  `poetry run poe <taskname>`
 
+To do so, first install poetry and poethepoet.
+
+    pip install poetry poethepoet
+
+Now, you are ready to run quick commands to format the codebase, lint the codebase and type-check the codebase.
+
 ### Basic Verification
 * **format** - runs the suite of formatting tools applying tools to make code compliant
 * **format_check** - runs the suite of formatting tools checking for compliance
@@ -53,6 +142,23 @@ In order for any code to be added to the repository, we require unit tests to pa
 
     poetry run poe unit_test
 
+# (Advanced) Updating submodules
+
+Scikit-tree relies on a submodule of a forked version of scikit-learn for certain Python and Cython code that extends the ``DecisionTree*`` models. Usually, a developer who needs to make changes should go to the ``submodulev3`` branch on ``https://github.com/neurodata/scikit-learn`` and
+submit a PR to change the submodule.
+
+This should **ALWAYS** be supported by some use-case in scikit-tree. We want the minimal amount of code-change in our forked version of scikit-learn to make it very easy to merge in upstream changes, bug fixes and features for tree-based code.
+
+Once a PR is submitted and merged, the developer can update the submodule here in scikit-tree, so that we leverage the new commit. You **must** update the submodule commit ID and also commit this change, so that the build leverages the new submodule commit ID.
+
+    git submodule update --init --recursive --remote
+    git add -A
+    git commit -m "Update submodule" -s
+
+Now, you can re-build the project using the latest submodule changes.
+
+    spin build --clean
+
 # Cython and C++
 The general design of scikit-tree follows that of the tree-models inside scikit-learn, where tree-based models are inherently Cythonized, or written with C++. Then the actual forest (e.g. RandomForest, or ExtraForest) is just a Python API wrapper that creates an ensemble of the trees.
 
@@ -68,13 +174,17 @@ https://github.com/neurodata/scikit-tree/actions/workflows/build_wheels.yml will
 
 2. Upload wheels to test PyPi
 
-    twine upload --repository-url https://test.pypi.org/legacy/ dist/*
+```
+twine upload --repository-url https://test.pypi.org/legacy/ dist/*
+```
 
 Verify that installations work as expected on your machine.
 
 3. Upload wheels
 
-    twine upload dist/*
+```
+twine upload dist/*
+```
 
 or if you have two-factor authentication enabled: https://pypi.org/help/#apitoken
 

diff --git a/README.md b/README.md
@@ -5,6 +5,7 @@
 [![codecov](https://codecov.io/gh/neurodata/scikit-tree/branch/main/graph/badge.svg?token=H1reh7Qwf4)](https://codecov.io/gh/neurodata/scikit-tree)
 [![PyPI Download count](https://img.shields.io/pypi/dm/scikit-tree.svg)](https://pypistats.org/packages/scikit-tree)
 [![Latest PyPI release](https://img.shields.io/pypi/v/scikit-tree.svg)](https://pypi.org/project/scikit-tree/)
+[![DOI](https://zenodo.org/badge/491260497.svg)](https://zenodo.org/doi/10.5281/zenodo.8412279)
 
 scikit-tree
 ===========

diff --git a/doc/whats_new/_contributors.rst b/doc/whats_new/_contributors.rst
@@ -26,3 +26,4 @@
 .. _SUKI-O : https://github.com/SUKI-O
 .. _Ronan Perry : https://rflperry.github.io/
 .. _Haoyin Xu : https://github.com/PSSF23
+.. _Yuxin Bai : https://github.com/YuxinB
diff --git a/doc/whats_new/v0.3.rst b/doc/whats_new/v0.3.rst
@@ -15,6 +15,7 @@ Changelog
 - |Fix| Fixes a bug in consistency of train/test samples when ``random_state`` is not set in FeatureImportanceForestClassifier and FeatureImportanceForestRegressor, by `Adam Li`_ (:pr:`135`)
 - |Fix| Fixes a bug where covariate indices were not shuffled by default when running FeatureImportanceForestClassifier and FeatureImportanceForestRegressor test methods, by `Sambit Panda`_ (:pr:`140`)
 - |Enhancement| Add multi-view splitter for axis-aligned decision trees, by `Adam Li`_ (:pr:`129`)
+- |Enhancement| Add stratified sampling option to ``FeatureImportance*`` via the ``stratify`` keyword argument, by `Yuxin Bai`_ (:pr:`143`)
 
 Code and Documentation Contributors
 -----------------------------------
@@ -24,4 +25,4 @@ the project since version inception, including:
 
 * `Adam Li`_
 * `Sambit Panda`_
-
+* `Yuxin Bai`_
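
The changelog entry above adds a ``stratify`` option to ``FeatureImportance*``. As a rough illustration of what stratified sampling means for a train/test split, here is a minimal standard-library sketch. The function name `stratified_split` and its signature are hypothetical and are not part of the sktree API; the actual implementation in PR 143 may differ.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_size: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices so each class keeps roughly the same test fraction.

    Hypothetical helper for illustration only; not the sktree implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# An imbalanced toy problem: 80 samples of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.2)
```

With a plain random split, the minority class could by chance be under-represented in the held-out set; splitting per class keeps the 80/20 class ratio in both partitions.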
diff --git a/..._MI_gigantic_hypothesis_testing_forest.py → ...t_MI_genuine_hypothesis_testing_forest.py b/..._MI_gigantic_hypothesis_testing_forest.py → ...t_MI_genuine_hypothesis_testing_forest.py
@@ -1,7 +1,7 @@
 """
-===========================================================
-Mutual Information for Gigantic Hypothesis Testing (MIGHT)
-===========================================================
+=========================================================
+Mutual Information for Genuine Hypothesis Testing (MIGHT)
+=========================================================
 
 An example using :class:`~sktree.stats.FeatureImportanceForestClassifier` for nonparametric
 multivariate hypothesis test, on simulated datasets. Here, we present a simulation
@@ -49,8 +49,8 @@
 # We simulate the two feature sets, and the target variable. We then combine them
 # into a single dataset to perform hypothesis testing.
 
-n_samples = 1000
-n_features_set = 500
+n_samples = 2000
+n_features_set = 20
 mean = 1.0
 sigma = 2.0
 beta = 5.0
@@ -91,7 +91,7 @@
 # computed as the proportion of samples in the null distribution that are less than the
 # observed test statistic.
 
-n_estimators = 200
+n_estimators = 100
 max_features = "sqrt"
 test_size = 0.2
 n_repeats = 1000
@@ -103,12 +103,12 @@
         max_features=max_features,
         tree_estimator=DecisionTreeClassifier(),
         random_state=seed,
-        honest_fraction=0.7,
+        honest_fraction=0.25,
         n_jobs=n_jobs,
     ),
     random_state=seed,
     test_size=test_size,
-    permute_per_tree=True,
+    permute_per_tree=False,
     sample_dataset_per_tree=False,
 )
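
The comment in the hunk above describes the p-value as the proportion of null-distribution samples that fall below the observed test statistic. A minimal sketch of that calculation follows. The helper name `permutation_pvalue` is hypothetical, and the `+1` smoothing in the numerator and denominator is an assumption (a common convention that avoids reporting a p-value of exactly zero); it is not confirmed to be what sktree does internally.

```python
from typing import Sequence


def permutation_pvalue(null_distribution: Sequence[float], observed: float) -> float:
    """Fraction of null statistics at or below the observed one.

    Hypothetical helper; the +1 terms are an assumed smoothing convention.
    """
    if len(null_distribution) == 0:
        raise ValueError("null_distribution must not be empty")
    n_at_or_below = sum(1 for value in null_distribution if value <= observed)
    return (1 + n_at_or_below) / (1 + len(null_distribution))


# Nine made-up null statistics and two candidate observed values.
null = [0.10, 0.25, 0.30, 0.45, 0.50, 0.60, 0.75, 0.80, 0.90]
pvalue_extreme = permutation_pvalue(null, observed=0.05)  # below every null sample
pvalue_typical = permutation_pvalue(null, observed=0.50)  # in the middle of the null
```

With `n_repeats = 1000` as in the example, the smallest p-value this convention can report is 1/1001.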
 

diff --git a/examples/hypothesis_testing/plot_MI_imbalanced_hyppo_testing.py b/examples/hypothesis_testing/plot_MI_imbalanced_hyppo_testing.py
@@ -1,7 +1,7 @@
 """
-===============================================================================
-Mutual Information for Gigantic Hypothesis Testing (MIGHT) with Imbalanced Data
-===============================================================================
+==============================================================================
+Mutual Information for Genuine Hypothesis Testing (MIGHT) with Imbalanced Data
+==============================================================================
 
 Here, we demonstrate how to do hypothesis testing on highly imbalanced data
 in terms of their feature-set dimensionalities.
@@ -17,7 +17,7 @@
 
 For other examples of hypothesis testing, see the following:
 
-- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_MI_gigantic_hypothesis_testing_forest.py`
+- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_MI_genuine_hypothesis_testing_forest.py`
 - :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_might_auc.py`
 
 For more information on the multi-view decision-tree, see

diff --git a/examples/hypothesis_testing/plot_might_auc.py b/examples/hypothesis_testing/plot_might_auc.py
@@ -94,6 +94,7 @@
     y,
     metric=metric,
     return_posteriors=True,
+    max_fpr=max_fpr,
 )
 
 print(f"ASH-90 / Partial AUC: {stat}")
@@ -110,6 +111,7 @@
     y,
     metric=metric,
     return_posteriors=True,
+    max_fpr=max_fpr,
 )
 
 print(f"ASH-90 / Partial AUC: {stat}")
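
Both hunks above now pass `max_fpr` through to the statistic, which the example prints as a partial AUC. To make that quantity concrete, here is a standard-library sketch of a partial AUC: the area under the ROC curve restricted to false-positive rates at or below `max_fpr`. The function names `roc_points` and `partial_auc` are hypothetical. This computes the raw, unnormalized area and is not claimed to match sktree's internal computation; scikit-learn's `roc_auc_score(..., max_fpr=...)`, for instance, returns a standardized variant instead.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (fpr, tpr) points, one per distinct score threshold, from (0, 0)."""
    n_pos = sum(1 for label in y_true if label == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Walk the samples from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all samples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(y_true: Sequence[int], scores: Sequence[float], max_fpr: float) -> float:
    """Trapezoidal area under the ROC curve for fpr in [0, max_fpr]."""
    if not 0.0 < max_fpr <= 1.0:
        raise ValueError("max_fpr must be in (0, 1]")
    points = roc_points(y_true, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the cut-off and stop there.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy labels and scores, made up for illustration.
y_toy = [1, 1, 0, 1, 0, 0]
scores_toy = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
full_area = partial_auc(y_toy, scores_toy, max_fpr=1.0)
half_area = partial_auc(y_toy, scores_toy, max_fpr=0.5)
```

Setting `max_fpr=1.0` recovers the ordinary AUC, so the raw partial area is bounded above by `max_fpr` itself.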

diff --git a/requirements.txt b/requirements.txt
@@ -1,3 +1,4 @@
 numpy>=1.25
-scipy
+scipy>=1.11
 scikit-learn>=1.3.1
+