Merge branch 'main' into update_testv2
PSSF23 authored Oct 23, 2023
2 parents 030cb5d + 359ea75 commit 896b312
Showing 13 changed files with 258 additions and 34 deletions.
22 changes: 22 additions & 0 deletions .github/workflows/cffconvert.yml
@@ -0,0 +1,22 @@
name: cffconvert

on:
  push:
    paths:
      - CITATION.cff
  pull_request:
    paths:
      - CITATION.cff

jobs:
  validate:
    name: "validate"
    runs-on: ubuntu-latest
    steps:
      - name: Check out a copy of the repository
        uses: actions/checkout@v4

      - name: Check whether the citation metadata from CITATION.cff is valid
        uses: citation-file-format/cffconvert-github-action@2.0.0
        with:
          args: "--validate"
31 changes: 31 additions & 0 deletions CITATION.cff
@@ -0,0 +1,31 @@
# YAML 1.2
---
# Metadata for citation of this software according to the CFF format (https://citation-file-format.github.io/)
cff-version: 1.2.0
title: "Scikit-tree: Modern decision-trees compatible with scikit-learn in Python."
abstract: "scikit-tree is a scikit-learn compatible API for building state-of-the-art decision trees. These include unsupervised trees, oblique trees, uncertainty trees, quantile trees and causal trees."
authors:
  - given-names: Adam
    family-names: Li
    affiliation: "Department of Computer Science, Columbia University, New York, NY, USA"
    orcid: "https://orcid.org/0000-0001-8421-365X"
  - given-names: Sambit
    family-names: Panda
    affiliation: "Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA"
    orcid: "https://orcid.org/0000-0001-8455-4243"
  - given-names: Haoyin
    family-names: Xu
    affiliation: "Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA"
    orcid: "https://orcid.org/0000-0001-8235-4950"
type: software
repository-code: "https://github.com/neurodata/scikit-tree"
license: 'BSD-3-Clause'
keywords:
  - random forest
  - oblique trees
  - honest forests
  - statistical learning
  - machine learning
message: >-
  Please cite this software using the metadata from
  'preferred-citation' in the CITATION.cff file.
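
As a rough local pre-check of a CITATION.cff file, the sketch below looks for the four top-level keys that CFF 1.2.0 requires (`authors`, `cff-version`, `message`, `title`). It is an assumption-laden illustration, not a replacement for `cffconvert --validate`: the helper names are hypothetical, and the file is scanned line by line instead of being parsed as YAML.

```python
import re

# Top-level keys that the CFF 1.2.0 schema marks as required.
REQUIRED_KEYS = ("authors", "cff-version", "message", "title")


def top_level_keys(text):
    """Collect the keys of unindented `key:` lines.

    Indented lines, list items, comments and the `---` marker do not match.
    """
    keys = set()
    for line in text.splitlines():
        match = re.match(r"([A-Za-z][\w-]*):", line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_required_keys(text):
    """Return the required keys that are absent, in REQUIRED_KEYS order."""
    present = top_level_keys(text)
    return [key for key in REQUIRED_KEYS if key not in present]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    with open("CITATION.cff", encoding="utf-8") as handle:
        print(missing_required_keys(handle.read()))
```

An empty list means all four keys are present; it says nothing about whether their values are valid, which is what the workflow checks.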
122 changes: 116 additions & 6 deletions DEVELOPING.md
@@ -1,14 +1,103 @@
<!-- TOC -->

- [Requirements](#requirements)
- [Setting up your development environment](#setting-up-your-development-environment)
- [Building the project from source](#building-the-project-from-source)
- [Development Tasks](#development-tasks)
- [Basic Verification](#basic-verification)
- [Docsite](#docsite)
- [Details](#details)
- [Coding Style](#coding-style)
- [Lint](#lint)
- [Type checking](#type-checking)
- [Unit tests](#unit-tests)
- [Advanced Updating submodules](#advanced-updating-submodules)
- [Cython and C++](#cython-and-c)
- [Making a Release](#making-a-release)

<!-- /TOC -->

# Requirements
* Python 3.8+
* Poetry (`curl -sSL https://install.python-poetry.org | python - --version=1.2.2`)

For the other requirements, inspect the ``pyproject.toml`` file. If you are updated the dependencies, please run `poetry update` to update the
* Python 3.9+
* numpy>=1.25
* scipy>=1.11
* scikit-learn>=1.3.1

For the other requirements, inspect the ``pyproject.toml`` file.
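
To see whether an environment meets the minimum versions listed above, a small standard-library check along these lines may help. It is only a sketch: the helper names are hypothetical, the comparison looks at leading numeric components only, and pre-release suffixes such as `rc1` are ignored.

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUMS = {"numpy": "1.25", "scipy": "1.11", "scikit-learn": "1.3.1"}


def version_tuple(version):
    """Turn '1.3.1' into (1, 3, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(minimums=MINIMUMS):
    """Map each package to True, False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```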

# Setting up your development environment

We recommend using miniconda, as Python virtual environments may not properly set up the compilers needed for our compiled code. For detailed information on setting up and managing conda environments, see https://conda.io/docs/test-drive.html.

<!-- Setup a conda env -->

    conda create -n sktree
    conda activate sktree

**Make sure you specify a Python version if your system defaults to anything less than Python 3.9.**

**ALWAYS run commands after you have activated your conda environment.**

Next, install the necessary build dependencies. For more information, see https://scikit-learn.org/stable/developers/advanced_installation.html.

    conda install -c conda-forge joblib threadpoolctl pytest compilers llvm-openmp

Assuming these steps have worked properly and you have read and followed any necessary scikit-learn advanced installation instructions, you can then install dependencies for scikit-tree.

If you are developing locally, you will need the build dependencies to compile the Cython / C++ code:

    pip install -r build_requirements.txt

The remaining requirements can be installed with:

    pip install -r requirements.txt
    pip install -r style_requirements.txt
    pip install -r test_requirements.txt
    pip install -r doc_requirements.txt

# Building the project from source

We leverage meson to build scikit-tree from source. We utilize a CLI tool, called [spin](https://github.com/scientific-python/spin), which wraps certain meson commands to make building easier.

For example, the following command builds the project completely from scratch:

    spin build --clean

If part of the build is already done, you can run:

    spin build

The following command tests the project:

    spin test

For other commands, see:

    spin --help

Note that at this stage you will be unable to run Python commands directly. For example, ``pytest ./sktree`` will not work.

However, after building the project from source using meson, you can use an editable install to make testing code changes much faster. For more information on meson-python's progress in supporting editable installs, see https://meson-python.readthedocs.io/en/latest/how-to-guides/editable-installs.html.

    pip install --no-build-isolation --editable .

**Note: editable installs for scikit-tree REQUIRE you to have built the project using meson already.** This links the meson build to your Python runtime. Now if you run

    pytest ./sktree

the unit tests should run.
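
To confirm that the editable install is what Python actually picks up, one option is to ask the import system where `sktree` resolves from. This is a sketch: `locate_package` is a hypothetical helper name, and the path reported for a meson-python editable install may point into the build directory instead of the source tree.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # "sktree" is the import name used in the pytest command above.
    origin = locate_package("sktree")
    if origin is None:
        print("sktree is not importable; build and install it first")
    else:
        print(f"sktree resolves to {origin}")
```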

# Development Tasks
There are a series of top-level tasks available through Poetry. These can each be run via
There is a series of top-level tasks available through Poetry. If you have updated the dependencies, please run `poetry update` to update the lock file. Each task can be run via

`poetry run poe <taskname>`

To do so, first install poetry and poethepoet.

    pip install poetry poethepoet

Now, you are ready to run quick commands to format the codebase, lint the codebase and type-check the codebase.

### Basic Verification
* **format** - runs the suite of formatting tools applying tools to make code compliant
* **format_check** - runs the suite of formatting tools checking for compliance
@@ -53,6 +142,23 @@ In order for any code to be added to the repository, we require unit tests to pa

    poetry run poe unit_test

# (Advanced) Updating submodules

Scikit-tree relies on a submodule of a forked version of scikit-learn for certain Python and Cython code that extends the ``DecisionTree*`` models. Usually, a developer who needs to change this code should go to the ``submodulev3`` branch of ``https://github.com/neurodata/scikit-learn`` and submit a PR against the submodule.

This should **ALWAYS** be supported by some use-case in scikit-tree. We want to keep the code changes in our forked version of scikit-learn minimal, so that it stays easy to merge in upstream changes, bug fixes and features for tree-based code.

Once the PR is submitted and merged, the developer can update the submodule here in scikit-tree so that we leverage the new commit. You **must** update the submodule commit ID and commit this change, so that the build uses the new submodule commit.

    git submodule update --init --recursive --remote
    git add -A
    git commit -m "Update submodule" -s

Now, you can re-build the project using the latest submodule changes.

    spin build --clean
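
Before committing a submodule bump, it can help to see which commit each submodule currently points at. The sketch below wraps `git submodule status` and decodes its one-character prefix; the helper names are hypothetical, and it assumes `git` is on the `PATH` and that submodule paths contain no spaces.

```python
import subprocess

# Prefix characters printed by `git submodule status`.
STATES = {
    " ": "in sync",
    "+": "differs from recorded commit",
    "-": "not initialized",
    "U": "merge conflict",
}


def parse_status_line(line):
    """Split one `git submodule status` line into (state, sha, path)."""
    state = STATES.get(line[0], "unknown")
    fields = line[1:].split()
    return state, fields[0], fields[1]


def submodule_status(repo_dir="."):
    """Run `git submodule status` and parse every non-empty output line."""
    result = subprocess.run(
        ["git", "submodule", "status"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return [parse_status_line(line) for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for state, sha, path in submodule_status():
        print(f"{path}: {sha[:10]} ({state})")
```

A `+` state after `git submodule update --remote` is expected: it means the checked-out commit is newer than the one recorded in the superproject, which is the change that still needs to be committed.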

# Cython and C++
The general design of scikit-tree follows that of the tree-models inside scikit-learn, where tree-based models are inherently Cythonized, or written with C++. Then the actual forest (e.g. RandomForest, or ExtraForest) is just a Python API wrapper that creates an ensemble of the trees.

@@ -68,13 +174,17 @@ https://github.com/neurodata/scikit-tree/actions/workflows/build_wheels.yml will

2. Upload wheels to test PyPi

twine upload --repository-url https://test.pypi.org/legacy/ dist/*
```
twine upload --repository-url https://test.pypi.org/legacy/ dist/*
```

Verify that installations work as expected on your machine.

3. Upload wheels

twine upload dist/*
```
twine upload dist/*
```

or if you have two-factor authentication enabled: https://pypi.org/help/#apitoken

1 change: 1 addition & 0 deletions README.md
@@ -5,6 +5,7 @@
[![codecov](https://codecov.io/gh/neurodata/scikit-tree/branch/main/graph/badge.svg?token=H1reh7Qwf4)](https://codecov.io/gh/neurodata/scikit-tree)
[![PyPI Download count](https://img.shields.io/pypi/dm/scikit-tree.svg)](https://pypistats.org/packages/scikit-tree)
[![Latest PyPI release](https://img.shields.io/pypi/v/scikit-tree.svg)](https://pypi.org/project/scikit-tree/)
[![DOI](https://zenodo.org/badge/491260497.svg)](https://zenodo.org/doi/10.5281/zenodo.8412279)

scikit-tree
===========
1 change: 1 addition & 0 deletions doc/whats_new/_contributors.rst
@@ -26,3 +26,4 @@
.. _SUKI-O : https://github.com/SUKI-O
.. _Ronan Perry : https://rflperry.github.io/
.. _Haoyin Xu : https://github.com/PSSF23
.. _Yuxin Bai : https://github.com/YuxinB
3 changes: 2 additions & 1 deletion doc/whats_new/v0.3.rst
@@ -15,6 +15,7 @@ Changelog
- |Fix| Fixes a bug in consistency of train/test samples when ``random_state`` is not set in FeatureImportanceForestClassifier and FeatureImportanceForestRegressor, by `Adam Li`_ (:pr:`135`)
- |Fix| Fixes a bug where covariate indices were not shuffled by default when running FeatureImportanceForestClassifier and FeatureImportanceForestRegressor test methods, by `Sambit Panda`_ (:pr:`140`)
- |Enhancement| Add multi-view splitter for axis-aligned decision trees, by `Adam Li`_ (:pr:`129`)
- |Enhancement| Add stratified sampling option to ``FeatureImportance*`` via the ``stratify`` keyword argument, by `Yuxin Bai`_ (:pr:`143`)

Code and Documentation Contributors
-----------------------------------
@@ -24,4 +25,4 @@ the project since version inception, including:

* `Adam Li`_
* `Sambit Panda`_

* `Yuxin Bai`_
@@ -1,7 +1,7 @@
"""
===========================================================
Mutual Information for Gigantic Hypothesis Testing (MIGHT)
===========================================================
=========================================================
Mutual Information for Genuine Hypothesis Testing (MIGHT)
=========================================================
An example using :class:`~sktree.stats.FeatureImportanceForestClassifier` for nonparametric
multivariate hypothesis test, on simulated datasets. Here, we present a simulation
@@ -49,8 +49,8 @@
# We simulate the two feature sets, and the target variable. We then combine them
# into a single dataset to perform hypothesis testing.

n_samples = 1000
n_features_set = 500
n_samples = 2000
n_features_set = 20
mean = 1.0
sigma = 2.0
beta = 5.0
@@ -91,7 +91,7 @@
# computed as the proportion of samples in the null distribution that are less than the
# observed test statistic.

n_estimators = 200
n_estimators = 100
max_features = "sqrt"
test_size = 0.2
n_repeats = 1000
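
The p-value rule stated in the comment above can be sketched in plain Python. `empirical_p_value` is a hypothetical helper, not part of sktree, and it follows the comment literally; the library's own computation may differ, for example in the direction of the comparison or by adding a finite-sample correction.

```python
def empirical_p_value(null_distribution, observed):
    """Fraction of null statistics that are strictly less than the observed one."""
    null_values = list(null_distribution)
    if not null_values:
        raise ValueError("null_distribution must contain at least one value")
    below = sum(1 for value in null_values if value < observed)
    return below / len(null_values)


# Toy numbers for illustration, not output from the example script.
null_stats = [0.1, 0.2, 0.3, 0.4]
print(empirical_p_value(null_stats, 0.25))  # 0.5
```

With `n_repeats = 1000` the null distribution has 1000 entries, so the smallest non-zero value this rule can return is 0.001.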
@@ -103,12 +103,12 @@
max_features=max_features,
tree_estimator=DecisionTreeClassifier(),
random_state=seed,
honest_fraction=0.7,
honest_fraction=0.25,
n_jobs=n_jobs,
),
random_state=seed,
test_size=test_size,
permute_per_tree=True,
permute_per_tree=False,
sample_dataset_per_tree=False,
)

@@ -1,7 +1,7 @@
"""
===============================================================================
Mutual Information for Gigantic Hypothesis Testing (MIGHT) with Imbalanced Data
===============================================================================
==============================================================================
Mutual Information for Genuine Hypothesis Testing (MIGHT) with Imbalanced Data
==============================================================================
Here, we demonstrate how to do hypothesis testing on highly imbalanced data
in terms of their feature-set dimensionalities.
@@ -17,7 +17,7 @@
For other examples of hypothesis testing, see the following:
- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_MI_gigantic_hypothesis_testing_forest.py`
- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_MI_genuine_hypothesis_testing_forest.py`
- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_might_auc.py`
For more information on the multi-view decision-tree, see
2 changes: 2 additions & 0 deletions examples/hypothesis_testing/plot_might_auc.py
@@ -94,6 +94,7 @@
y,
metric=metric,
return_posteriors=True,
max_fpr=max_fpr,
)

print(f"ASH-90 / Partial AUC: {stat}")
@@ -110,6 +111,7 @@
y,
metric=metric,
return_posteriors=True,
max_fpr=max_fpr,
)

print(f"ASH-90 / Partial AUC: {stat}")
3 changes: 2 additions & 1 deletion requirements.txt
@@ -1,3 +1,4 @@
numpy>=1.25
scipy
scipy>=1.11
scikit-learn>=1.3.1
