[ENH] Cython axis-aligned multi-view splitter (#129)
* Cython multiview

---------

Signed-off-by: Adam Li <adam2392@gmail.com>
adam2392 authored Oct 12, 2023
1 parent ccacf7b commit eb946d4
Showing 43 changed files with 2,031 additions and 257 deletions.
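
For reference, a minimal usage sketch of the estimator introduced by this commit is shown below. It is not part of the commit itself; it assumes the public MultiViewDecisionTreeClassifier exposes a ``feature_set_ends`` keyword mirroring the splitter parameter exercised in the new example script.

# Minimal usage sketch (not part of this commit). Assumption: the public
# MultiViewDecisionTreeClassifier accepts a `feature_set_ends` keyword mirroring
# the splitter parameter used in the new multi-view splitter example.
import numpy as np
from sktree.tree import MultiViewDecisionTreeClassifier

rng = np.random.default_rng(0)

# Two feature sets stacked column-wise: view 1 has 3 features, view 2 has 6.
X = np.hstack([rng.normal(size=(100, 3)), rng.normal(size=(100, 6))])
y = rng.integers(0, 2, size=100)

# `feature_set_ends` marks the exclusive end column of each feature set.
clf = MultiViewDecisionTreeClassifier(feature_set_ends=[3, 9], random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))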
11 changes: 11 additions & 0 deletions .github/workflows/build_wheels.yml
@@ -19,6 +19,17 @@ concurrency:
cancel-in-progress: true

jobs:
create_artifacts_folder:
name: Create Artifacts Folder
runs-on: ubuntu-latest
outputs:
artifacts_folder: ${{ steps.set_folder.outputs.folder_name }}

steps:
- name: Set Artifacts Folder
id: set_folder
run: echo "folder_name=artifacts/${{ github.run_id }}" >> $GITHUB_ENV

build_wheels:
name: Build wheels on ${{ matrix.os[1] }} - ${{ matrix.os[2] }} with Python ${{ matrix.python[0] }}
runs-on: ${{ matrix.os[0] }}
46 changes: 0 additions & 46 deletions .github/workflows/main.yml
@@ -249,52 +249,6 @@ jobs:
name: sktree-build
path: $PWD/build

# release is ran when a release is made on Github
release:
name: Release
runs-on: ubuntu-latest
needs: [build_and_test_slow]
if: startsWith(github.ref, 'refs/tags/')
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Setup Python ${{ matrix.python-version }}
uses: actions/setup-python@v4.6.1
with:
python-version: 3.9
architecture: "x64"
- name: Install dependencies
run: |
python -m pip install --progress-bar off --upgrade pip setuptools wheel
python -m pip install --progress-bar off build twine
- name: Prepare environment
run: |
echo "RELEASE_VERSION=${GITHUB_REF#refs/tags/v}" >> $GITHUB_ENV
echo "TAG=${GITHUB_REF#refs/tags/}" >> $GITHUB_ENV
- name: Download package distribution files
uses: actions/download-artifact@v3
with:
name: package
path: dist
# TODO: refactor scripts to generate release notes from `whats_new.rst` file instead
# - name: Generate release notes
# run: |
# python scripts/release_notes.py > ${{ github.workspace }}-RELEASE_NOTES.md
- name: Publish package to PyPI
run: |
twine upload -u ${{ secrets.PYPI_USERNAME }} -p ${{ secrets.PYPI_PASSWORD }} dist/*
- name: Publish GitHub release
uses: softprops/action-gh-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
# body_path: ${{ github.workspace }}-RELEASE_NOTES.md
prerelease: ${{ contains(env.TAG, 'rc') }}
files: |
dist/*
# build-windows:
# name: Meson build Windows
# runs-on: windows-2019
61 changes: 61 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,61 @@
name: "Release to PyPI"

concurrency:
group: ${{ github.workflow }}-${{ github.event.number }}-${{ github.event.type }}
cancel-in-progress: true

on:
workflow_run:
workflows: ["build_and_test_slow"]
types:
- completed
workflow_dispatch:

env:
INSTALLDIR: "build-install"
CCACHE_DIR: "${{ github.workspace }}/.ccache"

jobs:
# release is run when a release is made on GitHub
release:
name: Release
runs-on: ubuntu-latest
needs: [build_and_test_slow]
if: startsWith(github.ref, 'refs/tags/')
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Setup Python ${{ matrix.python-version }}
uses: actions/setup-python@v4.6.1
with:
python-version: 3.9
architecture: "x64"
- name: Install dependencies
run: |
python -m pip install --progress-bar off --upgrade pip setuptools wheel
python -m pip install --progress-bar off build twine
- name: Prepare environment
run: |
echo "RELEASE_VERSION=${GITHUB_REF#refs/tags/v}" >> $GITHUB_ENV
echo "TAG=${GITHUB_REF#refs/tags/}" >> $GITHUB_ENV
- name: Download package distribution files
uses: actions/download-artifact@v3
with:
name: package
path: dist
# TODO: refactor scripts to generate release notes from `whats_new.rst` file instead
# - name: Generate release notes
# run: |
# python scripts/release_notes.py > ${{ github.workspace }}-RELEASE_NOTES.md
- name: Publish package to PyPI
run: |
twine upload -u ${{ secrets.PYPI_USERNAME }} -p ${{ secrets.PYPI_PASSWORD }} dist/*
- name: Publish GitHub release
uses: softprops/action-gh-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
# body_path: ${{ github.workspace }}-RELEASE_NOTES.md
prerelease: ${{ contains(env.TAG, 'rc') }}
files: |
dist/*
4 changes: 4 additions & 0 deletions DEVELOPING.md
@@ -76,4 +76,8 @@ Verify that installations work as expected on your machine.

twine upload dist/*

or, if you have two-factor authentication enabled, use an API token (https://pypi.org/help/#apitoken):

twine upload dist/* --repository scikit-tree

4. Update the version number in ``meson.build`` and ``_version.py`` to the relevant version.
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -65,6 +65,7 @@ how scikit-learn builds trees.
PatchObliqueRandomForestClassifier
PatchObliqueRandomForestRegressor
HonestForestClassifier
MultiViewRandomForestClassifier

.. currentmodule:: sktree.tree
.. autosummary::
@@ -75,6 +76,7 @@ how scikit-learn builds trees.
PatchObliqueDecisionTreeClassifier
PatchObliqueDecisionTreeRegressor
HonestTreeClassifier
MultiViewDecisionTreeClassifier

Unsupervised
------------
1 change: 1 addition & 0 deletions doc/whats_new/v0.3.rst
@@ -14,6 +14,7 @@ Changelog
---------
- |Fix| Fixes a bug in consistency of train/test samples when ``random_state`` is not set in FeatureImportanceForestClassifier and FeatureImportanceForestRegressor, by `Adam Li`_ (:pr:`135`)
- |Fix| Fixes a bug where covariate indices were not shuffled by default when running FeatureImportanceForestClassifier and FeatureImportanceForestRegressor test methods, by `Sambit Panda`_ (:pr:`140`)
- |Enhancement| Add multi-view splitter for axis-aligned decision trees, by `Adam Li`_ (:pr:`129`)

Code and Documentation Contributors
-----------------------------------
2 changes: 1 addition & 1 deletion examples/README.txt
@@ -1,4 +1,4 @@
Examples
--------
========

Examples demonstrating how to use scikit-tree algorithms.
6 changes: 6 additions & 0 deletions examples/hypothesis_testing/README.txt
@@ -0,0 +1,6 @@
.. _hyppo_examples:

Hypothesis testing with decision trees
--------------------------------------

Examples demonstrating how to use decision-trees for statistical hypothesis testing.
File renamed without changes.
6 changes: 6 additions & 0 deletions examples/outlier_detection/README.txt
@@ -0,0 +1,6 @@
.. _outlier_examples:

Outlier-detection
-----------------

Examples concerning how to do outlier detection with decision trees.
File renamed without changes.
4 changes: 4 additions & 0 deletions examples/plot_extra_oblique_random_forest.py
@@ -58,6 +58,9 @@
are not sorted and the split is determined by randomly drawing a threshold from the feature's
range, hence the complexity is `O(n)`. This makes the algorithm more suitable for large datasets.
To see how sample sizes affect the performance of Extra Oblique Trees vs. regular Oblique Trees,
see :ref:`sphx_glr_auto_examples_plot_extra_orf_sample_size.py`.

References
----------
.. [1] P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized trees", Machine Learning, 63(1),
@@ -149,6 +152,7 @@ def get_scores(X, y, d_name, n_cv=5, n_repeats=1, **kwargs):
"random_state": random_state,
"n_cv": 10,
"n_repeats": 1,
"n_jobs": -1,
}

for data_id in data_ids:
1 change: 1 addition & 0 deletions examples/plot_extra_orf_sample_size.py
@@ -122,6 +122,7 @@ def get_scores(X, y, d_name, n_cv=5, n_repeats=1, **kwargs):
"random_state": random_state,
"n_cv": 10,
"n_repeats": 1,
"n_jobs": -1,
}

for data_id in data_ids:
3 changes: 3 additions & 0 deletions examples/plot_oblique_random_forest.py
@@ -16,6 +16,9 @@
will notice that, of these three datasets, the oblique forest, which utilizes a sparse
random projection mechanism, outperforms the axis-aligned random forest on cnae-9. All
datasets are subsampled due to computational constraints.

For an example of using extra-oblique trees and forests in practice, see
:ref:`sphx_glr_auto_examples_plot_extra_oblique_random_forest.py`.
"""

from datetime import datetime
6 changes: 6 additions & 0 deletions examples/splitters/README.txt
@@ -0,0 +1,6 @@
.. _splitter_examples:

Decision-tree splitters
-----------------------

Examples demonstrating different node-splitting strategies for decision trees.
124 changes: 124 additions & 0 deletions examples/splitters/plot_multiview_axis_aligned_splitter.py
@@ -0,0 +1,124 @@
"""
=================================================================================
Demonstrate and visualize a multi-view projection matrix for an axis-aligned tree
=================================================================================

This example shows how multi-view projection matrices are generated for a decision tree,
specifically the :class:`sktree.tree.MultiViewDecisionTreeClassifier`.

Multi-view projection matrices operate under the assumption that the input ``X`` array
consists of multiple feature sets, i.e. groups of features that are each informative for
predicting ``y``.

For details on the hyperparameters related to the multi-view setting, see
:class:`sktree.tree.MultiViewDecisionTreeClassifier`.
"""

# import modules
# .. note:: We use a private Cython module here to demonstrate what the sampled
#    projection matrices look like. This is not part of the public API. The module
#    used is a Python wrapper around the underlying Cython code and is not the
#    same as the Cython splitter used in the actual implementation.
#    To use the actual splitter, use the public API for the
#    relevant tree/forest class.

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.cm import ScalarMappable
from matplotlib.colors import ListedColormap

from sktree._lib.sklearn.tree._criterion import Gini
from sktree.tree._oblique_splitter import MultiViewSplitterTester

criterion = Gini(1, np.array((0, 1)))
max_features = 5
min_samples_leaf = 1
min_weight_leaf = 0.0
random_state = np.random.RandomState(10)

# we "simulate" three feature sets, with 3, 2 and 4 features respectively
feature_set_ends = np.array([3, 5, 9], dtype=np.intp)
n_feature_sets = len(feature_set_ends)

feature_combinations = 1
monotonic_cst = None
missing_value_feature_mask = None

# initialize some dummy data
X = np.repeat(np.arange(feature_set_ends[-1]).astype(np.float32), 5).reshape(5, -1)
y = np.array([0, 0, 0, 1, 1]).reshape(-1, 1).astype(np.float64)
sample_weight = np.ones(5)

print("The shape of our dataset is: ", X.shape, y.shape, sample_weight.shape)

# %%
# Initialize the multi-view splitter
# ----------------------------------
# The multi-view splitter is a Cython class that is initialized internally
# in scikit-tree. However, we expose a Python tester object to demonstrate
# how the splitter works in practice.
#
# .. warning:: Do not use this interface directly in practice.

splitter = MultiViewSplitterTester(
criterion,
max_features,
min_samples_leaf,
min_weight_leaf,
random_state,
monotonic_cst,
feature_combinations,
feature_set_ends,
n_feature_sets,
)
splitter.init_test(X, y, sample_weight, missing_value_feature_mask)

# %%
# Sample the projection matrix
# ----------------------------
# The projection matrix is sampled by the splitter. It is a
# (max_features, n_features) matrix that selects which features of ``X``
# define the candidate split dimensions. The multi-view
# splitter's projection matrix, however, samples from multiple feature sets,
# which are laid out contiguously over the columns of ``X``.

projection_matrix = splitter.sample_projection_matrix_py()
print(projection_matrix)

cmap = ListedColormap(["white", "green"][:n_feature_sets])

# Create a heatmap to visualize the indices
fig, ax = plt.subplots(figsize=(6, 6))

ax.imshow(
projection_matrix, cmap=cmap, aspect=feature_set_ends[-1] / max_features, interpolation="none"
)
ax.axvline(feature_set_ends[0] - 0.5, color="black", linewidth=1, label="Feature Sets")
for iend in feature_set_ends[1:]:
ax.axvline(iend - 0.5, color="black", linewidth=1)

ax.set(title="Sampled Projection Matrix", xlabel="Feature Index", ylabel="Projection Vector Index")
ax.set_xticks(np.arange(feature_set_ends[-1]))
ax.set_yticks(np.arange(max_features))
ax.set_yticklabels(np.arange(max_features, dtype=int) + 1)
ax.set_xticklabels(np.arange(feature_set_ends[-1], dtype=int) + 1)
ax.legend()

# Create a mappable object
sm = ScalarMappable(cmap=cmap)
sm.set_array([]) # You can set an empty array or values here

# Create a color bar with labels for each feature set
colorbar = fig.colorbar(sm, ax=ax, ticks=[0.25, 0.75], format="%d")
colorbar.set_label("Projection Weight (I.e. Sampled Feature From a Feature Set)")
colorbar.ax.set_yticklabels(["0", "1"])

plt.show()

# %%
# Discussion
# ----------
# As we can see, the multi-view splitter samples split candidates uniformly across the feature sets.
# In contrast, the normal splitter in :class:`sklearn.tree.DecisionTreeClassifier` samples
# randomly across all ``n_features`` features because it is not aware of the multi-view structure.
# This is the key difference between the two splitters.
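
As a small follow-up to the discussion (not part of the committed example), the sketch below maps each sampled projection vector back to the feature set it draws from, assuming each axis-aligned projection vector has a single nonzero column, as in the example above.

# Follow-up sketch (not part of the commit): report which feature set each
# sampled projection vector draws from. Reuses `projection_matrix` and
# `feature_set_ends` from the example above; assumes one nonzero column per row.
import numpy as np

rows, cols = np.nonzero(projection_matrix)
set_ids = np.searchsorted(feature_set_ends, cols, side="right")
for row, col, set_id in zip(rows, cols, set_ids):
    print(f"projection vector {row + 1}: feature {col + 1} (feature set {set_id + 1})")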