Merge pull request #4 from JacksonBurns/b2_patches
Beta release 3 (1.0.0b3) - this release fixes an issue with feature scaling that led to some data leakage (model results should not be appreciably different, but let's dot our i's and cross our t's) and includes a complete draft of the corresponding paper.
JacksonBurns authored Feb 28, 2024
2 parents 24341f7 + c3b7bfc commit 64b7775
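
For context on the scaling fix: the leakage happens when scaling statistics are computed over the full dataset before splitting, so the test split influences the training inputs. A minimal sketch of the leak-free pattern (illustrative only; the function names and the choice of standardization are assumptions, not `fastprop`'s actual code):

```python
import numpy as np

def fit_scaler(X_train: np.ndarray):
    """Compute scaling statistics on the training split ONLY."""
    mean = np.nanmean(X_train, axis=0)
    std = np.nanstd(X_train, axis=0)
    std[std == 0.0] = 1.0  # guard constant features against divide-by-zero
    return mean, std

def apply_scaler(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Scale any split with the training-split statistics."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train, X_test = rng.random((80, 5)), rng.random((20, 5))
mean, std = fit_scaler(X_train)               # statistics never see the test set
X_train_s = apply_scaler(X_train, mean, std)
X_test_s = apply_scaler(X_test, mean, std)    # reuse the fitted statistics, never refit
```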
Showing 41 changed files with 1,382 additions and 1,510 deletions.
13 changes: 5 additions & 8 deletions .github/workflows/CI.yml
@@ -69,17 +69,14 @@ jobs:
           coverage report -m
   ci-report-status:
     if: ${{ always() }}
     name: report CI status
-    needs: build-and-test
+    needs: [build-and-test]
     runs-on: ubuntu-latest
     steps:
-      - run: |
-          result="${{ needs.build-and-test.result }}"
-          if [[ $result == "success" ]] ; then
-            exit 0
-          else
-            exit 1
-          fi
+      - run: exit 1
+        # see https://stackoverflow.com/a/67532120/4907315
+        if: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') || contains(needs.*.result, 'skipped') }}
+
   check-for-new-release:
     runs-on: ubuntu-latest
3 changes: 0 additions & 3 deletions .gitignore
@@ -28,9 +28,6 @@ benchmarks/ocelot/174705_1_supplements.zip
 benchmarks/ocelot/ocelot_chromophore_v1.zip
 benchmarks/ocelot/ocelot_chromophore_v1/readme.txt
 
-# paper build files
-paper/paper.pdf
-
 # todo's
 benchmarks/tox21
 benchmarks/ace
59 changes: 11 additions & 48 deletions README.md
@@ -11,7 +11,7 @@
 </p>
 
 # Announcement - Open Beta!
-`fastprop` is currently in the version 1 open beta!
+`fastprop` is currently in the version 3 open beta (1.0.0b3)!
 Please try `fastprop` on your datasets and let us know what you think.
 Feature requests and bug reports are **very** appreciated!
 
@@ -74,12 +74,14 @@ There are four distinct steps in `fastprop` that define its framework:
    _or_
    - Load precomputed descriptors: filepath to where descriptors are already cached either manually or by `fastprop`
 2. Preprocessing
-   - Enable/Disable re-scaling of parameters between 0 and 1 (enabled by default and _highly_ recommended)
    - Enable/Disable dropping of zero-variance parameters (disabled by default; faster, but often less accurate)
+
    ~~- Enable/Disable dropping of co-linear descriptors (disabled by default; faster, decreased accuracy)~~ _WIP_
-   - _not configurable_: `fastprop` will always drop columns with no values and impute missing values with the mean per-column
+   - _not configurable_: `fastprop` will always rescale input features, drop columns with no values, and impute missing values with the per-feature mean
 3. Training
    - Number of Repeats: How many times to split/train/test on the dataset (increments random seed by 1 each time).
+
+   _and_
    - Number of FNN layers (default 2; repeated fully connected layers of hidden size)
    - Hidden Size: number of neurons per FNN layer (default 1800)
 
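The _not configurable_ preprocessing line added above (always rescale, drop all-empty columns, impute with the per-feature mean) amounts to roughly the following sketch, assuming a plain NumPy feature matrix with NaNs marking missing descriptors (not `fastprop`'s actual implementation):

```python
import numpy as np

def preprocess(X: np.ndarray) -> np.ndarray:
    X = X[:, ~np.all(np.isnan(X), axis=0)]  # drop columns with no values at all
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]          # impute missing values with the per-feature mean
    std = X.std(axis=0)
    std[std == 0.0] = 1.0                   # guard constant features
    return (X - X.mean(axis=0)) / std       # rescale every input feature

X = np.array([[1.0, np.nan, np.nan], [2.0, 4.0, np.nan]])
print(preprocess(X))  # the all-NaN third column is gone, the rest scaled
```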
@@ -90,8 +92,9 @@ There are four distinct steps in `fastprop` that define its framework:
    - Output Directory
    - Learning rate
    - Batch size
+
    ~~- Checkpoint file to resume from (optional)~~ _WIP_
-   - Problem type (one of: regression, binary, multiclass, multilabel)
+   - Problem type (one of: regression, binary, multiclass (start labels from 0), multilabel)
 4. Prediction
    - Input SMILES: either a single SMILES or a CSV file
    - Output format: filepath to write the results or nothing, defaults to stdout
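
On the "start labels from 0" note added above: multiclass targets are expected as consecutive integers beginning at 0, so a dataset labeled 1 through K needs remapping first. A hypothetical snippet with pandas (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"smiles": ["C", "CC", "CCC"], "activity": [1, 2, 3]})
# factorize maps arbitrary class labels onto 0..K-1 before writing the CSV
df["activity"], _ = pd.factorize(df["activity"], sort=True)  # now 0, 1, 2
```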
@@ -135,7 +138,7 @@ To use the core `fastprop` model and dataloaders in your own work, consider look
 
 ### `fastprop`
 - `defaults`: contains the function `init_logger` used to initialize loggers in different submodules, as well as the default configuration for training.
-- `fastprop_core`: the model itself, data PyTorch Lightning dataloader, and convenience functions.
+- `fastprop_core`: the model itself and convenience functions.
 - `hopt`: hyperparameter optimization using Optuna and Ray\[tune\], used by the CLI.
 - `train`: performs model training, used by the CLI.
 - `predict`: loads models from their checkpoint and config files and runs inference, used by the CLI.
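
A hypothetical sketch of reusing these submodules in your own script; only the existence of `init_logger` is stated above, so its call signature here is an assumption:

```python
from fastprop.defaults import init_logger

# assumed: init_logger takes a module name and returns a stdlib-style logger
logger = init_logger(__name__)
logger.info("ready to reuse the model from fastprop_core")
```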
@@ -155,52 +158,12 @@ To use the core `fastprop` model and dataloaders in your own work, consider look
 If you wish to extend the CLI, check the inline documentation there.
 
 # Benchmarks
-The `benchmarks` directory contains the scripts needed to perform the studies (see `benchmarks/README.md` for more detail, they are a great way to learn how to use `fastprop`) as well as the actual results, which are also summarized here.
-
-See the `benchmarks` or the `paper` for additional details for each benchmark, including a better description of what the 'literature best' is as well as more information about the reported performance metric.
-
-## Regression
-
-| Benchmark | Number Samples (k) | Metric | Literature Best | `fastprop` | Chemprop | Speedup |
-|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| QM9 | ~130 | L1 | 0.0047 $^a$ | 0.0063 | 0.0081 $^a$ | ~ |
-| OCELOTv1 | ~25 | GEOMEAN(L1) | 0.128 $^b$ | 0.148 | 0.140 $^b$ | ~ |
-| QM8 | ~22 | L1 | 0.016 $^a$ | 0.016 | 0.019 $^a$ | ~ |
-| ESOL | ~1.1 | L2 | 0.55 $^c$ | 0.57 | 0.67 $^c$ | ~ |
-| FreeSolv | ~0.6 | L2 | 1.29 $^d$ | 1.06 | 1.37 $^d$ | ~ |
-| Flash | ~0.6 | MAPE/RMSE | 2.5/13.2 $^e$ | 2.7/13.5 | ~/21.2 $^x$ | 5m43s/1m20s |
-| YSI | ~0.4 | MdAE/MAE | 2.9~28.6 $^f$ | 8.3/20.2 | ~/21.8 $^x$ | 4m3s/2m15s |
-| HOPV15 Subset | ~0.3 | L1 | 1.32 $^g$ | 1.44 | WIP | WIP |
-| Fubrain | ~0.3 | L2 | 0.44 $^h$ | 0.19 | 0.22 $^x$ | 5m11s/54s |
-| PAH | ~0.06 | R2 | 0.99 $^i$ | 0.96 | 0.75 $^x$ | 36s/2m12s |
-
-## Classification
-
-| Benchmark | Number Samples (k) | Metric | Literature Best | `fastprop` | Chemprop | Speedup |
-|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| HIV (binary) | ~41 | AUROC | 0.81 $^a$ | 0.81 | 0.77 $^a$ | ~ |
-| HIV (ternary) | ~41 | AUROC | ~ | 0.83 | WIP | ~ |
-| QuantumScents | ~3.5 | AUROC | 0.88 $^j$ | 0.91 | 0.85 $^j$ | ~ |
-| SIDER | ~1.4 | AUROC | 0.67 $^c$ | 0.66 | 0.57 $^c$ | ~ |
-| Pgp | ~1.3 | AUROC | WIP | 0.93 | WIP | ~ |
-| ARA | ~0.8 | Acc./AUROC | 0.91/0.95 $^k$ | 0.88/0.95 | 0.82/0.90 $^x$ | 16m54s/2m7s |
-
-### References
-- a: UniMol (10.26434/chemrxiv-2022-jjm0j-v4)
-- b: MHNN (10.48550/arXiv.2312.13136)
-- c: CMPNN (10.5555/3491440.3491832)
-- d: DeepDelta (10.1186/s13321-023-00769-x)
-- e: Saldana et al. (10.1021/ef200795j)
-- f: Das et al. (10.1016/j.combustflame.2017.12.005)
-- g: Eibeck et al. (10.1021/acsomega.1c02156)
-- h: Esaki et al. (10.1021/acs.jcim.9b00180)
-- i: Arockiaraj et al. (10.1080/1062936X.2023.2239149)
-- j: Burns et al. (10.1021/acs.jcim.3c01338)
-- k: DeepAR (10.1186/s13321-023-00721-z)
-- x: Run in this repository, see `benchmarks`.
+The `benchmarks` directory contains the scripts needed to perform the studies (see `benchmarks/README.md` for more detail, they are a great way to learn how to use `fastprop`).
+To just see the results, check out [`paper/paper.pdf`](https://github.com/JacksonBurns/fastprop/blob/main/paper/paper.pdf) (or `paper/paper.md` for the plain text version).
 
 # Developing `fastprop`
 Bug reports, feature requests, and pull requests are welcome and encouraged!
 Follow [this tutorial from GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/contributing-to-a-project) to get started.
 
 `fastprop` is built around PyTorch Lightning, which defines a rigid API for implementing models that is followed here.
 See the [section on the package layout](#python-module) for information on where all the other functions are, and check out the docstrings and inline comments in each file for more information on what each does.
6 changes: 1 addition & 5 deletions benchmarks/ara/ara.yml
@@ -17,11 +17,6 @@
 output_directory: ara
 random_seed: 1989
 problem_type: binary
-# run hyperparameter optimization
-# optimize: True
-# optimized results
-# fnn_layers: 2
-# hidden_size: 256
 
 # featurization
 input_file: ara/benchmark_data.csv
@@ -30,6 +25,7 @@ smiles_column: Smiles
 descriptors: all
 
 # training
+hidden_size: 2300
 number_epochs: 40
 batch_size: 1024
 patience: 3
13 changes: 12 additions & 1 deletion benchmarks/delta_fubrain/delta_fubrain.yml
@@ -45,16 +45,27 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
 # training
+
+# re-use architecture from regular fubrain model
+# number_repeats: 4
+# number_epochs: 20
+# batch_size: 256
+# patience: 2
+# train_size: 0.64
+# val_size: 0.07
+# test_size: 0.29
+
+# match the deepdelta paper
+number_repeats: 10
 number_epochs: 15
 batch_size: 256
 patience: 20
 train_size: 0.90
 val_size: 0.01
 test_size: 0.10
 
 sampler: random
15 changes: 7 additions & 8 deletions benchmarks/esol/esol.yml
@@ -15,7 +15,7 @@
 #
 # Additional Comments:
 # https://dl.acm.org/doi/10.5555/3491440.3491832 achieved 0.55 RMSE
-# OOB fastprop gets 0.6 RMSE, with modest optimization achieves 0.56
+# OOB fastprop gets 0.66 RMSE, with modest optimization achieves 0.60
 
 
 # generic args
@@ -25,8 +25,8 @@ problem_type: regression
 # run hyperparameter optimization
 # optimize: True
 # optimized results
-fnn_layers: 4
-hidden_size: 3000
+hidden_size: 1000
+fnn_layers: 2
 
 # featurization
 input_file: esol/benchmark_data.csv
@@ -36,15 +36,14 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
 # training
-number_repeats: 3
+number_repeats: 8
 number_epochs: 200
 patience: 10
-train_size: 0.8
-val_size: 0.1
-test_size: 0.1
+train_size: 0.6
+val_size: 0.2
+test_size: 0.2
 sampler: random
1 change: 0 additions & 1 deletion benchmarks/flash/flash.yml
@@ -26,7 +26,6 @@ smiles_column: smiles
 descriptors: all
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
9 changes: 5 additions & 4 deletions benchmarks/freesolv/freesolv.yml
@@ -21,14 +21,15 @@ descriptors: all
 
 # optimize: True
 # results after optimization
-fnn_layers: 5
 hidden_size: 100
+fnn_layers: 5
 
 # training
 number_epochs: 200
 patience: 3
+number_repeats: 8
 random_seed: 1701
-train_size: 0.8
-val_size: 0.1
-test_size: 0.1
+train_size: 0.6
+val_size: 0.2
+test_size: 0.2
 sampler: random
7 changes: 1 addition & 6 deletions benchmarks/fubrain/fubrain.yml
@@ -3,13 +3,9 @@
 #
 # Download the data from ACS:
 # https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.9b00180/suppl_file/ci9b00180_si_002.xlsx
-# Then uncompress it, renaming HOPV_15_revised_2_processed_homo_5fold.csv
-# to benchmark_data.csv
+# and convert it to benchmark_data.csv
 #
 # Additional Comments:
-# Original study achieved an accuracy of 0.44 RMSE. DeepDelta 0.830 ± 0.023 RMSE so we will compare
-# to the original study instead.
-#
 # Original used the two external sets as test, 46 + 25 = 73 total in test set.
 # The overall data is 253 points, leaving 180 for training/validation.
 # With 10-fold cross validation this equates to 162 (90% of 180) in training
@@ -29,7 +25,6 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
6 changes: 1 addition & 5 deletions benchmarks/hiv/hiv.yml
@@ -29,15 +29,11 @@ target_columns: ternary_activity
 # generic args
 output_directory: hiv
 random_seed: 765309408
-# run hyperparameter optimization
-# optimize: True
-# optimized results
-# fnn_layers: 2
-# hidden_size: 2400
 
 # featurization
 input_file: hiv/benchmark_data.csv
 smiles_column: smiles
 descriptors: all
+precomputed: hiv/precomputed.csv
 
 # training
18 changes: 18 additions & 0 deletions benchmarks/hopv15_subset/chemprop_hopv15_subset.sh
@@ -0,0 +1,18 @@
+COUNTER=1
+
+while [ $COUNTER -le 15 ];
+do
+    chemprop_train \
+        --data_path benchmark_data.csv \
+        --smiles_columns smiles \
+        --target_columns pce \
+        --dataset_type regression \
+        --save_dir chemprop_hopv15_subset_${COUNTER}_logs \
+        --epochs 100 \
+        --split_sizes 0.6 0.2 0.2 \
+        --metric mae \
+        --extra_metrics rmse \
+        --batch_size 1024 \
+        --seed $COUNTER
+    COUNTER=$((COUNTER+1))
+done
7 changes: 4 additions & 3 deletions benchmarks/hopv15_subset/hopv15_subset.yml
@@ -32,13 +32,14 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
 # training
-number_repeats: 5
-number_epochs: 10
+hidden_size: 1100
+fnn_layers: 2
+number_repeats: 15
+number_epochs: 20
 patience: 2
 train_size: 0.6
 val_size: 0.2
1 change: 0 additions & 1 deletion benchmarks/ocelot/ocelot.yml
@@ -29,7 +29,6 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 