Merge pull request #4 from JacksonBurns/b2_patches
Beta release 3 (1.0.0b3) - this release fixes an issue with feature scaling that led to some data leakage (model results should not be appreciably different, but let's dot our i's and cross our t's) and includes a complete draft of the corresponding paper.
JacksonBurns authored Feb 28, 2024
2 parents 24341f7 + c3b7bfc commit 64b7775
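
For context on the scaling fix: the leakage happens when scaling statistics are computed over the full dataset before splitting, so the test split influences the training inputs. A minimal sketch of the leak-free pattern (illustrative only; the function names and the choice of standardization are assumptions, not `fastprop`'s actual code):

```python
import numpy as np

def fit_scaler(X_train: np.ndarray):
    """Compute scaling statistics on the training split ONLY."""
    mean = np.nanmean(X_train, axis=0)
    std = np.nanstd(X_train, axis=0)
    std[std == 0.0] = 1.0  # guard constant features against divide-by-zero
    return mean, std

def apply_scaler(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Scale any split with the training-split statistics."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train, X_test = rng.random((80, 5)), rng.random((20, 5))
mean, std = fit_scaler(X_train)               # statistics never see the test set
X_train_s = apply_scaler(X_train, mean, std)
X_test_s = apply_scaler(X_test, mean, std)    # reuse the fitted statistics, never refit
```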
Showing 41 changed files with 1,382 additions and 1,510 deletions.
13 changes: 5 additions & 8 deletions .github/workflows/CI.yml
@@ -69,17 +69,14 @@ jobs:
           coverage report -m
   ci-report-status:
     if: ${{ always() }}
     name: report CI status
-    needs: build-and-test
+    needs: [build-and-test]
     runs-on: ubuntu-latest
     steps:
-      - run: |
-          result="${{ needs.build-and-test.result }}"
-          if [[ $result == "success" ]] ; then
-            exit 0
-          else
-            exit 1
-          fi
+      - run: exit 1
+        # see https://stackoverflow.com/a/67532120/4907315
+        if: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') || contains(needs.*.result, 'skipped') }}
+
   check-for-new-release:
     runs-on: ubuntu-latest
3 changes: 0 additions & 3 deletions .gitignore
@@ -28,9 +28,6 @@ benchmarks/ocelot/174705_1_supplements.zip
 benchmarks/ocelot/ocelot_chromophore_v1.zip
 benchmarks/ocelot/ocelot_chromophore_v1/readme.txt
 
-# paper build files
-paper/paper.pdf
-
 # todo's
 benchmarks/tox21
 benchmarks/ace
59 changes: 11 additions & 48 deletions README.md
@@ -11,7 +11,7 @@
 </p>
 
 # Announcement - Open Beta!
-`fastprop` is currently in the version 1 open beta!
+`fastprop` is currently in the version 3 open beta (1.0.0b3)!
 Please try `fastprop` on your datasets and let us know what you think.
 Feature requests and bug reports are **very** appreciated!
 
@@ -74,12 +74,14 @@ There are four distinct steps in `fastprop` that define its framework:
    _or_
    - Load precomputed descriptors: filepath to where descriptors are already cached either manually or by `fastprop`
 2. Preprocessing
-   - Enable/Disable re-scaling of parameters between 0 and 1 (enabled by default and _highly_ recommended)
    - Enable/Disable dropping of zero-variance parameters (disabled by default; faster, but often less accurate)
+
    ~~- Enable/Disable dropping of co-linear descriptors (disabled by default; faster, decreased accuracy)~~ _WIP_
-   - _not configurable_: `fastprop` will always drop columns with no values and impute missing values with the mean per-column
+   - _not configurable_: `fastprop` will always rescale input features, drop columns with no values, and impute missing values with the per-feature mean
 3. Training
    - Number of Repeats: How many times to split/train/test on the dataset (increments random seed by 1 each time).
+
+   _and_
    - Number of FNN layers (default 2; repeated fully connected layers of hidden size)
    - Hidden Size: number of neurons per FNN layer (default 1800)
 
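The _not configurable_ preprocessing line added above (always rescale, drop all-empty columns, impute with the per-feature mean) amounts to roughly the following sketch, assuming a plain NumPy feature matrix with NaNs marking missing descriptors (not `fastprop`'s actual implementation):

```python
import numpy as np

def preprocess(X: np.ndarray) -> np.ndarray:
    X = X[:, ~np.all(np.isnan(X), axis=0)]  # drop columns with no values at all
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]          # impute missing values with the per-feature mean
    std = X.std(axis=0)
    std[std == 0.0] = 1.0                   # guard constant features
    return (X - X.mean(axis=0)) / std       # rescale every input feature

X = np.array([[1.0, np.nan, np.nan], [2.0, 4.0, np.nan]])
print(preprocess(X))  # the all-NaN third column is gone, the rest scaled
```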
@@ -90,8 +92,9 @@ There are four distinct steps in `fastprop` that define its framework:
    - Output Directory
    - Learning rate
    - Batch size
+
    ~~- Checkpoint file to resume from (optional)~~ _WIP_
-   - Problem type (one of: regression, binary, multiclass, multilabel)
+   - Problem type (one of: regression, binary, multiclass (start labels from 0), multilabel)
 4. Prediction
    - Input SMILES: either a single SMILES or a CSV file
    - Output format: filepath to write the results or nothing, defaults to stdout
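
On the "start labels from 0" note added above: multiclass targets are expected as consecutive integers beginning at 0, so a dataset labeled 1 through K needs remapping first. A hypothetical snippet with pandas (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"smiles": ["C", "CC", "CCC"], "activity": [1, 2, 3]})
# factorize maps arbitrary class labels onto 0..K-1 before writing the CSV
df["activity"], _ = pd.factorize(df["activity"], sort=True)  # now 0, 1, 2
```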
@@ -135,7 +138,7 @@ To use the core `fastprop` model and dataloaders in your own work, consider look
 
 ### `fastprop`
 - `defaults`: contains the function `init_logger` used to initialize loggers in different submodules, as well as the default configuration for training.
-- `fastprop_core`: the model itself, data PyTorch Lightning dataloader, and convenience functions.
+- `fastprop_core`: the model itself and convenience functions.
 - `hopt`: hyperparameter optimization using Optuna and Ray\[tune\], used by the CLI.
 - `train`: performs model training, used by the CLI.
 - `predict`: loads models from their checkpoint and config files and runs inference, used by the CLI.
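
A hypothetical sketch of reusing these submodules in your own script; only the existence of `init_logger` is stated above, so its call signature here is an assumption:

```python
from fastprop.defaults import init_logger

# assumed: init_logger takes a module name and returns a stdlib-style logger
logger = init_logger(__name__)
logger.info("ready to reuse the model from fastprop_core")
```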
@@ -155,52 +158,12 @@ To use the core `fastprop` model and dataloaders in your own work, consider look
 If you wish to extend the CLI, check the inline documentation there.
 
 # Benchmarks
-The `benchmarks` directory contains the scripts needed to perform the studies (see `benchmarks/README.md` for more detail, they are a great way to learn how to use `fastprop`) as well as the actual results, which are also summarized here.
-
-See the `benchmarks` or the `paper` for additional details for each benchmark, including a better description of what the 'literature best' is as well as more information about the reported performance metric.
-
-## Regression
-
-| Benchmark | Number Samples (k) | Metric | Literature Best | `fastprop` | Chemprop | Speedup |
-|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| QM9 | ~130 | L1 | 0.0047 $^a$ | 0.0063 | 0.0081 $^a$ | ~ |
-| OCELOTv1 | ~25 | GEOMEAN(L1) | 0.128 $^b$ | 0.148 | 0.140 $^b$ | ~ |
-| QM8 | ~22 | L1 | 0.016 $^a$ | 0.016 | 0.019 $^a$ | ~ |
-| ESOL | ~1.1 | L2 | 0.55 $^c$ | 0.57 | 0.67 $^c$ | ~ |
-| FreeSolv | ~0.6 | L2 | 1.29 $^d$ | 1.06 | 1.37 $^d$ | ~ |
-| Flash | ~0.6 | MAPE/RMSE | 2.5/13.2 $^e$ | 2.7/13.5 | ~/21.2 $^x$ | 5m43s/1m20s |
-| YSI | ~0.4 | MdAE/MAE | 2.9~28.6 $^f$ | 8.3/20.2 | ~/21.8 $^x$ | 4m3s/2m15s |
-| HOPV15 Subset | ~0.3 | L1 | 1.32 $^g$ | 1.44 | WIP | WIP |
-| Fubrain | ~0.3 | L2 | 0.44 $^h$ | 0.19 | 0.22 $^x$ | 5m11s/54s |
-| PAH | ~0.06 | R2 | 0.99 $^i$ | 0.96 | 0.75 $^x$ | 36s/2m12s |
-
-## Classification
-
-| Benchmark | Number Samples (k) | Metric | Literature Best | `fastprop` | Chemprop | Speedup |
-|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| HIV (binary) | ~41 | AUROC | 0.81 $^a$ | 0.81 | 0.77 $^a$ | ~ |
-| HIV (ternary) | ~41 | AUROC | ~ | 0.83 | WIP | ~ |
-| QuantumScents | ~3.5 | AUROC | 0.88 $^j$ | 0.91 | 0.85 $^j$ | ~ |
-| SIDER | ~1.4 | AUROC | 0.67 $^c$ | 0.66 | 0.57 $^c$ | ~ |
-| Pgp | ~1.3 | AUROC | WIP | 0.93 | WIP | ~ |
-| ARA | ~0.8 | Acc./AUROC | 0.91/0.95 $^k$ | 0.88/0.95 | 0.82/0.90 $^x$ | 16m54s/2m7s |
-
-### References
-- a: UniMol (10.26434/chemrxiv-2022-jjm0j-v4)
-- b: MHNN (10.48550/arXiv.2312.13136)
-- c: CMPNN (10.5555/3491440.3491832)
-- d: DeepDelta (10.1186/s13321-023-00769-x)
-- e: Saldana et al. (10.1021/ef200795j)
-- f: Das et al. (10.1016/j.combustflame.2017.12.005)
-- g: Eibeck et al. (10.1021/acsomega.1c02156)
-- h: Esaki et al. (10.1021/acs.jcim.9b00180)
-- i: Arockiaraj et al. (10.1080/1062936X.2023.2239149)
-- j: Burns et al. (10.1021/acs.jcim.3c01338)
-- k: DeepAR (10.1186/s13321-023-00721-z)
-- x: Run in this repository, see `benchmarks`.
+The `benchmarks` directory contains the scripts needed to perform the studies (see `benchmarks/README.md` for more detail, they are a great way to learn how to use `fastprop`).
+To just see the results, check out [`paper/paper.pdf`](https://github.com/JacksonBurns/fastprop/blob/main/paper/paper.pdf) (or `paper/paper.md` for the plain text version).
 
 # Developing `fastprop`
 Bug reports, feature requests, and pull requests are welcome and encouraged!
 Follow [this tutorial from GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/contributing-to-a-project) to get started.
 
 `fastprop` is built around PyTorch Lightning, which defines a rigid API for implementing models that is followed here.
 See the [section on the package layout](#python-module) for information on where all the other functions are, and check out the docstrings and inline comments in each file for more information on what each does.
6 changes: 1 addition & 5 deletions benchmarks/ara/ara.yml
@@ -17,11 +17,6 @@
 output_directory: ara
 random_seed: 1989
 problem_type: binary
-# run hyperparameter optimization
-# optimize: True
-# optimized results
-# fnn_layers: 2
-# hidden_size: 256
 
 # featurization
 input_file: ara/benchmark_data.csv
@@ -30,6 +25,7 @@ smiles_column: Smiles
 descriptors: all
 
 # training
+hidden_size: 2300
 number_epochs: 40
 batch_size: 1024
 patience: 3
13 changes: 12 additions & 1 deletion benchmarks/delta_fubrain/delta_fubrain.yml
@@ -45,16 +45,27 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
 # training
+
+# re-use architecture from regular fubrain model
+# number_repeats: 4
+# number_epochs: 20
+# batch_size: 256
+# patience: 2
+# train_size: 0.64
+# val_size: 0.07
+# test_size: 0.29
+
+# match the deepdelta paper
+number_repeats: 10
 number_epochs: 15
 batch_size: 256
 patience: 20
 train_size: 0.90
 val_size: 0.01
 test_size: 0.10
 
 sampler: random
15 changes: 7 additions & 8 deletions benchmarks/esol/esol.yml
@@ -15,7 +15,7 @@
 #
 # Additional Comments:
 # https://dl.acm.org/doi/10.5555/3491440.3491832 achieved 0.55 RMSE
-# OOB fastprop gets 0.6 RMSE, with modest optimization achieves 0.56
+# OOB fastprop gets 0.66 RMSE, with modest optimization achieves 0.60
 
 
 # generic args
@@ -25,8 +25,8 @@ problem_type: regression
 # run hyperparameter optimization
 # optimize: True
 # optimized results
-fnn_layers: 4
-hidden_size: 3000
+hidden_size: 1000
+fnn_layers: 2
 
 # featurization
 input_file: esol/benchmark_data.csv
@@ -36,15 +36,14 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
 # training
-number_repeats: 3
+number_repeats: 8
 number_epochs: 200
 patience: 10
-train_size: 0.8
-val_size: 0.1
-test_size: 0.1
+train_size: 0.6
+val_size: 0.2
+test_size: 0.2
 sampler: random
1 change: 0 additions & 1 deletion benchmarks/flash/flash.yml
@@ -26,7 +26,6 @@ smiles_column: smiles
 descriptors: all
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
9 changes: 5 additions & 4 deletions benchmarks/freesolv/freesolv.yml
@@ -21,14 +21,15 @@ descriptors: all
 
 # optimize: True
 # results after optimization
-fnn_layers: 5
 hidden_size: 100
+fnn_layers: 5
 
 # training
 number_epochs: 200
 patience: 3
+number_repeats: 8
 random_seed: 1701
-train_size: 0.8
-val_size: 0.1
-test_size: 0.1
+train_size: 0.6
+val_size: 0.2
+test_size: 0.2
 sampler: random
7 changes: 1 addition & 6 deletions benchmarks/fubrain/fubrain.yml
@@ -3,13 +3,9 @@
 #
 # Download the data from ACS:
 # https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.9b00180/suppl_file/ci9b00180_si_002.xlsx
-# Then uncompress it, renaming HOPV_15_revised_2_processed_homo_5fold.csv
-# to benchmark_data.csv
+# and convert it to benchmark_data.csv
 #
 # Additional Comments:
-# Original study achieved an accuracy of 0.44 RMSE. DeepDelta 0.830 ± 0.023 RMSE so we will compare
-# to the original study instead.
-#
 # Original used the two external sets as test, 46 + 25 = 73 total in test set.
 # The overall data is 253 points, leaving 180 for training/validation.
 # With 10-fold cross validation this equates to 162 (90% of 180) in training
@@ -29,7 +25,6 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
6 changes: 1 addition & 5 deletions benchmarks/hiv/hiv.yml
@@ -29,15 +29,11 @@ target_columns: ternary_activity
 # generic args
 output_directory: hiv
 random_seed: 765309408
-# run hyperparameter optimization
-# optimize: True
-# optimized results
-# fnn_layers: 2
-# hidden_size: 2400
 
 # featurization
 input_file: hiv/benchmark_data.csv
 smiles_column: smiles
 descriptors: all
+precomputed: hiv/precomputed.csv
 
 # training
18 changes: 18 additions & 0 deletions benchmarks/hopv15_subset/chemprop_hopv15_subset.sh
@@ -0,0 +1,18 @@
+COUNTER=1
+
+while [ $COUNTER -le 15 ];
+do
+    chemprop_train \
+        --data_path benchmark_data.csv \
+        --smiles_columns smiles \
+        --target_columns pce \
+        --dataset_type regression \
+        --save_dir chemprop_hopv15_subset_${COUNTER}_logs \
+        --epochs 100 \
+        --split_sizes 0.6 0.2 0.2 \
+        --metric mae \
+        --extra_metrics rmse \
+        --batch_size 1024 \
+        --seed $COUNTER
+    COUNTER=$((COUNTER+1))
+done
7 changes: 4 additions & 3 deletions benchmarks/hopv15_subset/hopv15_subset.yml
@@ -32,13 +32,14 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 
 # training
-number_repeats: 5
-number_epochs: 10
+hidden_size: 1100
+fnn_layers: 2
+number_repeats: 15
+number_epochs: 20
 patience: 2
 train_size: 0.6
 val_size: 0.2
1 change: 0 additions & 1 deletion benchmarks/ocelot/ocelot.yml
@@ -29,7 +29,6 @@ descriptors: all
 enable_cache: True
 
 # preprocessing
-rescaling: True
 zero_variance_drop: False
 colinear_drop: False
 