Improvements in documentation
Old-Shatterhand committed Sep 3, 2024
1 parent f41c122 commit 76c633d
Showing 6 changed files with 16,261 additions and 9 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,11 @@
- [ ] Replace GraKel with something "modern" and fully "conda-installable" to make DataSAIL fully conda-installable
- [ ] Include [MashMap3](https://github.com/marbl/MashMap)
- [ ] Include MASH for amino acid sequences
- [ ] Custom clustering methods ([Issue #25](https://github.com/kalininalab/DataSAIL/issues/25))

## v1.0.1 (2024-05-08) till v1.0.7 (2024-06-27)

- Bug fixes in stratification

## v1.0.0 (2024-04-04)

6 changes: 3 additions & 3 deletions README.md
@@ -44,7 +44,7 @@ pip install grakel
to install DataSAIL in an already existing environment. Alternatively, one can install DataSAIL-lite from conda.
DataSAIL-lite is a version of DataSAIL that does not install all clustering algorithms as the standard DataSAIL.

DataSAIL is available from Python 3.8 and newer.
DataSAIL is available for Python 3.8 and newer.

## Usage

@@ -55,7 +55,7 @@ datasail --e-type P --e-data <path_to_fasta> --e-sim mmseqs --output <path_to_ou
````

to split a set of proteins that have been clustered using mmseqs. For a full list of arguments, run `datasail -h` and
checkout [ReadTheDocs](https://datasail.readthedocs.io/en/latest/index.html).
checkout [ReadTheDocs](https://datasail.readthedocs.io/en/latest/index.html). There is a more detailed explanation of the arguments and example notebooks.

## When to use DataSAIL and when not to use

@@ -73,7 +73,7 @@ different from your training data but not if the data in the application is more

If you used DataSAIL to split your data, please cite DataSAIL in your publication.
````
@article{joeres2022datasail,
@article{joeres2023datasail,
title={DataSAIL: Data Splitting Against Information Leakage},
author={Joeres, Roman and Blumenthal, David B. and Kalinina, Olga V},
journal={bioRxiv},
6 changes: 3 additions & 3 deletions experiments/DTI/split.py
@@ -146,9 +146,9 @@ def split_w_graphpart(base_path: Path) -> None:

def main(path):
    split_w_datasail(path, TECHNIQUES["datasail"])
    # split_w_deepchem(path, TECHNIQUES["deepchem"])
    # split_w_lohi(path)
    # split_w_graphpart(path)
    split_w_deepchem(path, TECHNIQUES["deepchem"])
    split_w_lohi(path)
    split_w_graphpart(path)


if __name__ == '__main__':
3 changes: 1 addition & 2 deletions experiments/DTI/visualize.py
@@ -552,6 +552,5 @@ def plot(full_path: Path):


if __name__ == '__main__':
    # plot(Path(sys.argv[1]))
    comp_il()

    plot(Path(sys.argv[1]))
33 changes: 32 additions & 1 deletion experiments/README.md
@@ -2,4 +2,35 @@

-------------

blub
For the publication, we have conducted several experiments:

1. Splitting of data for drug-target interaction data,
2. Splitting of data for Molecular Property Prediction,
3. Splitting of data with samples belonging to either of two classes for stratified splits,

and some ablation studies based on the data above. The experiments cover all possible applications of DataSAIL. Each
experiment folder is structured in the same way:

1. `split.py`: Contains the code used for splitting using DataSAIL or baseline tools.
2. `train.py`: Contains the code to train the different models on the split data.
3. `visualize.py`: Contains the code to visualize the results of the training.

All can be executed in the same way:

```shell
python -m experiments.<experiment>.<script> <path/to/storage-folder>
```

where `<experiment>` is the name of the experiment type (`DTI`, `MPP`, or `Strat`) and `<script>` is the name of the
script (`split`, `train`, or `visualize`). Lastly, `<path/to/storage-folder>` is the path to a folder where the results
from the previous step can be found and where new results will be stored. Because each script relies on the results of
the previous step, the scripts must be run in order. For example, to run the entire DTI experiment pipeline, run:

```shell
python -m experiments.DTI.split scratch/DataSAIL_results/DTI
python -m experiments.DTI.train scratch/DataSAIL_results/DTI
python -m experiments.DTI.visualize scratch/DataSAIL_results/DTI
```

where the path can be exchanged with any other path.
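
Because the three steps must run in a fixed order, the invocation pattern described above could also be driven from a small wrapper. The following is only a sketch: `build_commands` and `run_pipeline` are hypothetical helper names that are not part of the repository, and the sketch assumes it is started from the repository root so that `python -m experiments.<experiment>.<script>` resolves.

```python
import subprocess
import sys
from typing import List

# Step and experiment names as listed in the README text above.
STEPS = ("split", "train", "visualize")
EXPERIMENTS = ("DTI", "MPP", "Strat")


def build_commands(experiment: str, storage: str) -> List[List[str]]:
    """Return one command per pipeline step, in the order they must run.

    Hypothetical helper: it only assembles the `python -m ...` calls.
    """
    if experiment not in EXPERIMENTS:
        raise ValueError(f"unknown experiment: {experiment!r}")
    return [
        [sys.executable, "-m", f"experiments.{experiment}.{step}", storage]
        for step in STEPS
    ]


def run_pipeline(experiment: str, storage: str) -> None:
    """Run split, train, and visualize sequentially.

    check=True raises CalledProcessError on a failing step, so later steps
    never run on missing results.
    """
    for cmd in build_commands(experiment, storage):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("DTI", "scratch/DataSAIL_results/DTI")` would then issue the same three commands as the shell example, stopping at the first step that fails.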