This repository contains all code related to our publication, found here
For manageable datasets (fewer than 1000 objects/items in total to analyze), you can use the PythonAnywhere platform: MIASA app
Unfortunately, the above app is not feasible for larger datasets due to computational power limitations.
This workflow was tested on macOS Monterey Version 12.5 and CentOS Linux 7 (Core)
Python version 3.10.4
Packages: numpy (1.21.5), scipy (1.7.3), pip, pandas (1.4.3), seaborn, regex, scikit-learn (1.1.3), matplotlib (3.5.2), openpyxl, xlrd, statsmodels (0.13.2), sklearn.som
Conda will manage the dependencies of our pipeline. Instructions can be found here: https://docs.conda.io/projects/conda/en/latest/user-guide/install
Create a new environment from the given environment config in env.yml
conda env create -f env/env.yml
This step may take a few minutes.
To activate the environment
conda activate MIASA
Install other packages with pip
pip install scikit-learn-extra
pip install xlrd
pip install sklearn.som
pip install umap-learn
Higher versions of the above packages may also be suitable for MIASA. Feel free to contact us if you encounter unresolved problems.
Because of package changes and updates (clustering methods, UMAP, t-SNE, ...), the figures shown in the Manuscript might not be reproducible. However, results obtained should still be essentially the same.
This folder contains all the code that was used to produce the manuscript's results
Add environment to jupyter notebook
conda deactivate
(only if the environment is activated)
conda install -c anaconda ipykernel
(only if ipykernel is not yet installed)
python -m ipykernel install --user --name=MIASA
miasa_Dist.ipynb, miasa_Corr.ipynb, miasa_GRN.ipynb: Python code for using MIASA on the three dataset problems highlighted in the paper (similarity distances are Euclidean).
NB: the dataset folder 2mRNA_100000 for the two-gene regulatory models must be downloaded from (here) and placed in the folder Manuscript_examples/Data/ (without changing the folder names)
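As an aside, the Euclidean pairwise distance matrices these notebooks rely on can be computed with scipy; the random data below is purely illustrative and stands in for the manuscript datasets:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # toy data: 5 samples, 3 features
D = squareform(pdist(X, metric="euclidean"))  # full 5x5 distance matrix

assert D.shape == (5, 5)
assert np.allclose(np.diag(D), 0.0)  # zero self-distance
assert np.allclose(D, D.T)           # symmetry
```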
miasa_NonEucl_Dist.ipynb, miasa_NonEucl_Corr.ipynb, miasa_NonEucl_GRN.ipynb: using MIASA for the three dataset problems highlighted in the paper (similarity distances are non-Euclidean).
miasa_Dist_SOM.ipynb, miasa_Dist_SOM_MIASA.ipynb, miasa_Dist_NN.ipynb, miasa_Dist_SVM.ipynb: machine learning experiments using MIASA for the distribution dataset highlighted in the paper (similarity distances are non-Euclidean).
class_experiment.py: Python code for classification experiments when the true clusters are known and included in the data-generating function, which must return data in a specific format (e.g. the function generate_data_dist in module Methods/simulate_class_data.py)
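The exact return format expected by class_experiment.py is defined by generate_data_dist in Methods/simulate_class_data.py. As a rough illustration only (the function name, shapes, and return values below are our own assumptions, not the real interface), a data-generating function with known true clusters might look like:

```python
import numpy as np

def generate_toy_data(n_per_cluster=20, n_clusters=3, seed=0):
    """Hypothetical sketch only -- NOT the real interface; see
    generate_data_dist in Methods/simulate_class_data.py."""
    rng = np.random.default_rng(seed)
    data, true_labels = [], []
    for k in range(n_clusters):
        # each cluster is drawn around its own mean
        data.append(rng.normal(loc=3.0 * k, size=(n_per_cluster, 2)))
        true_labels += [k] * n_per_cluster
    return np.vstack(data), np.array(true_labels)

X, labels = generate_toy_data()
```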
For a general execution of the MIASA framework
Snakemake is the workflow management system we use. Install it in your activated environment like this:
conda install -c conda-forge -c bioconda snakemake
NOTE: In case conda is not able to find the packages for snakemake (which was the case for the Linux version), you can install mamba in your environment
conda install -c conda-forge mamba
and download snakemake with
mamba install -c conda-forge -c bioconda snakemake
Detailed Snakemake installation instruction using mamba can be found here: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
The pipeline configuration variables are stored in config.yaml.
For more information about the YAML markup format refer to documentation: https://yaml.org
All input datasets must be pre-processed to meet the following requirements
- The format must be .xlsx or .csv
- A column variable must be included, indicating the variable labels (see all example datasets in the folder dataset_requirement)
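A quick way to check that an input file meets these requirements (a minimal sketch, assuming pandas from the MIASA environment; the helper name is ours):

```python
import pandas as pd

def load_miasa_input(path):
    """Read an input file (.xlsx or .csv) and check for the required
    'variable' column indicating the variable labels."""
    if path.endswith(".xlsx"):
        df = pd.read_excel(path)
    elif path.endswith(".csv"):
        df = pd.read_csv(path)
    else:
        raise ValueError("input must be .xlsx or .csv")
    if "variable" not in df.columns:
        raise ValueError("a column named 'variable' is required")
    return df
```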
"Agglomerative_*"# where * is a linkage method of `sklearn.cluster.AgglomerativeClustering',
"Kmeans", # sklearn.cluster.KMeans
"Kmedoids", # sklearn_extra.cluster.KMedoids
"Spectral", # sklearn.cluster.SpectralClustering
"GMM", # sklearn.mixture.GaussianMixture
"BayesianGMM", # sklearn.mixture.BayesianGaussianMixture
"DBSCAN", # sklearn.cluster.DBSCAN
"MLPClassifier/<true labels file (.xlsx or .csv) with a column true label aligned with original dataset Xs first and then the Ys (see demo/true_labels_files.xlsx)>/<percentage train randomly chosen within the true labels>" # "/" separated, sklearn.neural_network.MLPClassifier using the parameters of manuscript results
"MLPRegressor/<true labels file (.xlsx or .csv) with a column true label aligned with original dataset Xs first and then the Ys (see demo/true_labels_files.xlsx)>/<percentage train randomly chosen within the true labels>" # sklearn.neural_network.MLPRegressor using the parameters of manuscript results
"SVM_SVC/<true labels file (.xlsx or .csv) with a column true label aligned with original dataset Xs first and then the Ys (see demo/true_labels_files.xlsx)>/<percentage train randomly chosen within the true labels>" # sklearn.svm.SVC using the parameters of manuscript results
"SOM/<a positive number (that will be multiplied with 1/c3zeta to give the learning rate parameter)>" # sklearn.som default initialization
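The classifier entries above are "/"-separated; purely as an illustration of how such a string decomposes (the variable names and example values below are our own placeholders):

```python
# Hypothetical example entry: method name, true-labels file, train percentage.
entry = "SVM_SVC/demo/true_labels_files.xlsx/0.8"

method, *rest = entry.split("/")
labels_file = "/".join(rest[:-1])   # the labels-file path may itself contain "/"
train_fraction = float(rest[-1])
```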
If your environment is not yet activated, type
conda activate MIASA
Go to the pipeline directory (where the Snakefile named MIASA
is located) and enter the following command to execute the pipeline
snakemake --snakefile MIASA --configfile path/to/config.yaml -j -d path/to/workdir
**CAUTION**: Please delete all the files generated in the folder `plots/`
and re-run the above command to make sure that the plots correspond to the results
With the parameter --configfile
you can give the configuration file described above. The -j
parameter determines the number of available CPU cores to use in the pipeline; optionally, you can provide the number of cores, e.g. -j 4
. With the parameter -d
you can set the work directory, i.e. where the results of the pipeline are written.
The main pipeline (config.yaml
) creates a folder results, containing all (intermediate) output, with the following structure:
|-- results
|-- qEE_Transformed_Dataset.xlsx/csv # standard machine readable formats of the transformed dataset via qEE-Transition (with 10 decimals numerical precision)
|-- miasa_results.pck # pickled python dictionary containing the results with essential keys: "Coords" (embedded coordinates on the rows: for X dataset starting from row 1 and ordered as in the original dataset with or without an origin, at row index M+1 and the rest is the Y dataset ordered as in the original dataset) and "Class_pred" (the predicted cluster indexes for the rows of "Coords")
|-- plots
|-- UMAP_One_Panel.pdf/.svg # UMAP projection of the results
|-- UMAP_Separate_Panels.pdf/.svg # UMAP projection separate predicted panels
|-- tSNE_One_Panel.pdf/.svg # t-SNE projection of the results
|-- tSNE_Separate_Panels.pdf/.svg # t-SNE projection of the results, separate predicted panels
|-- scores
|-- scored_miasa_results.pck # pickled python dictionary containing the results including cluster score vectors in the keys "silhouette", "elbow", "distortion", corresponding to the array of number of clusters saved as key "list_num".
|-- Cluster_scores.pdf/.svg # cluster score plots (Elbow, Distortion, Silhouette)
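The pickled result dictionaries can be inspected directly from Python; a minimal sketch (the helper name is ours, and the default path assumes the work directory chosen with -d):

```python
import pickle

def load_miasa_results(path="results/miasa_results.pck"):
    """Load the pickled MIASA result dictionary and return its two
    essential entries: embedded coordinates and predicted clusters."""
    with open(path, "rb") as f:
        res = pickle.load(f)
    # "Coords": embedded coordinates (X rows first, optional origin, then Y rows)
    # "Class_pred": predicted cluster index per row of "Coords"
    return res["Coords"], res["Class_pred"]
```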
Demo datasets are provided in the repository folder demo
If your environment is not yet activated, type
conda activate MIASA
To run the pipeline go into the repository where the snakefile MIASA
is located and run
snakemake --snakefile MIASA --configfile demo/demo_config.yaml -j -d demo
**CAUTION**: Please delete all the files generated in the folder `plots/`
and re-run the above command to make sure that the plots correspond to the results
Expected runtime: less than 5 minutes
Deactivate the environment to exit the pipeline
conda deactivate
The result folder is created in the demo
folder where you find the output files, as described above.
The projection of all predicted clusters on the same panel, indicating the convex hulls of predicted clusters (some outliers excluded) and the labels of true cluster members (if true labels are given as a parameter in the config file)
The projection of all predicted clusters on separate panels:
- With convex hull of prediction (some outliers excluded)
- Without convex hull of prediction
Cluster evaluation (Wikipedia)
Caution must be taken when re-parameterizing simulations via config.yaml:
snakemake does not re-execute rules whose result files are already present (unless an input file is updated by another rule), so remove older files from the results folder when needed.
Some package-related issues might still arise during code execution; however, most solutions to these issues can be found online. For example, here are some issues we encountered
Error message about parameter issues in the snakemake file. This might be a snakefile formatting issue, which can be solved as follows
First install snakefmt into the MIASA
environment
pip install snakefmt
Then, when needed, reformat snakefile
snakefmt MIASA
In case you had to interrupt a snakemake run (e.g. with Ctrl + Z), you need to remove the folder workdir/.snakemake/locks/
rm -rf workdir/.snakemake/locks
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed
Typing the following sequence of commands solves this issue (see stackoverflow)
pip uninstall -y numpy
pip uninstall -y setuptools
pip install setuptools
pip install numpy
File ... from scipy.optimize import root ... .../usr/lib/liblapack.3.dylib (no such file)
This is a problem with scipy that is resolved by uninstalling and re-installing scipy with pip (see saturncloud)
pip uninstall scipy
pip install scipy
AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'
Solution:
pip uninstall pyOpenSSL
pip install pyOpenSSL
Library not loaded: @rpath/liblapack.3.dylib
Solution:
pip install --upgrade --force-reinstall scikit-learn
If the conda installation fails, please use the following commands to install it manually:
conda create --name MIASA
conda activate MIASA
conda install -c conda-forge -c bioconda -c anaconda python==3.10.4 numpy==1.21.5 scipy==1.7.3 openpyxl pandas==1.4.3 matplotlib seaborn joblib regex pip scikit-learn==1.1.3
Proceed as above, or use pip
to install any other missing packages prompted by error messages