Added program documentation

stephensolis · Jul 21, 2018 · c8f351a
1 parent 0331140

Showing 9 changed files with 160 additions and 17 deletions.
diff --git a/README.md b/README.md
@@ -32,6 +32,76 @@
     </a>
 </p>
 
+## Installing
+
+There are three ways to install this software. Choose whichever one is best for your needs:
+
+**1. If you already have Python 2.7 or 3.4+ installed (recommended):**
+
+Run `pip install kameris`.
+
+**2. If you do not have Python installed or are unable to install software:**
+
+[Click here](https://github.com/stephensolis/kameris/releases/latest) and download the version corresponding to your operating system.
+If you use Linux or macOS, you may need to run `chmod +x "path to downloaded program"`.
+
+**3. If you are a developer or want to build your own version of Kameris:**
+
+Clone this repository, then run `make install`.
+
+## Quick demo
+
+This software is able to train sequence classification models and use them to make predictions.
+
+Before following these instructions, make sure you've installed the software.
+If you followed option **1** above and the command `kameris` doesn't work for you, try using `python -m kameris` instead.
+If you followed option **2** above and downloaded an executable, replace `kameris` in the instructions below with the name of the executable you downloaded.
+
+### Classifying sequences with an existing model
+
+First, let's classify some HIV-1 sequences.
+
+1. Start by downloading [this zip file containing HIV-1 genomes](https://raw.githubusercontent.com/stephensolis/kameris/master/demo/hiv1-genomes.zip) and extracting it to a folder.
+2. Run `kameris classify hiv1-mlp "path to extracted files"`.
+
+This will output the top subtype match for each sequence and write all results to a new file `results.json`.
+
+The `hiv1-mlp` model is able to give class probabilities and a ranked list of predictions, but some models are only able to report the top match. For example, try `kameris classify hiv1-linearsvm "path to extracted files"`.
+
+To see other available models, go to https://github.com/stephensolis/kameris-experiments/tree/master/models.
+
+### Training a new model
+
+Now, let's train our own HIV-1 sequence classification models.
+
+1. Create an empty folder and open a terminal in the folder.
+2. Create folders `data` and `output`.
+3. Run `kameris run-job https://raw.githubusercontent.com/stephensolis/kameris/master/demo/hiv1-lanl.yml https://raw.githubusercontent.com/stephensolis/kameris/master/demo/settings.yml`.
+
+This job automatically downloads the required datasets and trains a simpler version of the [hiv1/lanl-whole experiment from kameris-experiments](https://github.com/stephensolis/kameris-experiments).
+Depending on your computer's performance and internet speed, it may take 5-10 minutes to run.
+The hiv1/lanl-whole experiment was the exact job used to train the models from the previous section, and these are the same models used in the paper ["An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes"](https://www.biorxiv.org/content/early/2018/07/05/362780).
+
+Now, open `output/hiv1-lanl-whole`. You will notice folders were created for each value of `k`. Within each folder are several files:
+- `fasta` contains the FASTA files extracted from the downloaded dataset used for model training and evaluation.
+- `metadata.json` contains metadata on the FASTA files used to determine the class for each sequence.
+- `cgrs.mm-repr` contains feature vectors for each sequence. See the mentioned paper for more technical details.
+- `classification-kmers.json` contains evaluation results after using cross-validation on the dataset. See the mentioned paper for more technical details.
+- The `.mm-model` files contain trained models which may be passed to `kameris classify` in order to classify new sequences. **Note** that models trained using Python 2 will not run under Python 3 and vice versa.
+- `log.txt` is a log file containing all the output printed during job execution.
+- `rerun-experiment.yml` is a file which may be passed to `kameris run-job` in order to re-run the job and obtain exactly the files found in this directory.
+
+Kameris also includes functionality to summarize results in easy-to-read tables. Try it by running `kameris summarize output/hiv1-lanl-whole`.
+
+You can change the settings used to train the model: first download the files [hiv1-lanl.yml](https://raw.githubusercontent.com/stephensolis/kameris/master/demo/hiv1-lanl.yml) and [settings.yml](https://raw.githubusercontent.com/stephensolis/kameris/master/demo/settings.yml).
+Training settings are found in `hiv1-lanl.yml` -- try changing the value of `k` or uncommenting different classifier types.
+File storage and logging settings are found in `settings.yml`.
+After making changes, run `kameris run-job hiv1-lanl.yml settings.yml` to train your model.
+
+[//]: # (## Documentation)
+
+## Dependencies
+
 This project uses:
 
 - [stephensolis/kameris-backend](https://github.com/stephensolis/kameris-backend) to generate k-mer count vectors and distance matrices

diff --git a/demo/hiv1-genomes.zip b/demo/hiv1-genomes.zip
diff --git a/demo/hiv1-lanl.yml b/demo/hiv1-lanl.yml
@@ -0,0 +1,53 @@
+name: hiv1-lanl-whole
+
+experiments:
+  subtype:
+    expand_options:
+      k: 5..6
+    min_group_pts: 18
+    include_recombinants: true
+    dataset:
+      archive: hiv1
+      archive_folder: lanl-whole
+      metadata: hiv1-lanl-whole
+    selection_key: subtype
+    groups: |
+      lambda options, metadata:
+        import collections
+        counts = collections.Counter(x[options['selection_key']] for x in metadata)
+        return {v: {'selection_key': options['selection_key'], 'values': [v]} for v in counts if v and counts[v] >= options['min_group_pts']}
+
+steps:
+  - type: select
+    copy_for_options: [k]
+    pick_group: |
+      lambda metadata, group_options, options:
+        return [x for x in metadata if (options['include_recombinants'] or not x['recombinant']) and
+                                       x[group_options['selection_key']] in group_options['values']]
+
+  - type: kmers
+    output_file: cgrs.mm-repr
+    mode: frequencies
+    k: from_options
+    bits_per_element: 16
+
+  - type: classify
+    features_file: cgrs.mm-repr
+    output_file: classification-kmers.json
+    validation_count: 10
+    classifiers:
+      #- 10-nearest-neighbors
+      #- nearest-centroid-mean
+      #- nearest-centroid-median
+      #- logistic-regression
+      #- sgd
+      - linear-svm
+      #- quadratic-svm
+      #- cubic-svm
+      #- decision-tree
+      #- random-forest
+      #- adaboost
+      #- gaussian-naive-bayes
+      #- lda
+      #- qda
+      - multilayer-perceptron
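
Note on `demo/hiv1-lanl.yml`: the `groups` and `pick_group` entries are written in the job file's own lambda-style notation. The following is a minimal sketch of the same logic as plain Python functions, to show which classes survive the `min_group_pts` threshold and which sequences end up in a group. The metadata records are hypothetical, and the threshold is lowered from 18 to 2 so the toy data produces groups. How Kameris itself evaluates these entries is not shown in this diff.

```python
import collections

# Hypothetical metadata records; field names mirror those referenced in the job file.
metadata = [
    {'id': 's1', 'subtype': 'B', 'recombinant': False},
    {'id': 's2', 'subtype': 'B', 'recombinant': False},
    {'id': 's3', 'subtype': 'C', 'recombinant': False},
    {'id': 's4', 'subtype': '01_AE', 'recombinant': True},
    {'id': 's5', 'subtype': '01_AE', 'recombinant': True},
    {'id': 's6', 'subtype': '', 'recombinant': False},
]


def groups(options, metadata):
    """Keep every non-empty class with at least min_group_pts members."""
    counts = collections.Counter(x[options['selection_key']] for x in metadata)
    return {
        v: {'selection_key': options['selection_key'], 'values': [v]}
        for v in counts
        if v and counts[v] >= options['min_group_pts']
    }


def pick_group(metadata, group_options, options):
    """Select the records belonging to one group, optionally dropping recombinants."""
    return [
        x for x in metadata
        if (options['include_recombinants'] or not x['recombinant'])
        and x[group_options['selection_key']] in group_options['values']
    ]


# min_group_pts is 18 in the job file; 2 is used here only for the toy data.
options = {'selection_key': 'subtype', 'min_group_pts': 2,
           'include_recombinants': True}
selected_groups = groups(options, metadata)
b_members = pick_group(metadata, selected_groups['B'], options)

# With recombinants excluded, the recombinant-only group becomes empty.
strict_options = dict(options, include_recombinants=False)
ae_members = pick_group(metadata, selected_groups['01_AE'], strict_options)
```

In this toy data, subtype `C` has a single member and the empty subtype is falsy, so neither becomes a group.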
diff --git a/samples/settings.yml → demo/settings.yml b/samples/settings.yml → demo/settings.yml
@@ -2,11 +2,11 @@
 # this is required
 local_dirs:
   # the directory containing zipped datasets
-  archives: /data/archives
+  archives: data
   # the directory containing JSON metadata files
-  metadata: /data/metadata
+  metadata: data
   # the directory for storage of job output
-  output: /data/output
+  output: output
 
 # if desired, specifies an external service to use for logging
 # this is optional

diff --git a/kameris/__init__.py b/kameris/__init__.py
@@ -1,3 +1,3 @@
 from __future__ import unicode_literals
 
-__version__ = '0.6.dev1'
+__version__ = '1.0.0'
diff --git a/kameris/__main__.py b/kameris/__main__.py
@@ -44,7 +44,8 @@ def main():
     except Exception as e:
         log = logging.getLogger('kameris')
         message = 'an unexpected error occurred: {}: {}'.format(
-            type(e).__name__, e.message or str(e)
+            type(e).__name__,
+            (e.message if hasattr(e, 'message') else '') or str(e)
         )
         if log.handlers:
             log.error(message)
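
Note on the `kameris/__main__.py` change: Python 3 exceptions have no `message` attribute, so the old `e.message or str(e)` expression would itself raise `AttributeError` while reporting an error. The sketch below shows the same fallback written with `getattr`, which behaves like the `hasattr` check in the hunk. `describe_error` and `LegacyError` are hypothetical names used only for illustration.

```python
def describe_error(e):
    """Format an exception, preferring a non-empty `message` attribute if present."""
    # getattr with a default avoids AttributeError on Python 3 exceptions.
    message = getattr(e, 'message', '') or str(e)
    return 'an unexpected error occurred: {}: {}'.format(
        type(e).__name__, message
    )


class LegacyError(Exception):
    """Hypothetical exception that still sets a `message` attribute."""

    def __init__(self, message):
        super(LegacyError, self).__init__(message)
        self.message = message


plain = describe_error(ValueError('bad input'))
legacy = describe_error(LegacyError('old style'))
```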

diff --git a/kameris/schemas/file_urls.json b/kameris/schemas/file_urls.json
@@ -2,19 +2,33 @@
     "$schema": "http://json-schema.org/draft-04/schema#",
     "type": "object",
     "properties": {
-        "metadata": {"$ref": "#/definitions/url_list"},
-        "archives": {"$ref": "#/definitions/url_list"},
-        "models": {"$ref": "#/definitions/url_list"}
+        "metadata": {
+            "type": "object",
+            "additionalProperties": {"$ref": "#/definitions/url"}
+        },
+        "archives": {
+            "type": "object",
+            "additionalProperties": {"$ref": "#/definitions/url"}
+        },
+        "models": {
+            "type": "object",
+            "additionalProperties": {
+                "type": "object",
+                "properties": {
+                    "python2": {"$ref": "#/definitions/url"},
+                    "python3": {"$ref": "#/definitions/url"}
+                },
+                "additionalProperties": false,
+                "required": ["python2", "python3"]
+            }
+        }
     },
     "additionalProperties": false,
 
     "definitions": {
-        "url_list": {
-            "type": "object",
-            "additionalProperties": {
-                "type": "string",
-                "pattern": "http(s)?://.*"
-            }
+        "url": {
+            "type": "string",
+            "pattern": "http(s)?://.*"
         }
     }
 }
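
Note on the `file_urls.json` schema change: `metadata` and `archives` still map a name to a single URL, while each entry under `models` must now provide both a `python2` and a `python3` URL. The sketch below shows a document of that shape as a Python dict, plus a lookup that mirrors the version-keyed selection added to `url_for_file`. The `example.com` URLs and the `lookup_url` helper are hypothetical.

```python
import sys

# Hypothetical URLs file contents, shaped to match the updated schema.
urls = {
    'metadata': {
        'hiv1-lanl-whole': 'https://example.com/metadata/hiv1-lanl-whole.json',
    },
    'archives': {
        'hiv1': 'https://example.com/archives/hiv1.zip',
    },
    'models': {
        'hiv1-mlp': {
            'python2': 'https://example.com/models/py2/hiv1-mlp.mm-model',
            'python3': 'https://example.com/models/py3/hiv1-mlp.mm-model',
        },
    },
}


def lookup_url(urls, filetype, filename):
    """Return the URL for a file, choosing the per-interpreter entry for models."""
    entry = urls[filetype][filename]
    if filetype == 'models':
        # Models are stored per major Python version, e.g. 'python3'.
        return entry['python{}'.format(sys.version_info.major)]
    return entry


archive_url = lookup_url(urls, 'archives', 'hiv1')
model_url = lookup_url(urls, 'models', 'hiv1-mlp')
```

Keeping one model URL per major version fits the README note that models trained under Python 2 will not run under Python 3 and vice versa.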
diff --git a/kameris/subcommands/classify.py b/kameris/subcommands/classify.py
@@ -25,7 +25,7 @@ def run(args):
             model_url = args.model
         else:
             model_url = download_utils.url_for_file(args.model + '.mm-model',
-                                                    args.urls_file, 'model')
+                                                    args.urls_file, 'models')
         model_file = download_utils.open_url_cached(model_url, 'rb',
                                                     args.force_download)
 

diff --git a/kameris/utils/download_utils.py b/kameris/utils/download_utils.py
@@ -8,6 +8,7 @@
 import requests
 from ruamel.yaml import YAML
 from six.moves import urllib
+import sys
 from tqdm import tqdm
 
 from . import defaults, fs_utils, job_utils
@@ -53,7 +54,11 @@ def url_for_file(path, urls_file, filetype):  # NOQA (cache line above)
         ))
 
     filename = os.path.splitext(os.path.basename(path))[0]
-    return urls[filetype][filename]
+    if filetype == 'models':
+        python_ver = 'python{}'.format(sys.version_info.major)
+        return urls[filetype][filename][python_ver]
+    else:
+        return urls[filetype][filename]
 
 
 def open_url_cached(url, mode, force_download=False):
@@ -63,7 +68,7 @@ def open_url_cached(url, mode, force_download=False):
                              'cache')
     fs_utils.mkdir_p(cache_dir)
 
-    cache_key = hashlib.md5(url).hexdigest()
+    cache_key = hashlib.md5(url.encode('utf-8')).hexdigest()
     cache_filename = os.path.join(cache_dir, cache_key)
     if not force_download and os.path.exists(cache_filename):
         log.info("file '%s' already downloaded, using cached version", url)
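
Note on the `cache_key` change: on Python 3, `hashlib.md5` accepts only bytes-like input, so passing a text URL raises `TypeError`. Encoding the URL first yields a cache key on either interpreter. A minimal sketch, using a hypothetical URL and a hypothetical `cache_key_for` helper:

```python
import hashlib
import sys


def cache_key_for(url):
    """Derive a cache filename from a URL by hashing its UTF-8 bytes."""
    return hashlib.md5(url.encode('utf-8')).hexdigest()


# Hypothetical URL, used only to exercise the helper.
url = 'https://example.com/models/py3/hiv1-mlp.mm-model'
key = cache_key_for(url)

# Hashing the text directly fails on Python 3, where str is not bytes-like.
try:
    hashlib.md5(url)
    raised_on_text = False
except TypeError:
    raised_on_text = True
```

The digest is a 32-character hexadecimal string, which makes it safe to use directly as a filename inside the cache directory.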