Added program documentation
stephensolis committed Jul 21, 2018
1 parent 0331140 commit c8f351a
Showing 9 changed files with 160 additions and 17 deletions.
70 changes: 70 additions & 0 deletions README.md
@@ -32,6 +32,76 @@
</a>
</p>

## Installing

There are three ways to install this software. Choose whichever one is best for your needs:

**1. If you already have Python 2.7 or 3.4+ installed (recommended):**

Run `pip install kameris`.

**2. If you do not have Python installed or are unable to install software:**

[Click here](https://github.com/stephensolis/kameris/releases/latest) and download the version corresponding to your operating system.
If you use Linux or macOS, you may need to run `chmod +x "path to downloaded program"`.

**3. If you are a developer or want to build your own version of Kameris:**

Clone this repository, then run `make install`.

## Quick demo

This software is able to train sequence classification models and use them to make predictions.

Before following these instructions, make sure you've installed the software.
If you followed option **1** above and the command `kameris` doesn't work for you, try using `python -m kameris` instead.
If you followed option **2** above and downloaded an executable, replace `kameris` in the instructions below with the name of the executable you downloaded.
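
The fallback described above can also be applied from a script. The following is a minimal sketch, assuming Python 3 (for `shutil.which`) and assuming Kameris was installed into the same interpreter that runs the script; `resolve_command` is a hypothetical helper, not part of Kameris.

```python
import shutil
import sys


def resolve_command(name="kameris"):
    """Return an argv prefix that should launch the tool (hypothetical helper)."""
    # Prefer the console script if it can be found on PATH.
    if shutil.which(name) is not None:
        return [name]
    # Otherwise fall back to `python -m <name>`, using the interpreter
    # that is running this script (an assumption about where it was installed).
    return [sys.executable, "-m", name]


print(resolve_command())
```

The returned list can be extended with a subcommand and its arguments and passed to `subprocess.call`.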

### Classifying sequences with an existing model

First, let's classify some HIV-1 sequences.

1. Start by downloading this zip file containing HIV-1 genomes, and extract it to a folder: https://raw.githubusercontent.com/stephensolis/kameris/master/demo/hiv1-genomes.zip.
2. Run `kameris classify hiv1-mlp "path to extracted files"`.

This will output the top subtype match for each sequence and write all results to a new file `results.json`.

The `hiv1-mlp` model is able to give class probabilities and a ranked list of predictions, but some models are only able to report the top match. To see the difference, try `kameris classify hiv1-linearsvm "path to extracted files"`.

To see other available models, go to https://github.com/stephensolis/kameris-experiments/tree/master/models.

### Training a new model

Now, let's train our own HIV-1 sequence classification models.

1. Create an empty folder and open a terminal in the folder.
2. Create folders `data` and `output`.
3. Run `kameris run-job https://raw.githubusercontent.com/stephensolis/kameris/master/demo/hiv1-lanl.yml https://raw.githubusercontent.com/stephensolis/kameris/master/demo/settings.yml`

Depending on your computer's performance and internet speed, it may take 5-10 minutes to run.
This will automatically download the required datasets and train a simpler version of the [hiv1/lanl-whole experiment from kameris-experiments](https://github.com/stephensolis/kameris-experiments).
This was the exact job used to train the models from the previous section, and these are the same models used in the paper ["An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes"](https://www.biorxiv.org/content/early/2018/07/05/362780).
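
Steps 1 to 3 above can be scripted if you prefer. This is a minimal sketch, not part of Kameris: `prepare_demo` is a hypothetical helper that creates the `data` and `output` folders and returns the `run-job` command as an argument list, using the same two URLs as step 3.

```python
import os

# Base URL of the demo files, as given in step 3.
DEMO_BASE = "https://raw.githubusercontent.com/stephensolis/kameris/master/demo/"


def prepare_demo(workdir):
    """Create `data` and `output` under workdir and return the run-job argv."""
    for name in ("data", "output"):
        path = os.path.join(workdir, name)
        # Skip folders that already exist so the helper can be re-run safely.
        if not os.path.isdir(path):
            os.makedirs(path)
    return ["kameris", "run-job",
            DEMO_BASE + "hiv1-lanl.yml",
            DEMO_BASE + "settings.yml"]
```

To actually start the job, pass the result to `subprocess.call(argv, cwd=workdir)`; the relative paths `data` and `output` in `settings.yml` are then resolved against that folder.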

Now, open `output/hiv1-lanl-whole`. You will notice folders were created for each value of `k`. Within each folder are several files:
- `fasta` contains the FASTA files extracted from the downloaded dataset used for model training and evaluation.
- `metadata.json` contains metadata on the FASTA files used to determine the class for each sequence.
- `cgrs.mm-repr` contains feature vectors for each sequence. See the mentioned paper for more technical details.
- `classification-kmers.json` contains evaluation results after using cross-validation on the dataset. See the mentioned paper for more technical details.
- The `.mm-model` files contain trained models which may be passed to `kameris classify` in order to classify new sequences. **Note** that models trained using Python 2 will not run under Python 3 and vice versa.
- `log.txt` is a log file containing all the output printed during job execution.
- `rerun-experiment.yml` is a file which may be passed to `kameris run-job` in order to re-run the job and obtain exactly the files found in this directory.

Kameris also includes functionality to summarize results in easy-to-read tables. Try it by running `kameris summarize output/hiv1-lanl-whole`.

You can change the settings used to train the model: first download the files [hiv1-lanl.yml](https://raw.githubusercontent.com/stephensolis/kameris/master/demo/hiv1-lanl.yml) and [settings.yml](https://raw.githubusercontent.com/stephensolis/kameris/master/demo/settings.yml).
Training settings are found in `hiv1-lanl.yml` -- try changing the value of `k` or uncommenting different classifier types.
File storage and logging settings are found in `settings.yml`.
After making changes, run `kameris run-job hiv1-lanl.yml settings.yml` to train your model.
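
Before editing `hiv1-lanl.yml`, it may help to see what its `groups` and `pick_group` entries compute. The sketch below restates them as ordinary Python functions and runs them on made-up records. The function names, the toy metadata and the threshold of 2 (the demo file uses 18) are assumptions for illustration only.

```python
import collections


def make_groups(options, metadata):
    # One group per class that has at least `min_group_pts` sequences;
    # records with an empty class label are ignored.
    counts = collections.Counter(x[options["selection_key"]] for x in metadata)
    return {
        v: {"selection_key": options["selection_key"], "values": [v]}
        for v in counts
        if v and counts[v] >= options["min_group_pts"]
    }


def pick_group(metadata, group_options, options):
    # Keep the records of one group, optionally dropping recombinants.
    return [
        x for x in metadata
        if (options["include_recombinants"] or not x["recombinant"])
        and x[group_options["selection_key"]] in group_options["values"]
    ]


# Made-up records; real metadata comes from the downloaded dataset.
metadata = (
    [{"subtype": "B", "recombinant": False}] * 3
    + [{"subtype": "C", "recombinant": False}] * 2
    + [{"subtype": "01_AE", "recombinant": True}] * 2
    + [{"subtype": "", "recombinant": False}]
)
options = {"selection_key": "subtype", "min_group_pts": 2,
           "include_recombinants": False}

groups = make_groups(options, metadata)
for name in sorted(groups):
    print(name, len(pick_group(metadata, groups[name], options)))
```

Note that a class can pass the `min_group_pts` threshold and still end up empty when `include_recombinants` is false, because the count is taken before recombinants are filtered out.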

[//]: # (## Documentation)

## Dependencies

This project uses:

- [stephensolis/kameris-backend](https://github.com/stephensolis/kameris-backend) to generate k-mer count vectors and distance matrices
Binary file added demo/hiv1-genomes.zip
Binary file not shown.
53 changes: 53 additions & 0 deletions demo/hiv1-lanl.yml
@@ -0,0 +1,53 @@
name: hiv1-lanl-whole

experiments:
  subtype:
    expand_options:
      k: 5..6
    min_group_pts: 18
    include_recombinants: true
    dataset:
      archive: hiv1
      archive_folder: lanl-whole
      metadata: hiv1-lanl-whole
    selection_key: subtype
    groups: |
      lambda options, metadata:
        import collections
        counts = collections.Counter(x[options['selection_key']] for x in metadata)
        return {v: {'selection_key': options['selection_key'], 'values': [v]} for v in counts if v and counts[v] >= options['min_group_pts']}
steps:
  - type: select
    copy_for_options: [k]
    pick_group: |
      lambda metadata, group_options, options:
        return [x for x in metadata if (options['include_recombinants'] or not x['recombinant']) and
                x[group_options['selection_key']] in group_options['values']]
  - type: kmers
    output_file: cgrs.mm-repr
    mode: frequencies
    k: from_options
    bits_per_element: 16

  - type: classify
    features_file: cgrs.mm-repr
    output_file: classification-kmers.json
    validation_count: 10
    classifiers:
      #- 10-nearest-neighbors
      #- nearest-centroid-mean
      #- nearest-centroid-median
      #- logistic-regression
      #- sgd
      - linear-svm
      #- quadratic-svm
      #- cubic-svm
      #- decision-tree
      #- random-forest
      #- adaboost
      #- gaussian-naive-bayes
      #- lda
      #- qda
      - multilayer-perceptron
6 changes: 3 additions & 3 deletions samples/settings.yml → demo/settings.yml
@@ -2,11 +2,11 @@
 # this is required
 local_dirs:
   # the directory containing zipped datasets
-  archives: /data/archives
+  archives: data
   # the directory containing JSON metadata files
-  metadata: /data/metadata
+  metadata: data
   # the directory for storage of job output
-  output: /data/output
+  output: output

 # if desired, specifies an external service to use for logging
 # this is optional
2 changes: 1 addition & 1 deletion kameris/__init__.py
@@ -1,3 +1,3 @@
 from __future__ import unicode_literals

-__version__ = '0.6.dev1'
+__version__ = '1.0.0'
3 changes: 2 additions & 1 deletion kameris/__main__.py
@@ -44,7 +44,8 @@ def main():
     except Exception as e:
         log = logging.getLogger('kameris')
         message = 'an unexpected error occurred: {}: {}'.format(
-            type(e).__name__, e.message or str(e)
+            type(e).__name__,
+            (e.message if hasattr(e, 'message') else '') or str(e)
         )
         if log.handlers:
             log.error(message)
32 changes: 23 additions & 9 deletions kameris/schemas/file_urls.json
@@ -2,19 +2,33 @@
     "$schema": "http://json-schema.org/draft-04/schema#",
     "type": "object",
     "properties": {
-        "metadata": {"$ref": "#/definitions/url_list"},
-        "archives": {"$ref": "#/definitions/url_list"},
-        "models": {"$ref": "#/definitions/url_list"}
+        "metadata": {
+            "type": "object",
+            "additionalProperties": {"$ref": "#/definitions/url"}
+        },
+        "archives": {
+            "type": "object",
+            "additionalProperties": {"$ref": "#/definitions/url"}
+        },
+        "models": {
+            "type": "object",
+            "additionalProperties": {
+                "type": "object",
+                "properties": {
+                    "python2": {"$ref": "#/definitions/url"},
+                    "python3": {"$ref": "#/definitions/url"}
+                },
+                "additionalProperties": false,
+                "required": ["python2", "python3"]
+            }
+        }
     },
     "additionalProperties": false,

     "definitions": {
-        "url_list": {
-            "type": "object",
-            "additionalProperties": {
-                "type": "string",
-                "pattern": "http(s)?://.*"
-            }
+        "url": {
+            "type": "string",
+            "pattern": "http(s)?://.*"
         }
     }
 }
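
To make the schema change concrete, here is a minimal sketch of a URL index in the new shape and of a lookup that picks the model URL for a given Python major version. The `example.org` URLs and the `lookup` helper are hypothetical; only the nesting (`metadata` and `archives` map names to URLs, `models` maps names to a `python2`/`python3` pair) follows the schema above.

```python
import sys

# Hypothetical URL index shaped like the updated schema.
urls = {
    "metadata": {
        "hiv1-lanl-whole": "https://example.org/metadata/hiv1-lanl-whole.json",
    },
    "archives": {
        "hiv1": "https://example.org/archives/hiv1.zip",
    },
    "models": {
        "hiv1-mlp": {
            "python2": "https://example.org/models/py2/hiv1-mlp.mm-model",
            "python3": "https://example.org/models/py3/hiv1-mlp.mm-model",
        },
    },
}


def lookup(urls, filetype, filename, major=sys.version_info.major):
    """Return the URL for a file; models are keyed by Python major version."""
    entry = urls[filetype][filename]
    if filetype == "models":
        return entry["python{}".format(major)]
    return entry


print(lookup(urls, "models", "hiv1-mlp"))
```

Keeping one URL per interpreter version reflects the README note that models trained under Python 2 will not run under Python 3 and vice versa.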
2 changes: 1 addition & 1 deletion kameris/subcommands/classify.py
@@ -25,7 +25,7 @@ def run(args):
         model_url = args.model
     else:
         model_url = download_utils.url_for_file(args.model + '.mm-model',
-                                                args.urls_file, 'model')
+                                                args.urls_file, 'models')
     model_file = download_utils.open_url_cached(model_url, 'rb',
                                                 args.force_download)

9 changes: 7 additions & 2 deletions kameris/utils/download_utils.py
@@ -8,6 +8,7 @@
 import requests
 from ruamel.yaml import YAML
 from six.moves import urllib
+import sys
 from tqdm import tqdm

 from . import defaults, fs_utils, job_utils
@@ -53,7 +54,11 @@ def url_for_file(path, urls_file, filetype): # NOQA (cache line above)
         ))

     filename = os.path.splitext(os.path.basename(path))[0]
-    return urls[filetype][filename]
+    if filetype == 'models':
+        python_ver = 'python{}'.format(sys.version_info.major)
+        return urls[filetype][filename][python_ver]
+    else:
+        return urls[filetype][filename]


 def open_url_cached(url, mode, force_download=False):
@@ -63,7 +68,7 @@ def open_url_cached(url, mode, force_download=False):
                              'cache')
     fs_utils.mkdir_p(cache_dir)

-    cache_key = hashlib.md5(url).hexdigest()
+    cache_key = hashlib.md5(url.encode('utf-8')).hexdigest()
     cache_filename = os.path.join(cache_dir, cache_key)
     if not force_download and os.path.exists(cache_filename):
         log.info("file '%s' already downloaded, using cached version", url)
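
The `cache_key` change in the last hunk can be tried in isolation. This is a minimal sketch of the same idea, assuming the URL is a text string; the standalone `cache_key` function is for illustration and is not a function in `download_utils`.

```python
import hashlib


def cache_key(url):
    """Return a hex digest usable as a cache filename for `url`."""
    # hashlib works on bytes, so the text is encoded first. On Python 3,
    # passing a str directly to hashlib.md5 raises TypeError.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


url = ("https://raw.githubusercontent.com/stephensolis/kameris/"
       "master/demo/hiv1-genomes.zip")
print(cache_key(url))
```

Encoding explicitly gives the same digest for the same URL on every run, so a previously downloaded file is found again in the cache.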
