v1.0.0 release #1

Merged: 27 commits, Jan 11, 2024
45 changes: 45 additions & 0 deletions .github/workflows/python-app.yml
@@ -0,0 +1,45 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# Reference workflow provided by (c) GitHub
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: msannika_fdr

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.7', '3.8', '3.9', '3.10', '3.11', '3.12']

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Copy scripts and data to "/tests"
      run: |
        cp msannika_fdr.py tests
        cp data/DSSO_Crosslinks.xlsx .
        cp data/DSSO_CSMs.xlsx .
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest tests/tests.py
191 changes: 190 additions & 1 deletion README.md
@@ -1 +1,190 @@
# MSAnnika_FDR
![workflow_state](https://github.com/hgb-bin-proteomics/MSAnnika_FDR/workflows/msannika_fdr/badge.svg)

# MS Annika FDR

A script and functions to group and validate [MS Annika](https://github.com/hgb-bin-proteomics/MSAnnika)
results. The main use case is re-validating results after filtering or after
merging results from different MS Annika runs.

## Usage

- Install Python 3.7+: [https://www.python.org/downloads/](https://www.python.org/downloads/)
- Install requirements: `pip install -r requirements.txt`
- Export MS Annika results from Proteome Discoverer to Microsoft Excel format.
- Run `python msannika_fdr.py filename.xlsx -fdr 0.01` (see below for more examples).
- The script may take a few minutes, depending on the number of CSMs/crosslinks to process.
- Done!

## Examples

`msannika_fdr.py` takes one positional and one optional argument. The first
argument always has to be the filename(s) of the MS Annika result file(s). You
may specify any number of result files. Keep in mind, however, that
`msannika_fdr.py` processes these files separately. If you want to merge
several result files, check out [MS Annika Combine Results](https://github.com/hgb-bin-proteomics/MSAnnika_Combine_Results).
For demonstration purposes we will use the files supplied in the `/data` folder:
- `DSSO_Crosslinks.xlsx` contains unvalidated crosslinks from an MS Annika
search.
- `DSSO_CSMs.xlsx` contains unvalidated CSMs from an MS Annika search.

The following is a valid `msannika_fdr.py` call:

```bash
python msannika_fdr.py DSSO_Crosslinks.xlsx
```

This will not do anything because no FDR was given. You should see in the output
that the script skipped the file. However, doing the same with a CSM file
results in a different output:

```bash
python msannika_fdr.py DSSO_CSMs.xlsx
```

This will group the CSMs into crosslinks by sequence and position and write the
result to a file called `DSSO_CSMs_crosslinks.xlsx`.
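
The grouping step can be pictured as a pandas `groupby` over the sequence and position columns. The following is only a minimal sketch, not the script's actual implementation: the column names (`sequence_a`, `position_a`, `sequence_b`, `position_b`, `score`) and the aggregated columns are hypothetical and do not match the headers of a real MS Annika export.

```python
import pandas as pd

# Hypothetical column names; a real MS Annika export uses its own headers.
KEY_COLUMNS = ["sequence_a", "position_a", "sequence_b", "position_b"]


def group_csms_sketch(csms: pd.DataFrame) -> pd.DataFrame:
    """Collapse CSMs that share sequences and positions into one crosslink row."""
    return csms.groupby(KEY_COLUMNS, as_index=False, sort=False).agg(
        n_csms=("score", "size"),      # number of CSMs supporting the crosslink
        best_score=("score", "max"),   # best CSM score within the group
    )


# Toy data: the first two CSMs describe the same crosslink.
csms = pd.DataFrame(
    {
        "sequence_a": ["PEPKTIDE", "PEPKTIDE", "AKLR"],
        "position_a": [4, 4, 2],
        "sequence_b": ["KLMN", "KLMN", "GHKT"],
        "position_b": [1, 1, 3],
        "score": [120.5, 98.0, 75.2],
    }
)
crosslinks = group_csms_sketch(csms)
print(crosslinks)
```

In this toy input, three CSMs collapse into two crosslink rows.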

If you supply the optional argument `-fdr` or `--false_discovery_rate` and the
desired FDR as a floating point number, the results will be validated:

```bash
python msannika_fdr.py DSSO_Crosslinks.xlsx -fdr 0.01
```

This validates the input crosslinks at an estimated 1% FDR and generates a file
called `DSSO_Crosslinks_validated.xlsx` containing only the crosslinks that pass
the estimated 1% FDR threshold. Note that the following command will produce the
same output (FDR values >= 1 will automatically be divided by 100):

```bash
python msannika_fdr.py DSSO_Crosslinks.xlsx -fdr 1
```
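
The percentage-versus-fraction rule can be written down in a few lines. This is a sketch of the rule as stated above (values >= 1 are divided by 100), and the helper name `normalize_fdr` is hypothetical rather than a function exported by `msannika_fdr.py`.

```python
from typing import Optional


def normalize_fdr(fdr: Optional[float]) -> Optional[float]:
    """Return the FDR as a fraction; values >= 1 are read as percentages."""
    if fdr is None:
        # No FDR given: nothing to validate against.
        return None
    return fdr / 100.0 if fdr >= 1 else fdr


# Both spellings of "1% FDR" end up as the same fraction.
print(normalize_fdr(1))
print(normalize_fdr(0.01))
```

One consequence of this rule is that a fraction of exactly 1 (100% FDR) cannot be expressed, since `1` is interpreted as 1%.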

Validating a CSMs file works the same way:

```bash
python msannika_fdr.py DSSO_CSMs.xlsx -fdr 0.01
```

This validates the input CSMs at an estimated 1% FDR and generates a file
`DSSO_CSMs_validated.xlsx` containing only the CSMs that pass the estimated 1%
FDR threshold. Furthermore, it groups the input CSMs to crosslinks and writes
them to the file `DSSO_CSMs_crosslinks.xlsx`. It then validates those crosslinks
at an estimated 1% FDR and stores the result in
`DSSO_CSMs_crosslinks_validated.xlsx`.

You can also supply several files to the script like this:

```bash
python msannika_fdr.py DSSO_CSMs.xlsx DSSO_Crosslinks.xlsx -fdr 0.01
```

This will process the input files separately and sequentially and produce the
files mentioned above:
- `DSSO_Crosslinks_validated.xlsx`
- `DSSO_CSMs_validated.xlsx`
- `DSSO_CSMs_crosslinks.xlsx`
- `DSSO_CSMs_crosslinks_validated.xlsx`
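
The naming scheme from the examples can be summarized as a small function. This is a sketch derived only from the file names listed in this README. The helper `expected_outputs` and its `is_csm_file` flag are hypothetical, and how the script itself tells CSM files from crosslink files is not described here.

```python
from pathlib import Path
from typing import List, Optional


def expected_outputs(filename: str, is_csm_file: bool, fdr: Optional[float]) -> List[str]:
    """List the output files described in the README for one input file."""
    stem = Path(filename).stem
    outputs = []
    if fdr is not None:
        # Validation of the input itself (CSMs or crosslinks).
        outputs.append(f"{stem}_validated.xlsx")
    if is_csm_file:
        # CSMs are always grouped to crosslinks, with or without an FDR.
        outputs.append(f"{stem}_crosslinks.xlsx")
        if fdr is not None:
            outputs.append(f"{stem}_crosslinks_validated.xlsx")
    return outputs


print(expected_outputs("DSSO_CSMs.xlsx", is_csm_file=True, fdr=0.01))
print(expected_outputs("DSSO_Crosslinks.xlsx", is_csm_file=False, fdr=None))
```

A crosslink file without an FDR yields an empty list, matching the case where the script skips the file.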

## Parameters

```python
"""
DESCRIPTION:
A script to group and validate results from MS Annika searches.
USAGE:
msannika_fdr.py f [f ...]
                [-fdr FDR][--false_discovery_rate FDR]
                [-h][--help]
                [--version]
positional arguments:
  f                     MS Annika result files in Microsoft Excel format (.xlsx)
                        to process.
optional arguments:
  -fdr FDR, --false_discovery_rate FDR
                        False discovery rate to validate results for. Supports
                        both percentage input (e.g. 1) or fraction input (e.g.
                        0.01). By default not set and the input results will
                        just be grouped to crosslinks (if CSMs as input) or
                        nothing will be done (if crosslinks as input).
                        Default: None
  -h, --help            show this help message and exit
  --version             show program's version number and exit
"""
```
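
For readers who want to mirror this command line in their own wrapper, the help text above maps onto a standard `argparse` parser. The following is a reconstruction from the help text only, not the script's actual parser: the destination names `files` and `fdr` and the version string are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the arguments listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="msannika_fdr.py",
        description="A script to group and validate results from MS Annika searches.",
    )
    parser.add_argument(
        "files",  # assumed destination name
        metavar="f",
        nargs="+",
        help="MS Annika result files in Microsoft Excel format (.xlsx) to process.",
    )
    parser.add_argument(
        "-fdr",
        "--false_discovery_rate",
        dest="fdr",  # assumed destination name
        type=float,
        default=None,
        help="False discovery rate to validate results for.",
    )
    # Version string is a placeholder assumption.
    parser.add_argument("--version", action="version", version="%(prog)s 1.0.0")
    return parser


args = build_parser().parse_args(["DSSO_CSMs.xlsx", "DSSO_Crosslinks.xlsx", "-fdr", "0.01"])
print(args.files, args.fdr)
```

Leaving out `-fdr` gives `args.fdr` the value `None`, which corresponds to the "group only" or "do nothing" behaviour described in the help text.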

## Function Documentation

If you want to integrate the MS Annika FDR calculation into your own scripts,
you can import the following functions as given:

```python
import pandas as pd

crosslinks = pd.read_excel("DSSO_Crosslinks.xlsx")
csms = pd.read_excel("DSSO_CSMs.xlsx")

# Grouping CSMs to crosslinks
from msannika_fdr import MSAnnika_CSM_Grouper
Crosslinks_grouped_from_CSMs = MSAnnika_CSM_Grouper.group(csms)

# The function signature of MSAnnika_CSM_Grouper.group is:
def group(data: pd.DataFrame) -> pd.DataFrame:
    """code omitted"""
    return

# Validating CSMs for 0.01 FDR
from msannika_fdr import MSAnnika_CSM_Validator
Validated_CSMs = MSAnnika_CSM_Validator.validate(csms, 0.01)

# The function signature of MSAnnika_CSM_Validator.validate is:
def validate(data: pd.DataFrame, fdr: float) -> pd.DataFrame:
    """code omitted"""
    return

# Validating Crosslinks for 0.01 FDR
from msannika_fdr import MSAnnika_Crosslink_Validator
Validated_Crosslinks = MSAnnika_Crosslink_Validator.validate(crosslinks, 0.01)

# The function signature of MSAnnika_Crosslink_Validator.validate is:
def validate(data: pd.DataFrame, fdr: float) -> pd.DataFrame:
    """code omitted"""
    return
```
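
Both `validate` functions take a DataFrame and an FDR and return the filtered DataFrame. This README does not describe how the threshold is computed, so the following is only a generic target-decoy sketch of what "validating for an estimated FDR" can look like, and not necessarily what MS Annika FDR does. The column names `score` and `is_decoy` and the function name `validate_sketch` are hypothetical.

```python
import pandas as pd


def validate_sketch(data: pd.DataFrame, fdr: float) -> pd.DataFrame:
    """Keep the largest top-scoring set whose estimated FDR (decoys / targets) is <= fdr."""
    ranked = data.sort_values("score", ascending=False).reset_index(drop=True)
    decoys = ranked["is_decoy"].cumsum()
    targets = (~ranked["is_decoy"]).cumsum()
    # Estimated FDR at each rank; ranks without any target yet count as failing.
    estimated = (decoys / targets.where(targets > 0)).fillna(float("inf"))
    passing = estimated[estimated <= fdr]
    if passing.empty:
        return ranked.iloc[0:0]
    cutoff = passing.index.max()
    kept = ranked.iloc[: cutoff + 1]
    # Report targets only.
    return kept[~kept["is_decoy"]].reset_index(drop=True)


# Toy data: one decoy at rank 4 (hypothetical scores).
toy = pd.DataFrame(
    {
        "score": [10.0, 9.0, 8.0, 7.0, 6.0],
        "is_decoy": [False, False, False, True, False],
    }
)
print(validate_sketch(toy, 0.2))
```

The sketch expects `is_decoy` to be a boolean column. A stricter FDR moves the cutoff up the ranked list, so fewer rows are returned.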

## Known Issues

[List of known issues](https://github.com/hgb-bin-proteomics/MSAnnika_FDR/issues)

## Citing

If you are using the MS Annika FDR script please cite:
```
MS Annika 2.0 Identifies Cross-Linked Peptides in MS2–MS3-Based Workflows at High Sensitivity and Specificity
Micha J. Birklbauer, Manuel Matzinger, Fränze Müller, Karl Mechtler, and Viktoria Dorfer
Journal of Proteome Research 2023 22 (9), 3009-3021
DOI: 10.1021/acs.jproteome.3c00325
```

If you are using MS Annika please cite:
```
MS Annika 2.0 Identifies Cross-Linked Peptides in MS2–MS3-Based Workflows at High Sensitivity and Specificity
Micha J. Birklbauer, Manuel Matzinger, Fränze Müller, Karl Mechtler, and Viktoria Dorfer
Journal of Proteome Research 2023 22 (9), 3009-3021
DOI: 10.1021/acs.jproteome.3c00325
```
or
```
MS Annika: A New Cross-Linking Search Engine
Georg J. Pirklbauer, Christian E. Stieger, Manuel Matzinger, Stephan Winkler, Karl Mechtler, and Viktoria Dorfer
Journal of Proteome Research 2021 20 (5), 2560-2569
DOI: 10.1021/acs.jproteome.0c01000
```

## License

- [MIT](https://github.com/hgb-bin-proteomics/MSAnnika_FDR/blob/master/LICENSE)

## Contact

- [micha.birklbauer@fh-hagenberg.at](mailto:micha.birklbauer@fh-hagenberg.at)
Binary file added data/DSSO_CSMs.xlsx
Binary file not shown.
Binary file added data/DSSO_Crosslinks.xlsx
Binary file not shown.