---
output:
  github_document:
    toc: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "",
  out.width = "100%"
)
```
# RNAsum
`RNAsum` is an R package that can post-process, summarise and visualise
outputs primarily from [DRAGEN RNA][dragen-rna] pipelines.
Its main application is to complement whole-genome based findings and to provide additional evidence for detected
alterations.
[dragen-rna]: <https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html>
[umccrise]: <https://github.com/umccr/umccrise>
**DOCS**: <https://umccr.github.io/RNAsum>
## Installation
- **R** package can be installed directly from the [GitHub source][rnasum-gh]:
[rnasum-gh]: <https://github.com/umccr/RNAsum>
```r
remotes::install_github("umccr/RNAsum") # latest main commit
remotes::install_github("umccr/RNAsum@v0.0.X") # version 0.0.X
remotes::install_github("umccr/RNAsum@abcde") # commit abcde
remotes::install_github("umccr/RNAsum#123") # PR 123
```
- **Conda** package is available from the Anaconda [umccr channel][conda]:
[conda]: <https://anaconda.org/umccr/r-rnasum>
```bash
conda install r-rnasum==X.X.X -c umccr -c conda-forge -c bioconda
```
- **Docker** image is available from the [GitHub Container Registry][ghcr]:
[ghcr]: <https://github.com/umccr/RNAsum/pkgs/container/rnasum>
```bash
docker pull ghcr.io/umccr/rnasum:latest
```
## Workflow
The pipeline consists of five main components illustrated and briefly
described below. For more details, see [workflow.md](./inst/articles/workflow.md).
<img src="man/figures/RNAsum_workflow_updated.png" width="100%">
1. Collect patient **WTS data** from the [DRAGEN RNA][dragen-rna] pipeline
including per-gene read counts and gene fusions.
2. Add expression data from **[reference cohorts](#reference-data)** to
show how the genes of interest are expressed in other cancer patient
cohorts. The read counts are normalised, transformed and converted into a scale
that allows the patient's expression measurements to be presented in the
context of the reference cohorts.
3. Supply **genome-based findings** from whole-genome sequencing (WGS) data to
focus on genes of interest and to provide additional evidence for
dysregulation of mutated genes, or genes located within detected structural
variants (SVs) or copy-number (CN) altered regions. `RNAsum` is designed to
be compatible with WGS patient outputs generated from `umccrise`.
4. Collate results with knowledge derived from in-house resources and public
databases to provide additional evidence for the clinical significance
of altered genes, e.g. to flag variants with clinical significance or
potentially druggable targets.
5. The final product is an interactive HTML report with searchable tables and
plots presenting expression levels of the genes of interest. The report
consists of several sections described [here](./inst/articles/report_structure.md).
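To make step 2 more concrete, the sketch below places a single patient value on a scale relative to a reference cohort using a z-score. This is an illustration under stated assumptions only: the z-score is one common choice of scale and is not confirmed here as the exact transformation `RNAsum` applies, the input values are assumed to be already normalised and log-transformed, and the `zscore` helper is hypothetical (it is not part of the `RNAsum` CLI).

```bash
# Hypothetical helper (not part of RNAsum): z-score of a patient value
# relative to reference cohort values for the same gene.
# Assumption: all values are already normalised and log-transformed.
zscore() {
  patient="$1"   # patient's expression value for one gene
  shift          # remaining arguments: reference cohort values
  printf '%s\n' "$@" | awk -v x="$patient" '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      # A standard deviation needs at least two reference values.
      if (n < 2) { print "NA"; exit }
      mean = sum / n
      var = (sumsq - n * mean * mean) / (n - 1)   # sample variance
      if (var <= 0) { print "NA"; exit }
      printf "%.2f\n", (x - mean) / sqrt(var)
    }'
}
```

For example, `zscore 5 1 2 3 4 5` compares a patient value of 5 against five reference values. A positive result means the gene is expressed above the cohort mean, a negative one below it.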
## Reference data
The reference expression data are available for **33 cancer types** and were
derived from [external](#external-reference-cohorts)
([TCGA](https://tcga-data.nci.nih.gov/)) and [internal](#internal-reference-cohort)
([UMCCR](https://mdhs.unimelb.edu.au/centre-for-cancer-research/our-research/precision-oncology-research-group))
resources.
### External reference cohorts
In order to explore expression changes in the patient, we have built a
high-quality pancreatic cancer reference cohort.
Depending on the tissue from which the patient's sample was taken, one of
**33 cancer datasets** from TCGA can be used as a reference cohort for comparing
expression changes in genes of interest of the patient. Additionally, 10 samples
from each of the 33 TCGA datasets were combined to create the
**[Pan-Cancer dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**,
and for some cohorts **[extended sets](./inst/articles/tcga_projects_summary.md#extended-datasets)**
are also available. All available datasets are listed in the
**[TCGA projects summary table](./inst/articles/tcga_projects_summary.md)**. These datasets
have been processed using methods described in the
[TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data)
repository. The dataset of interest can be specified by using one of the
TCGA project IDs for the `RNAsum` `--dataset` argument (see
[Examples](#examples)).
### Internal reference cohort
The publicly available TCGA datasets are expected to demonstrate prominent
[batch effects](https://www.ncbi.nlm.nih.gov/pubmed/20838408) when compared to
the in-house WTS data due to differences in applied experimental procedures and
analytical pipelines. Moreover, TCGA data may include samples from tissue
material of lower quality and cellularity compared to samples processed using
local protocols. To address these issues, we have built a high-quality internal
reference cohort processed using the same pipelines as input data
(see [data pre-processing](./inst/articles/workflow.md#data-processing)).
This internal reference set of **40 pancreatic cancer samples** is based on WTS
data generated at **[UMCCR](https://research.unimelb.edu.au/centre-for-cancer-research/our-research/precision-oncology-research-group)**
and processed with the **bcbio-nextgen RNA-seq**
pipeline to minimise potential batch effects between investigated samples and
the reference cohort and to make sure the data are comparable. The internal
reference cohort assembly is summarised in the
[Pancreatic-data-harmonization](https://github.com/umccr/Pancreatic-data-harmonization/tree/master/expression/in-house)
repository.
**Note**
There are two rationales for using the internal reference cohort:
1. For **pancreatic cancer samples**, this cohort is used:
   - in ***batch effects correction***
   - as a reference point for ***comparing per-gene expression levels***
     observed in the data of the patient of interest and data from other
     pancreatic cancer patients.
2. For samples from **any cancer type**, the data from the internal
   reference cohort are used in the ***batch effects correction*** procedure
   performed to minimise technical variation in the data.
## Input data
`RNAsum` accepts [WTS](#wts) data processed by state-of-the-art bioinformatics
tools such as kallisto and salmon for quantification and Arriba for fusion calling.
`RNAsum` can also process and combine fusion output from Illumina's DRAGEN pipeline.
Additionally, the WTS data can be integrated with [WGS](#wgs)-based data processed
using the tools discussed in the section [WGS](#wgs).
In the latter case, the genome-based findings from the
corresponding sample are incorporated into the report and are used as a
primary source for expression profile prioritisation.
### WTS
The only required WTS input data are **read counts** provided in a
quantification file.
#### RNA
The table below lists all input data accepted in `RNAsum`:
| Input file | Tool | Example | Required |
| -------------------------------------- | ---------------------------------------------- | ----------------------------------------------- | ---------- |
| Quantified transcript **abundances** | [salmon][salmon] ([description][salmon-res]) | [*.quant.sf][salmon-ex] | **Yes** |
| Quantified gene **abundances** | [salmon][salmon] ([description][salmon-res]) | [*.quant.gene.sf][salmon-ex2] | **Yes** |
| **Fusion gene** list | [Arriba][arriba] | [fusions.tsv][arriba-res] | No |
| **Fusion gene** list | [DRAGEN RNA][dragen-rna] | [*.fusion_candidates.final][dragen-rna-ex] | No |
[salmon]: <https://salmon.readthedocs.io/en/latest/salmon.html>
[salmon-ex]: </inst/rawdata/test_data/dragen/TEST.quant.sf>
[salmon-ex2]: </inst/rawdata/test_data/dragen/TEST.quant.gene.sf>
[salmon-res]: <https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats>
[arriba]: <https://arriba.readthedocs.io/en/latest/>
[arriba-res]: </inst/rawdata/test_data/final/test_sample_WTS/arriba/fusions.tsv>
[dragen-rna]: <https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html>
[dragen-rna-ex]: </inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final>
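Before launching a run, it can help to confirm that a WTS directory actually contains the files listed in the table above. The following is a minimal sketch, not part of `RNAsum`: the function names `has_match` and `check_wts_inputs` are hypothetical, and the file patterns are taken from the table's "Example" column on the assumption that the quantification and fusion files sit directly in the given directory.

```bash
# Hypothetical pre-flight check (not part of RNAsum).
# Assumption: input files are located directly inside the given directory.

# Succeeds if at least one existing file matches the glob "$1/$2".
has_match() {
  for f in "$1"/$2; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Reports required and optional WTS inputs; fails if a required one is missing.
check_wts_inputs() {
  dir="$1"
  rc=0
  for pattern in '*.quant.sf' '*.quant.gene.sf'; do
    if has_match "$dir" "$pattern"; then
      echo "found     ${pattern}"
    else
      echo "MISSING   ${pattern} (required)"
      rc=1
    fi
  done
  for pattern in '*.fusion_candidates.final'; do
    if has_match "$dir" "$pattern"; then
      echo "found     ${pattern}"
    else
      echo "optional  ${pattern} (not found)"
    fi
  done
  return "$rc"
}
```

With the test data shipped in the repository, this could be called as `check_wts_inputs inst/rawdata/test_data/dragen`. A non-zero exit status indicates that a required quantification file is missing.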
### WGS
`RNAsum` is designed to be compatible with WGS outputs.
The table below lists all input data accepted in `RNAsum`:
| Input file | Tool | Example | Required |
|---------------------|------------------|---------------------------------------|----------|
| **SNVs/Indels** | [PCGR][pcgr] | [pcgr.snvs_indels.tiers.tsv][pcgr-ex] | No |
| **CNVs** | [PURPLE][purple] | [purple.cnv.gene.tsv][purple-ex] | No |
| **SVs** | [Manta][manta] | [sv-prioritize-manta.tsv][manta-ex] | No |
[pcgr]: <https://github.com/sigven/pcgr>
[pcgr-ex]: </inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv>
[purple]: <https://github.com/hartwigmedical/hmftools/tree/master/purple>
[purple-ex]: </inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv>
[manta]: <https://github.com/Illumina/manta>
[manta-ex]: </inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv>
## Usage
```{bash echo=TRUE, eval=FALSE}
rnasum_cli=$(Rscript -e 'x = system.file("cli", package = "RNAsum"); cat(x, "\n")' | xargs)
export PATH="${rnasum_cli}:${PATH}"
```
```{bash echo=FALSE}
rnasum_cli=$(Rscript -e 'x = system.file("cli", package = "RNAsum"); cat(x, "\n")' | xargs)
export PATH="${rnasum_cli}:${PATH}"
echo "$ rnasum.R --version"
rnasum.R --version
echo ""
echo "$ rnasum.R --help"
rnasum.R --help
echo ""
```
**Note**
Human reference genome
***[GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39)***
(*Ensembl* based annotation version ***105***) is used for gene annotation by
default. GRCh37 is no longer supported.
### Examples
Below are `RNAsum` CLI commands for generating HTML reports under different
data availability scenarios:
1. [WTS and WGS data](#1-wts-and-wgs-data)
2. [WTS data only](#2-wts-data-only)
3. [WTS, WGS and clinical data](#3-wts-wgs-and-clinical-data)
**Note**
* Example data is provided in the `/inst/rawdata/test_data` folder of the GitHub
[repo][rnasum-gh].
* The `RNAsum` runtime should be less than **15 minutes** using **16 GB of RAM**
and **1 CPU**.
#### 1. WTS and WGS data
This is the **most frequent and preferred case**, in which the
[WGS](#wgs)-based findings will be used as a primary source for expression
profile prioritisation. The genome-based results can be incorporated into the
report by specifying the location of the corresponding
output files (including results from `PCGR`, `PURPLE`, and `Manta`).
The **`Mutated genes`**, **`Structural variants`** and **`CN altered genes`** report sections will
contain information about expression levels of the mutated genes,
genes located within detected SVs and CN altered regions, respectively.
The results in the **`Fusion genes`** section will be ordered based on the
evidence from genome-based data. A subset of the TCGA pancreatic
adenocarcinoma dataset is used as the reference cohort (`--dataset TEST`).
```bash
rnasum.R \
  --sample_name test_sample_WTS \
  --dataset TEST \
  --dragen_wts_dir inst/rawdata/test_data/dragen \
  --report_dir inst/rawdata/test_data/dragen/RNAsum \
  --umccrise inst/rawdata/test_data/umccrised/test_sample_WGS \
  --save_tables FALSE
```
The HTML report `test_sample_WTS.RNAsum.html` will be created in the
`inst/rawdata/test_data/dragen/RNAsum` folder.
#### 2. WTS data only
In this scenario, only [WTS](#wts) data will be used and only expression levels
of key **[`Cancer genes`](https://github.com/umccr/umccrise/blob/master/workflow.md#key-cancer-genes)**,
**`Fusion genes`**, **`Immune markers`** and homologous recombination deficiency
genes (**`HRD genes`**) will be reported. Moreover, gene fusions reported in
the `Fusion genes` report section will not contain information about evidence
from genome-based data. A subset of the TCGA pancreatic adenocarcinoma dataset
is used as the reference cohort (`--dataset TEST`).
```bash
rnasum.R \
  --sample_name test_sample_WTS \
  --dataset TEST \
  --dragen_wts_dir inst/rawdata/test_data/dragen \
  --report_dir inst/rawdata/test_data/dragen/RNAsum \
  --save_tables FALSE
```
The output HTML report `test_sample_WTS.RNAsum.html` will be created in the
`inst/rawdata/test_data/dragen/RNAsum` folder.
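When the same WTS-only command has to be repeated for several samples, a small wrapper can reduce copy-and-paste errors. The sketch below is hypothetical and not part of `RNAsum`: the function name `rnasum_wts_only`, the `DRY_RUN` variable and the choice to write the report into an `RNAsum` subfolder of the WTS directory are assumptions made for illustration. Only the `rnasum.R` arguments shown in the example above are used.

```bash
# Hypothetical wrapper around the WTS-only command (not part of RNAsum).
# With DRY_RUN=1 (the default here) the command is printed, not executed.
rnasum_wts_only() {
  sample="$1"    # value passed to --sample_name
  wts_dir="$2"   # directory holding the sample's WTS files
  if [ "${DRY_RUN:-1}" = "1" ]; then
    runner="echo"
  else
    runner=""
  fi
  # $runner is intentionally unquoted so that an empty value disappears
  # and rnasum.R itself becomes the command.
  $runner rnasum.R \
    --sample_name "$sample" \
    --dataset TEST \
    --dragen_wts_dir "$wts_dir" \
    --report_dir "$wts_dir/RNAsum" \
    --save_tables FALSE
}
```

A loop such as `for s in sampleA sampleB; do rnasum_wts_only "$s" "wts/$s"; done` (sample names and the `wts/` layout are placeholders) first prints the commands for review. Setting `DRY_RUN=0` then executes them, provided `rnasum.R` is on the `PATH` as described in [Usage](#usage).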
#### 3. WTS, WGS and clinical data
For samples from subjects with available clinical information, a treatment
regimen timeline can be added to the HTML report by passing the location of the
relevant Excel spreadsheet to the `--clinical_info` argument (see the example
`inst/rawdata/test_data/test_clinical_data.xlsx`). The spreadsheet is expected
to contain at least one of the following columns: `NEOADJUVANT REGIMEN`,
`ADJUVANT REGIMEN`, `FIRST LINE REGIMEN`, `SECOND LINE REGIMEN` or
`THIRD LINE REGIMEN`, along with the `START` and `STOP` dates of the
corresponding treatments.
A subset of the TCGA pancreatic adenocarcinoma dataset is used as the reference
cohort (`--dataset TEST`).
```bash
# WTS, WGS and clinical data; paths are resolved relative to the current directory
rnasum.R \
  --sample_name test_sample_WTS \
  --dataset TEST \
  --dragen_wts_dir "$(pwd)/../rawdata/test_data/dragen" \
  --report_dir "$(pwd)/../rawdata/test_data/dragen/RNAsum" \
  --umccrise "$(pwd)/../rawdata/test_data/umccrised/test_sample_WGS" \
  --clinical_info "$(pwd)/../rawdata/test_data/test_clinical_data.xlsx" \
  --save_tables FALSE
```
The HTML report `test_sample_WTS.RNAsum.html` will be created in the
`../rawdata/test_data/dragen/RNAsum` folder.
### Output
The pipeline generates an HTML ***Patient Transcriptome Summary***
**[report](#report)** and a [results](#results) folder:
```text
|
|____<output>
  |____<SampleName>.<output>.html
  |____results
    |____exprTables
    |____glanceExprPlots
    |____...
```
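After a run, the layout above can be verified with a short script. This is a hedged sketch rather than an `RNAsum` feature: the function name `check_rnasum_output` is hypothetical, and the report file name `<SampleName>.RNAsum.html` is assumed from the examples in this README. Because it is not stated here whether the `results` folder is always written (for instance with `--save_tables FALSE`), its absence is reported as a note and not treated as a failure.

```bash
# Hypothetical post-run check (not part of RNAsum).
# Assumption: the report is named <sample>.RNAsum.html, as in the examples.
check_rnasum_output() {
  report_dir="$1"   # value that was passed to --report_dir
  sample="$2"       # value that was passed to --sample_name
  report="${report_dir}/${sample}.RNAsum.html"
  rc=0
  # -s: the report must exist and be non-empty.
  if [ -s "$report" ]; then
    echo "report:  ${report}"
  else
    echo "missing: ${report}"
    rc=1
  fi
  if [ -d "${report_dir}/results" ]; then
    echo "results: ${report_dir}/results"
  else
    echo "note:    no results folder in ${report_dir}"
  fi
  return "$rc"
}
```

For the first example this would be called as `check_rnasum_output inst/rawdata/test_data/dragen/RNAsum test_sample_WTS`.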
#### Report
The generated HTML report includes searchable tables and interactive plots
presenting expression levels of altered genes, as well as links to public
resources describing the genes of interest. The report consists of several
sections, including:
* Input data
* Clinical information\*
* Findings summary
* Mutated genes\*\*
* Fusion genes
* Structural variants\*\*
* CN altered genes\*\*
* Immune markers
* HRD genes
* Cancer genes
* Drug matching
\* if clinical information is available; see `--clinical_info` argument <br />
\*\* if genome-based results are available; see `--umccrise` argument
A detailed description of the report structure, including result prioritisation
and visualisation, is available [here](./inst/articles/report_structure.md).
#### Results
The `results` folder contains intermediate files, including plots and tables
that are presented in the HTML report.
#### Code of Conduct
The code of conduct can be accessed [here](./CODE_OF_CONDUCT.md).
#### Citation
To cite package ‘RNAsum’ in publications use:
> Kanwal S, Marzec J, Diakumis P, Hofmann O, Grimmond S (2024). “RNAsum: An R
package to comprehensively post-process, summarise and visualise genomics and
transcriptomics data.” version 1.1.0, <https://umccr.github.io/RNAsum/>.
A BibTeX entry for LaTeX users is
```
@Unpublished{,
title = {RNAsum: An R package to comprehensively post-process, summarise and visualise genomics and transcriptomics data},
author = {Sehrish Kanwal and Jacek Marzec and Peter Diakumis and Oliver Hofmann and Sean Grimmond},
year = {2024},
note = {version 1.1.0},
url = {https://umccr.github.io/RNAsum/},
}
```