survival analysis (breast cancer)

The two datasets that were used in the project are METABRIC and TCGA.

In this project, we mainly use the TCGA-BRCA dataset and this dataset includes nine disease types:

Adenomas and Adenocarcinomas
Adnexal and Skin Appendage Neoplasms
Basal Cell Neoplasms
Complex Epithelial Neoplasms
Cystic, Mucinous and Serous Neoplasms
Ductal and Lobular Neoplasms
Epithelial Neoplasms, NOS
Fibroepithelial Neoplasms
Squamous Cell Neoplasms

TCGA-BRCA

The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes to subjects from TCGA. The specific subsets of TCGA-BRCA are applied include clinical data, RNASeqGene, methylation data, and mutation data.

clinical data

The TCGA-BRCA clinical data include the followings of 1097 patients:

years_to_birth
vital_status
days_to_death
days_to_last_followup
tumor_tissue_site
pathologic_stage
gender
race
ethnicity

vital_status (diagnosis): the state or condition of being living or deceased; also includes the case where the vital status is unknown. days_to_last_followup (diagnosis): time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days.

Dataset	#patients	#alive	#deceased	median age*	age range*	gender (M:F)
TCGA-BRCA	1097	945**	152**	59	26-90	12:1085

* 15 patients declined to reveal their ages, so age data of 1082 patients are calculated.

** calculated from the vital_status in clinical data, which conforms to the GDC portal, but contradicts to Wenric2018.

RNA Sequencing

RNA-Seq, also called whole transcriptome shotgun sequencing, uses next-generation sequencing to reveal the presence the quantity of RNA in a biological sample at a given moment. Each sample contrains raw_counts, median_length_normalized, and RPKM for every gene.

RPKM (Reads Per Kilobase Million):

This metric attempts to normalize for sequencing depth and gene length. Read counts are normalized for differences in sequencing depth and normalized for gene size. How is how the RPKM is calculated.

count up the total reads in a sample and divide that number by 1,000,000 - this is our 'per million' scaling factor
divide the read counts by the 'per million' scaling factor, which normalizes for sequencing depth, giving you reads per million (RPM)
divide the RPM values by the length of the gene, in kilobases, which gives you RPKM

In this case, RPKM = {raw_counts / [sum(raw_counts) / 1000000]} / median_length_nomalized

$RPKM = \frac{raw_counts \times 1000000}{median_length_nomalized \times \sum_{} raw_counts}$

In the project, only RPKM data are retained.

Label RNA Sequencing data:

After labeling RNA sequencing data with vital_status in clincal data, the shape of processed Clinical data is (779, 18) and the shape of processed RNASeq data is (878, 20533+1), where 1 is the column of labels (0 or 1). Note here that the number of sample in RNASeq data is a little larger than that in Clinical data. This is because for some sample in RNASeq data, two pieces of samples belong to one patient. For instance, TCGA-A7-A13G is the code of the patient, but TCGA-A7-A13G-01A and TCGA-A7-A13G-11A co-exists. The underlying reason awaits to be explored.

Highlights in Olson2018

This paper, 'Data-driven advice for applying machine learning to bioinformatics problems', analyses 13 state-of-the-art commonly-used machine learning algorithms on a set of 165 classification problems. The 13 ML algorithms come from the sklearn implementations with individual hyperparameters, and the 165 classification problem datasets come from the Penn Machine Learning Benchmark (PMLB).

Authors concluded that both selecting the right ML algorithm and tuning its parameters are vitally importantly for most problems. Furthermore, they listed their top five recommended algorithms on 165 datasets. It is important to note that these algorithms and parameters will not work best on all supervised classification problems, and they should only be used as starting points. Here is the full list of top five recommendations:

GradientBoostingClassifier (51/16)
- loss = 'deviance'
- learning_rate = 0.1
- n_estimators = 500
- max_depth = 3
- max_features = 'log2'
RandomForestClassifier (19/165)
- n_estimators = 500
- max_features = 0.25
- criterion = 'entropy'
SVC (16/165)
- C = 0.01
- gamma = 0.1
- kernel = 'poly'
- degree = 3
- coef0 = 10.0
ExtraTreesClassifier (12/165)
- n_estimators = 1000
- max_features = 'log2'
- criterion = 'entropy'
LogisticRegression
- C = 1.5
- penalty = 'l1'
- fit_intercept = True

Survival Analysis

Traditionally, survival analysis was developed to measure lifespans of individuals. The analysis can be further applied to not just traditional births and deaths, but any duration. Medical professionals might be interested in the time between childbirths, where a birth in this case is the event of having a child, and a death is becoming pregnant again (obviously, we are loose with our definitions of birth and death).

At the time you want to make inferences about durations, it is possible that not all the death events have occured yet. The individuals in a population who have not been subject to the death event are labeled as right-censored, i.e., we can not view the rest of their life history due to some external circumstances. All the information we have on these individuals are their current lifetime durations, which is naturally less than their actual lifetimes. There is also left-censorship, where an individual's birth event is not seen.

A common mistake data analysts make a choosing to ignore the right-censored individuals. Survival analysis was originally developed to solve this type of problem, to deal with estimation when our data is right-censored. Even in the case where all events have been observed, i.e. no censorship, survival analysis is still a very useful tool to understand durations.

Comparison of classifiers' performance

classification accuracies

RNASeq	Random Forest (RFs)	Gradient Tree Boosting (GBDT)	XGBoost
raw count
RPKM

comparison of their performance using survival analysis

p-values of every gene signature, using RNA raw count as features and tested on KM Estimate p-values of every gene signature, using RNA RPKM as features and tested on KM Estimate

Statistics of p-values in above figures

the number of p-values < 0.01	Random Forests	Gradient Tree Boosting	XGBoost
raw counts	6	8	7
RPKM	2	12	9

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
docs		docs
results		results
results_RPKM		results_RPKM
src		src
test		test
.gitignore		.gitignore
README.md		README.md
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

survival analysis (breast cancer)

Table of Contents

TCGA-BRCA

clinical data

RNA Sequencing

RPKM (Reads Per Kilobase Million):

Label RNA Sequencing data:

Highlights in Olson2018

Survival Analysis

Comparison of classifiers' performance

About

Releases

Packages

Languages

ZihengZZH/survival-analysis

Folders and files

Latest commit

History

Repository files navigation

survival analysis (breast cancer)

Table of Contents

TCGA-BRCA

clinical data

RNA Sequencing

RPKM (Reads Per Kilobase Million):

Label RNA Sequencing data:

Highlights in Olson2018

Survival Analysis

Comparison of classifiers' performance

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages