The two datasets that were used in the project are METABRIC and TCGA.
In this project, we mainly use the TCGA-BRCA dataset and this dataset includes nine disease types:
- Adenomas and Adenocarcinomas
- Adnexal and Skin Appendage Neoplasms
- Basal Cell Neoplasms
- Complex Epithelial Neoplasms
- Cystic, Mucinous and Serous Neoplasms
- Ductal and Lobular Neoplasms
- Epithelial Neoplasms, NOS
- Fibroepithelial Neoplasms
- Squamous Cell Neoplasms
The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes to subjects from TCGA. The specific subsets of TCGA-BRCA are applied include clinical data, RNASeqGene, methylation data, and mutation data.
The TCGA-BRCA clinical data include the followings of 1097 patients:
- years_to_birth
- vital_status
- days_to_death
- days_to_last_followup
- tumor_tissue_site
- pathologic_stage
- gender
- race
- ethnicity
vital_status (diagnosis): the state or condition of being living or deceased; also includes the case where the vital status is unknown. days_to_last_followup (diagnosis): time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days.
Dataset | #patients | #alive | #deceased | median age* | age range* | gender (M:F) |
---|---|---|---|---|---|---|
TCGA-BRCA | 1097 | 945** | 152** | 59 | 26-90 | 12:1085 |
* 15 patients declined to reveal their ages, so age data of 1082 patients are calculated.
** calculated from the vital_status in clinical data, which conforms to the GDC portal, but contradicts to Wenric2018.
RNA-Seq, also called whole transcriptome shotgun sequencing, uses next-generation sequencing to reveal the presence the quantity of RNA in a biological sample at a given moment. Each sample contrains raw_counts, median_length_normalized, and RPKM for every gene.
This metric attempts to normalize for sequencing depth and gene length. Read counts are normalized for differences in sequencing depth and normalized for gene size. How is how the RPKM is calculated.
- count up the total reads in a sample and divide that number by 1,000,000 - this is our 'per million' scaling factor
- divide the read counts by the 'per million' scaling factor, which normalizes for sequencing depth, giving you reads per million (RPM)
- divide the RPM values by the length of the gene, in kilobases, which gives you RPKM
In this case, RPKM = {raw_counts / [sum(raw_counts) / 1000000]} / median_length_nomalized
In the project, only RPKM data are retained.
After labeling RNA sequencing data with vital_status in clincal data, the shape of processed Clinical data is (779, 18) and the shape of processed RNASeq data is (878, 20533+1), where 1 is the column of labels (0 or 1). Note here that the number of sample in RNASeq data is a little larger than that in Clinical data. This is because for some sample in RNASeq data, two pieces of samples belong to one patient. For instance, TCGA-A7-A13G is the code of the patient, but TCGA-A7-A13G-01A and TCGA-A7-A13G-11A co-exists. The underlying reason awaits to be explored.
Highlights in Olson2018
This paper, 'Data-driven advice for applying machine learning to bioinformatics problems', analyses 13 state-of-the-art commonly-used machine learning algorithms on a set of 165 classification problems. The 13 ML algorithms come from the sklearn implementations with individual hyperparameters, and the 165 classification problem datasets come from the Penn Machine Learning Benchmark (PMLB).
Authors concluded that both selecting the right ML algorithm and tuning its parameters are vitally importantly for most problems. Furthermore, they listed their top five recommended algorithms on 165 datasets. It is important to note that these algorithms and parameters will not work best on all supervised classification problems, and they should only be used as starting points. Here is the full list of top five recommendations:
- GradientBoostingClassifier (51/16)
- loss = 'deviance'
- learning_rate = 0.1
- n_estimators = 500
- max_depth = 3
- max_features = 'log2'
- RandomForestClassifier (19/165)
- n_estimators = 500
- max_features = 0.25
- criterion = 'entropy'
- SVC (16/165)
- C = 0.01
- gamma = 0.1
- kernel = 'poly'
- degree = 3
- coef0 = 10.0
- ExtraTreesClassifier (12/165)
- n_estimators = 1000
- max_features = 'log2'
- criterion = 'entropy'
- LogisticRegression
- C = 1.5
- penalty = 'l1'
- fit_intercept = True
Traditionally, survival analysis was developed to measure lifespans of individuals. The analysis can be further applied to not just traditional births and deaths, but any duration. Medical professionals might be interested in the time between childbirths, where a birth in this case is the event of having a child, and a death is becoming pregnant again (obviously, we are loose with our definitions of birth and death).
At the time you want to make inferences about durations, it is possible that not all the death events have occured yet. The individuals in a population who have not been subject to the death event are labeled as right-censored, i.e., we can not view the rest of their life history due to some external circumstances. All the information we have on these individuals are their current lifetime durations, which is naturally less than their actual lifetimes. There is also left-censorship, where an individual's birth event is not seen.
A common mistake data analysts make a choosing to ignore the right-censored individuals. Survival analysis was originally developed to solve this type of problem, to deal with estimation when our data is right-censored. Even in the case where all events have been observed, i.e. no censorship, survival analysis is still a very useful tool to understand durations.
classification accuracies
RNASeq | Random Forest (RFs) | Gradient Tree Boosting (GBDT) | XGBoost |
---|---|---|---|
raw count | |||
RPKM |
comparison of their performance using survival analysis
p-values of every gene signature, using RNA raw count as features and tested on KM Estimate p-values of every gene signature, using RNA RPKM as features and tested on KM Estimate
Statistics of p-values in above figures
the number of p-values < 0.01 | Random Forests | Gradient Tree Boosting | XGBoost |
---|---|---|---|
raw counts | 6 | 8 | 7 |
RPKM | 2 | 12 | 9 |