Commit

clarifying some language in the proposal
ttimbers committed Jan 19, 2020
1 parent 0fc4f3a commit 9ac2552
Showing 4 changed files with 43 additions and 30 deletions.
6 changes: 3 additions & 3 deletions README.Rmd
@@ -13,13 +13,13 @@ Demo of a data analysis project for DSCI 522 (Data Science workflows); a course

For this project we are trying to answer the question: given tumour image measurements, is a newly discovered tumour benign or malignant? Answering this question is important because traditional, non-data-driven methods for tumour diagnosis are quite subjective and can depend on the diagnosing physician's skill as well as experience [@Streetetal]. Furthermore, benign tumours are not normally dangerous; the cells stay in the same place and the tumour stops growing before it gets very large. By contrast, in malignant tumours, the cells invade the surrounding tissue and spread into nearby organs where they can cause serious damage. Thus, it is important to quickly and accurately diagnose the tumour type to guide patient treatment.

The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison [@Streetetal]. It was sourced from the UCI Machine Learning Repository [@Dua2019] and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians.
The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison [@Streetetal]. It was sourced from the UCI Machine Learning Repository [@Dua2019] and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents summary statistics from measurements of an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians.

To answer the predictive question posed above, we plan to build a predictive classification model. Before building our model we will partition the data into a training and test set (split 75%:25%) and perform exploratory data analysis to assess whether there is a strong class imbalance problem that we might need to address, as well as explore whether there are any predictors whose distribution looks very similar between the two classes, and thus we might omit from our analysis. The class counts will be presented as a table and used to inform whether we think there is a class imbalance problem. The predictor distributions across classes will be plotted as facetted (by predictor) ridge plots where the densities are coloured by class.
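
As a rough illustration of the class-count check described in the paragraph above, the sketch below tabulates a made-up label vector. The counts (60 and 40) and the object names `diagnosis`, `class_counts` and `class_props` are hypothetical and are not taken from the project's data or code.

```r
# Illustrative only: made-up labels, not the real diagnosis column.
diagnosis <- factor(c(rep("B", 60), rep("M", 40)), levels = c("B", "M"))

# Counts per class, and the same counts expressed as proportions.
class_counts <- table(diagnosis)
class_props <- prop.table(class_counts)

print(class_counts)
print(round(class_props, 2))
```

Proportions far from an even split would be one signal that class imbalance may need attention before modelling.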

Given that all measurements are continuous in nature, and the outcome we are trying to predict is one of two classes, one suitable and simple we plan to first explore is using a k-nearest neighbours classification algorithm. With this algorithm, we will have to choose $K$ the number of nearest neighbours to use for prediction. We will choose $K$ via cross-validation using ~ 30 folds as this Wisconsin Breast Cancer data set is not very large and has only 569 observations. We will use overall accuracy to choose $K$. A line plot of overall accuracy versus $K$ will be included as part of the final report for this project.
Given that all measurements are continuous in nature, and the outcome we are trying to predict is one of two classes, one suitable and simple approach that we plan to first explore is using a k-nearest neighbours classification algorithm. With this algorithm, we will have to choose K, the number of nearest neighbours to use for prediction. We will choose K via cross-validation using ~ 30 folds because this Wisconsin Breast Cancer data set is not very large, having only 569 observations. We will use overall accuracy to choose K. A line plot of overall accuracy versus $K$ will be included as part of the final report for this project.

After settling on our final model, we will re-fit the model on the entire training data set, and then evaluate its performance on the test data set. At this point we will look at overall accuracy as well as misclassification errors (from the confusion matrix) to assess prediction performance. These values will be reported as a table in the final report.
After selecting our final model, we will re-fit the model on the entire training data set, and then evaluate its performance on the test data set. At this point we will look at overall accuracy as well as misclassification errors (from the confusion matrix) to assess prediction performance. These values will be reported as a table in the final report.
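
The plan in the two paragraphs above (75%:25% split, choosing K by 30-fold cross-validation on overall accuracy, then evaluating on the test set) could be sketched roughly as follows. This is not the project's code: it runs on synthetic two-predictor data rather than `wdbc.data`, and the names `toy`, `k_grid`, `cv_accuracy` and `best_k`, the grid of candidate K values, and the standardisation step are all assumptions made for illustration. It uses `knn()` from the `class` package.

```r
# Sketch only: synthetic stand-in data, NOT the wdbc.data file.
library(class)  # provides knn()

set.seed(522)

# Hypothetical two-class data with two continuous predictors.
n <- 200
toy <- data.frame(
  x1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  x2 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  diagnosis = factor(rep(c("B", "M"), each = n / 2))
)

# 75%:25% train/test split.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Assumption (not stated in the proposal): standardise predictors using
# training-set statistics. For simplicity this is done once, before
# cross-validation, rather than separately within each fold.
predictors <- c("x1", "x2")
centre <- colMeans(train[, predictors])
spread <- apply(train[, predictors], 2, sd)
scale_with <- function(df) scale(df[, predictors], center = centre, scale = spread)
train_x <- scale_with(train)
test_x <- scale_with(test)

# Randomly assign each training row to one of n_folds folds.
n_folds <- 30
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(train)))

# Mean cross-validated accuracy for one value of K.
cv_accuracy <- function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn(train_x[!held_out, , drop = FALSE],
                train_x[held_out, , drop = FALSE],
                cl = train$diagnosis[!held_out], k = k)
    mean(pred == train$diagnosis[held_out])
  })
  mean(fold_acc)
}

# Hypothetical grid of odd K values (odd K avoids tied votes with two classes).
k_grid <- seq(1, 21, by = 2)
cv_results <- data.frame(k = k_grid, accuracy = sapply(k_grid, cv_accuracy))
best_k <- cv_results$k[which.max(cv_results$accuracy)]

# Use the full training set with the chosen K and evaluate on the test set.
test_pred <- knn(train_x, test_x, cl = train$diagnosis, k = best_k)
test_accuracy <- mean(test_pred == test$diagnosis)
confusion <- table(predicted = test_pred, actual = test$diagnosis)
```

Here `cv_results` holds the values that a line plot of accuracy versus K would display, and `confusion` gives the misclassification counts alongside `test_accuracy`.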

Thus far we have performed some exploratory data analysis, and the report for that can be found [here](src/breast_cancer_eda.md).

33 changes: 17 additions & 16 deletions README.md
@@ -31,10 +31,11 @@ Repository (Dua and Graff 2017) and can be found
[here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+\(Diagnostic\)),
specifically [this
file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data).
Each row in the data set represents an image of a tumour sample,
including the diagnosis (benign or malignant) and several other
measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis
for each image was conducted by physicians.
Each row in the data set represents summary statistics from measurements
of an image of a tumour sample, including the diagnosis (benign or
malignant) and several other measurements (e.g., nucleus texture,
perimeter, area, etc.). Diagnosis for each image was conducted by
physicians.

To answer the predictive question posed above, we plan to build a
predictive classification model. Before building our model we will
@@ -49,18 +50,18 @@ predictor distributions across classes will be plotted as facetted (by
predictor) ridge plots where the densities are coloured by class.

Given that all measurements are continuous in nature, and the outcome we
are trying to predict is one of two classes, one suitable and simple we
plan to first explore is using a k-nearest neighbours classification
algorithm. With this algorithm, we will have to choose \(K\) the number
of nearest neighbours to use for prediction. We will choose \(K\) via
cross-validation using ~ 30 folds as this Wisconsin Breast Cancer data
set is not very large and has only 569 observations. We will use overall
accuracy to choose \(K\). A line plot of overall accuracy versus \(K\)
will be included as part of the final report for this project.

After settling on our final model, we will re-fit the model on the
entire training data set, and then evaluate its performance on the test
data set. At this point we will look at overall accuracy as well as
are trying to predict is one of two classes, one suitable and simple
approach that we plan to first explore is using a k-nearest neighbours
classification algorithm. With this algorithm, we will have to choose K,
the number of nearest neighbours to use for prediction. We will choose K
via cross-validation using ~ 30 folds because this Wisconsin Breast
Cancer data set is not very large, having only 569 observations. We will
use overall accuracy to choose K. A line plot of overall accuracy versus
\(K\) will be included as part of the final report for this project.

After selecting our final model, we will re-fit the model on the entire
training data set, and then evaluate its performance on the test data
set. At this point we will look at overall accuracy as well as
misclassification errors (from the confusion matrix) to assess
prediction performance. These values will be reported as a table in the
final report.
3 changes: 2 additions & 1 deletion src/breast_cancer_eda.Rmd
@@ -55,7 +55,8 @@ colnames(bc_data) <- c("id",
n_nas <- nrow(bc_data) - (drop_na(bc_data) %>% tally())
```

The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison. It was sourced from the UCI Machine Learning Repository [@Dua2019] and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians. There are `r nrow(bc_data)` observations in the data set, and `r ncol(bc_data) - 1` features. There are `r n_nas` observations with missing values in the data set. Below we show the number of observations for each of the classes in the data set.

The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison [@Streetetal]. It was sourced from the UCI Machine Learning Repository [@Dua2019] and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents summary statistics from measurements of an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians. There are `r nrow(bc_data)` observations in the data set, and `r ncol(bc_data) - 1` features. There are `r n_nas` observations with missing values in the data set. Below we show the number of observations for each of the classes in the data set.

```{r class counts}
kable(summarise(bc_data,
31 changes: 21 additions & 10 deletions src/breast_cancer_eda.md
@@ -5,19 +5,20 @@ Exploratory data analysis of the Wisconsin Breast Cancer data set

The data set used in this project is of digitized breast cancer image
features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L.
Mangasarian at the University of Wisconsin, Madison. It was sourced from
the UCI Machine Learning Repository (Dua and Graff 2017) and can be
found
Mangasarian at the University of Wisconsin, Madison (Street, Wolberg,
and Mangasarian 1993). It was sourced from the UCI Machine Learning
Repository (Dua and Graff 2017) and can be found
[here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+\(Diagnostic\)),
specifically [this
file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data).
Each row in the data set represents an image of a tumour sample,
including the diagnosis (benign or malignant) and several other
measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis
for each image was conducted by physicians. There are 569 observations
in the data set, and 31 features. There are 0 observations with missing
values in the data set. Below we show the number of observations
for each of the classes in the data set.
Each row in the data set represents summary statistics from measurements
of an image of a tumour sample, including the diagnosis (benign or
malignant) and several other measurements (e.g., nucleus texture,
perimeter, area, etc.). Diagnosis for each image was conducted by
physicians. There are 569 observations in the data set, and 31 features.
There are 0 observations with missing values in the data set. Below we
show the number of observations for each of the classes in the data
set.

| Benign cases | Malignant cases |
| -----------: | --------------: |
@@ -80,4 +81,14 @@ Sciences. <http://archive.ics.uci.edu/ml>.

</div>

<div id="ref-Streetetal">

Street, W. Nick, W. H. Wolberg, and O. L. Mangasarian. 1993. “Nuclear
feature extraction for breast tumor diagnosis.” In *Biomedical Image
Processing and Biomedical Visualization*, edited by Raj S. Acharya and
Dmitry B. Goldgof, 1905:861–70. International Society for Optics;
Photonics; SPIE. <https://doi.org/10.1117/12.148698>.

</div>

</div>
