From 9ac2552f99e6be851a5448957c4b8a8e42aa6ba5 Mon Sep 17 00:00:00 2001
From: ttimbers
Date: Sat, 18 Jan 2020 23:28:48 -0800
Subject: [PATCH] clarifying some language in the proposal

---
 README.Rmd                |  6 +++---
 README.md                 | 33 +++++++++++++++++----------------
 src/breast_cancer_eda.Rmd |  3 ++-
 src/breast_cancer_eda.md  | 31 +++++++++++++++++++++----------
 4 files changed, 43 insertions(+), 30 deletions(-)

diff --git a/README.Rmd b/README.Rmd
index db80bbe..5fa6bf0 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -13,13 +13,13 @@ Demo of a data analysis project for DSCI 522 (Data Science workflows); a course
 
 For this project we are trying to answer the question: given tumour image measurements is a newly discovered tumour benign or malignant? Answering this question is important because traditional, non-data-driven methods for tumour diagnosis are quite subjective and can depend on the diagnosing physicians skill as well as experience [@Streetetal]. Furthermore, benign tumours are not normally dangerous; the cells stay in the same place and the tumour stops growing before it gets very large. By contrast, in malignant tumours, the cells invade the surrounding tissue and spread into nearby organs where they can cause serious damage. Thus, it is important to quickly and accurately diagnose the tumour type to guide patient treatment.
 
-The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison [@Streetetal]. It was sourced from the UCI Machine Learning Repository [@Dua2019] and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians.
+The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison [@Streetetal]. It was sourced from the UCI Machine Learning Repository [@Dua2019] and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents summary statistics from measurements of an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians.
 
 To answer the predictive question posed above, we plan to build a predictive classification model. Before building our model we will partition the data into a training and test set (split 75%:25%) and perform exploratory data analysis to assess whether there is a strong class imbalance problem that we might need to address, as well as explore whether there are any predictors whose distribution looks very similar between the two classes, and thus we might omit from our analyis. The class counts will be presented as a table and used to inform whether we think there is a class imbalance problem. The predictor distributions across classes will be plotted as facetted (by predictor) ridge plots where the densities are coloured by class.
 
-Given that all measurements are continous in nature, and the outcome we are trying to predict is one of two classes, one suitable and simple we plan to first explore is using a k-nearest neighbours classification algorithm. With this algorithm, we will have to choose $K$ the number of nearest neighbours to use for prediction. We will choose $K$ via cross-validation using ~ 30 folds as this Wisconsin Breast Cancer data set is not very large and has only 569 observations. We will use overall accuracy to choose $K$. A line plot of overall accuracy versus $K$ will be included as part of the final report for this project.
+Given that all measurements are continous in nature, and the outcome we are trying to predict is one of two classes, one suitable and simple approach that we plan to first explore is using a k-nearest neighbours classification algorithm. With this algorithm, we will have to choose K, the number of nearest neighbours to use for prediction. We will choose K via cross-validation using ~ 30 folds because this Wisconsin Breast Cancer data set is not very large, having only 569 observations. We will use overall accuracy to choose K. A line plot of overall accuracy versus $K$ will be included as part of the final report for this project.
 
-After settling on our final model, we will re-fit the model on the entire training data set, and then evaulate it's performance on the test data set. At this point we will look at overall accuracy as well as misclassification errors (from the confusion matrix) to assess prediction performance. These values will be reported as a table in the final report.
+After selecting our final model, we will re-fit the model on the entire training data set, and then evaulate it's performance on the test data set. At this point we will look at overall accuracy as well as misclassification errors (from the confusion matrix) to assess prediction performance. These values will be reported as a table in the final report.
 
 Thus far we have performed some exploratory data analysis, and the report for that can be found [here](src/breast_cancer_eda.md).
 
diff --git a/README.md b/README.md
index 73236a1..6a7049e 100644
--- a/README.md
+++ b/README.md
@@ -31,10 +31,11 @@ Repository (Dua and Graff 2017) and can be found
 [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+\(Diagnostic\)),
 specifically [this
 file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data).
-Each row in the data set represents an image of a tumour sample,
-including the diagnosis (benign or malignant) and several other
-measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis
-for each image was conducted by physicians.
+Each row in the data set represents summary statistics from measurements
+of an image of a tumour sample, including the diagnosis (benign or
+malignant) and several other measurements (e.g., nucleus texture,
+perimeter, area, etc.). Diagnosis for each image was conducted by
+physicians.
 
 To answer the predictive question posed above, we plan to build a
 predictive classification model. Before building our model we will
@@ -49,18 +50,18 @@ predictor distributions across classes will be plotted as facetted (by
 predictor) ridge plots where the densities are coloured by class.
 
 Given that all measurements are continous in nature, and the outcome we
-are trying to predict is one of two classes, one suitable and simple we
-plan to first explore is using a k-nearest neighbours classification
-algorithm. With this algorithm, we will have to choose \(K\) the number
-of nearest neighbours to use for prediction. We will choose \(K\) via
-cross-validation using ~ 30 folds as this Wisconsin Breast Cancer data
-set is not very large and has only 569 observations. We will use overall
-accuracy to choose \(K\). A line plot of overall accuracy versus \(K\)
-will be included as part of the final report for this project.
-
-After settling on our final model, we will re-fit the model on the
-entire training data set, and then evaulate it’s performance on the test
-data set. At this point we will look at overall accuracy as well as
+are trying to predict is one of two classes, one suitable and simple
+approach that we plan to first explore is using a k-nearest neighbours
+classification algorithm. With this algorithm, we will have to choose K,
+the number of nearest neighbours to use for prediction. We will choose K
+via cross-validation using ~ 30 folds because this Wisconsin Breast
+Cancer data set is not very large, having only 569 observations. We will
+use overall accuracy to choose K. A line plot of overall accuracy versus
+\(K\) will be included as part of the final report for this project.
+
+After selecting our final model, we will re-fit the model on the entire
+training data set, and then evaulate it’s performance on the test data
+set. At this point we will look at overall accuracy as well as
 misclassification errors (from the confusion matrix) to assess
 prediction performance. These values will be reported as a table in the
 final report.
diff --git a/src/breast_cancer_eda.Rmd b/src/breast_cancer_eda.Rmd
index 08ad782..a68547a 100644
--- a/src/breast_cancer_eda.Rmd
+++ b/src/breast_cancer_eda.Rmd
@@ -55,7 +55,8 @@ colnames(bc_data) <- c("id",
 
 n_nas <- nrow(bc_data) - (drop_na(bc_data) %>% tally())
 ```
-The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison. It was sourced from the UCI Machine Learning Repository [@Dua2019] and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians. There are `r nrow(bc_data)` observations in the data set, and `r ncol(bc_data) - 1` features. There are `r n_nas` observations with missing values in the data set. Below we show the number of each observations for each of the classes in the data set.
+
+The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison [@Streetetal]. It was sourced from the UCI Machine Learning Repository [@Dua2019] and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents summary statistics from measurements of an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians. There are `r nrow(bc_data)` observations in the data set, and `r ncol(bc_data) - 1` features. There are `r n_nas` observations with missing values in the data set. Below we show the number of each observations for each of the classes in the data set.
 
 ```{r class counts}
 kable(summarise(bc_data,
diff --git a/src/breast_cancer_eda.md b/src/breast_cancer_eda.md
index f0c0768..9573954 100644
--- a/src/breast_cancer_eda.md
+++ b/src/breast_cancer_eda.md
@@ -5,19 +5,20 @@ Exploratory data analysis of the Wisconsin Breast Cancer data set
 
 The data set used in this project is of digitized breast cancer image
 features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L.
-Mangasarian at the University of Wisconsin, Madison. It was sourced from
-the UCI Machine Learning Repository (Dua and Graff 2017) and can be
-found
+Mangasarian at the University of Wisconsin, Madison (Street, Wolberg,
+and Mangasarian 1993). It was sourced from the UCI Machine Learning
+Repository (Dua and Graff 2017) and can be found
 [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+\(Diagnostic\)),
 specifically [this
 file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data).
-Each row in the data set represents an image of a tumour sample,
-including the diagnosis (benign or malignant) and several other
-measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis
-for each image was conducted by physicians. There are 569 observations
-in the data set, and 31 features. There are 0 observations with missing
-values in the data set. Below we show the number of each observations
-for each of the classes in the data set.
+Each row in the data set represents summary statistics from measurements
+of an image of a tumour sample, including the diagnosis (benign or
+malignant) and several other measurements (e.g., nucleus texture,
+perimeter, area, etc.). Diagnosis for each image was conducted by
+physicians. There are 569 observations in the data set, and 31 features.
+There are 0 observations with missing values in the data set. Below we
+show the number of each observations for each of the classes in the data
+set.
 
 | Benign cases | Malignant cases |
 | -----------: | --------------: |
@@ -80,4 +81,14 @@ Sciences. .
+
+
+Street, W. Nick, W. H. Wolberg, and O. L. Mangasarian. 1993. “Nuclear
+feature extraction for breast tumor diagnosis.” In *Biomedical Image
+Processing and Biomedical Visualization*, edited by Raj S. Acharya and
+Dmitry B. Goldgof, 1905:861–70. International Society for Optics;
+Photonics; SPIE. .
+
+
+