This project investigates the Wisconsin Breast Cancer dataset, with a focus on the following:
- Analyse and review the dataset, and present an overview and background.
- Provide a literature review of classifiers that have been applied to the dataset and compare their performance.
- Present a statistical analysis of the dataset.
- Train a set of classifiers on the dataset using a range of machine learning algorithms (using SKLearn etc.) and present the classification performance results.
- Detail the rationale for the parameter selections made while training the classifiers.
- Compare, contrast and critique the results with reference to the literature.
- Discuss and investigate how the dataset could be extended by synthesising new tumour datapoints.
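As a rough illustration of the classifier-training step, the sketch below fits two scikit-learn classifiers and reports test accuracy. It makes several assumptions that are not part of the project itself: it uses the copy of the Wisconsin *Diagnostic* dataset bundled with scikit-learn (`load_breast_cancer`, 569 instances and 30 features) rather than the original 699-instance file, and the split size, `random_state`, `n_estimators` and `n_neighbors` values are placeholders, not the parameters chosen in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Diagnostic dataset stands in for the original data file.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder hyperparameters, chosen only for illustration.
classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # k-NN is distance based, so the features are standardised first.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Accuracy on a single split is only a starting point; cross-validation and metrics such as recall on the malignant class would give a fuller comparison with the literature.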
Title: Wisconsin Breast Cancer Database (January 8, 1991)
Source: Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA
Additional Sources:
- O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
- William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
- O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
- K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).
- Data Set Characteristics: Multivariate
- Number of Instances: 699
- Area: Life
- Attribute Characteristics: Integer
- Number of Attributes: 10
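The following is a minimal sketch of how rows in the original data file could be parsed using only the standard library. The layout is an assumption taken from the UCI attribute description (reference [09]), not from this README: each row is a comma-separated record of a sample ID, nine integer attributes scored 1 to 10, and a class code (2 for benign, 4 for malignant), with `?` marking a missing value. The column names and the three sample rows are made up for illustration and are not real records.

```python
import csv
import io

# Assumed column order, following the UCI attribute description.
COLUMNS = [
    "sample_id", "clump_thickness", "cell_size_uniformity",
    "cell_shape_uniformity", "marginal_adhesion", "epithelial_cell_size",
    "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

# Illustrative rows only (hypothetical values in the assumed file format).
SAMPLE = """\
1000001,5,1,1,1,2,1,3,1,1,2
1000002,8,10,10,8,7,10,9,7,1,4
1000003,6,8,8,1,3,?,3,7,1,4
"""


def parse_rows(text):
    """Parse comma-separated records, mapping '?' to None and the rest to int."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        values = [None if v == "?" else int(v) for v in record]
        rows.append(dict(zip(COLUMNS, values)))
    return rows


rows = parse_rows(SAMPLE)
# Translate the assumed class codes into readable labels.
labels = ["malignant" if r["class"] == 4 else "benign" for r in rows]
# Count missing entries so they can be imputed or dropped before training.
missing = sum(1 for r in rows for v in r.values() if v is None)
print(labels, missing)
```

Counting missing values up front makes the later choice between dropping and imputing those rows explicit.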
To run this on your PC, you need the following:
Install Anaconda (https://www.anaconda.com/products/individual). This distribution includes Python and several packages used in this assignment, including the numpy package.
Install Jupyter: https://jupyter.org/ to run numpy-random.ipynb
Github: https://github.com/AndrewShanahan/PfDA_2
[01] https://stackoverflow.com/questions/17096311/why-do-i-need-to-explicitly-push-a-new-branch/17096880#17096880
[02] https://www.educative.io/answers/the-fatal-refusing-to-merge-unrelated-histories-git-error
[03] https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/
[04] Dataset info/description: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%290
[05] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29
[06] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)
[07] https://data.world/health/breast-cancer-wisconsin
[08] Importing dataset and troubleshooting: https://stackoverflow.com/questions/31797013/how-to-open-a-data-file-extension
[09] Attribute Information: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29
[10] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
[11] https://www2.hse.ie/conditions/breast-cancer-women/?gclid=CjwKCAiAnZCdBhBmEiwA8nDQxV9FZLIuR4GMAzCaJFwNTvHQGzP8oK-LCGZ-jOYXBTyNlzBNjKMK6RoCzLkQAvD_BwE&gclsrc=aw.ds
[12] https://www.cancer.ie/cancer-information-and-support/cancer-types/breast-cancer
[13] https://www.who.int/news-room/fact-sheets/detail/breast-cancer
[14] Lavanya, D. (05/11/2011) ‘Analysis of feature selection with classification: Breast Cancer Datasets’, Indian Journal of Computer Science and Engineering, ISSN : 0976-5166, P. xxx. Available at: http://ijcse.com/docs/INDJCSE11-02-05-167.pdf (Accessed: 29/12/2022).
[15] Siham, M et al. (11/07/2020) 'Analysis of Breast Cancer Detection Using Different Machine Learning Techniques', Communications in Computer and Information Science, ISSN 1234, P. xxx. Available at: https://link.springer.com/chapter/10.1007/978-981-15-7205-0_10#Sec2 (Accessed: 01/01/2022).
[16] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[17] https://scikit-learn.org/stable/modules/neighbors.html
[18] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
[19] Datacamp - numerous courses/tracks completed over last number of months have supported this exercise: https://www.datacamp.com/
[20] Udemy course: https://www.udemy.com/course/the-modern-python3-bootcamp/learn/lecture/8680110?start=94#overview
[21] W3schools - Resource used on regular basis: https://www.w3schools.com/
[22] Stackoverflow - Resource used to help troubleshoot problems and help with coding: https://stackoverflow.com/
[23] matplotlib.pyplot: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html
[24] pandas: https://pandas.pydata.org/
[25] numpy: https://numpy.org/doc/stable/index.html