- Python with ML Libraries installed
- VS Code environment
Source - Kaggle
- ID number
- Diagnosis (M = malignant, B = benign)
- Fields 3–32: ten real-valued features are computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
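
The field numbering described above can be checked with a short sketch. It assumes the layout implied by the example (all ten means first, then the ten standard errors, then the ten "worst" values); the names `BASE_FEATURES`, `STATISTICS` and `field_number` are illustrative and not part of the dataset.

```python
# The ten base measurements, in the order listed above.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]

# Assumed block order: means, then standard errors, then "worst" values.
STATISTICS = ["mean", "se", "worst"]

# Fields 1 and 2 hold the ID number and the diagnosis.
FIRST_FEATURE_FIELD = 3


def field_number(feature: str, statistic: str) -> int:
    """Return the 1-based field number of a feature/statistic pair."""
    block = STATISTICS.index(statistic)    # 0, 1 or 2
    offset = BASE_FEATURES.index(feature)  # 0 to 9
    return FIRST_FEATURE_FIELD + block * len(BASE_FEATURES) + offset


# Map every field number to a readable label.
all_fields = {
    field_number(feature, statistic): f"{feature} ({statistic})"
    for statistic in STATISTICS
    for feature in BASE_FEATURES
}

print(len(all_fields))                   # 30
print(field_number("radius", "mean"))    # 3
print(field_number("radius", "se"))      # 13
print(field_number("radius", "worst"))   # 23
```

Ten features times three statistics gives the 30 feature fields, which occupy fields 3 through 32.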
All feature values are recoded with four significant digits.
Headers of the dataset:
- Exploratory data analysis was performed using Pandas. The column with missing values was dropped.
- The categorical variable (Diagnosis) was converted to numerical values.
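
The two preprocessing steps above might look roughly like the sketch below. It runs on a tiny hand-made DataFrame rather than the Kaggle file. The column names (`id`, `diagnosis`, `radius_mean`, `Unnamed: 32`), the sample values, and the choice of M = 1 / B = 0 are all assumptions made for illustration.

```python
import pandas as pd

# Tiny stand-in for the Kaggle CSV; column names and values are assumptions.
df = pd.DataFrame(
    {
        "id": [101, 102, 103, 104],
        "diagnosis": ["M", "B", "B", "M"],
        "radius_mean": [17.99, 13.54, 12.45, 20.57],
        "Unnamed: 32": [None, None, None, None],  # column holding only missing values
    }
)

# Basic inspection: column headers and the missing-value count per column.
print(df.columns.tolist())
print(df.isna().sum())

# Drop every column that contains nothing but missing values.
df = df.dropna(axis=1, how="all")

# Encode the categorical diagnosis (assumed mapping: malignant -> 1, benign -> 0).
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

print(df.head())
```

Using `how="all"` removes only columns that are entirely empty. If the real column were only partly missing, an explicit `df.drop(columns=[...])` would be the safer choice.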
The dataset was split into training and test sets: 75% of the data was used for training and the remaining 25% for testing.
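
A 75/25 split can be sketched with Scikit-Learn's `train_test_split`. So that the example runs without the Kaggle file, it loads the copy of the Wisconsin diagnostic data bundled with Scikit-Learn. That substitution, the fixed `random_state`, and the stratified sampling are assumptions, not details taken from this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Bundled copy of the Wisconsin diagnostic dataset (stand-in for the Kaggle CSV).
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the rows for testing.
# random_state makes the split reproducible; stratify keeps the
# malignant/benign ratio similar in both parts (both are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```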
The Logistic Regression model is imported from Scikit-Learn and fitted to the training set to predict the presence of cancer.
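
A minimal sketch of this step follows. It again uses the dataset copy bundled with Scikit-Learn in place of the Kaggle CSV. The feature scaling, `max_iter=1000`, and `random_state=0` are assumptions added so the solver converges and the run is reproducible; the original project may have used different settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Note: in the bundled copy the target is encoded 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# 75% training, 25% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize the features, then fit logistic regression (scaling is an assumption).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Predict the diagnosis for the held-out rows and report accuracy.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```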
The predicted values are compared with the true labels and plotted as a heatmap of the confusion matrix using the Seaborn library to determine the number of Type I (false positive) and Type II (false negative) errors.
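
The evaluation step could be sketched as below. The data source (Scikit-Learn's bundled copy), the recoding to 1 = malignant, the scaling step, and the plot styling are all assumptions. With malignant as the positive class, a Type I error is a benign tumor predicted as malignant, and a Type II error is a malignant tumor predicted as benign.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # bundled copy uses 0 = malignant; recode so 1 = malignant (positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(f"Type I errors (false positives): {fp}")
print(f"Type II errors (false negatives): {fn}")

# Draw the confusion matrix as an annotated heatmap.
class_names = ["benign", "malignant"]
fig, ax = plt.subplots()
sns.heatmap(
    cm, annot=True, fmt="d", cmap="Blues",
    xticklabels=class_names, yticklabels=class_names, ax=ax,
)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
fig.tight_layout()
# Call fig.savefig("confusion_matrix.png") or plt.show() to view the figure.
```

The off-diagonal cells of the heatmap hold the two error counts: top right is Type I, bottom left is Type II.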