This project explores how factors such as sexual health, smoking, and medical history relate to cervical cancer risk. Using data analysis and statistical modeling, it aims to identify significant predictors and develop a predictive model to support preventive healthcare efforts.
- Introduction
- Dataset
- Objectives
- Tools and Libraries
- Analysis Workflow
- Results and Insights
- How to Run the Project
Cervical cancer is a leading health concern globally, but it is also one of the most preventable cancers when diagnosed early and treated appropriately. This project investigates how lifestyle, sexual health, and medical history factors relate to cervical cancer risk, with the aim of deriving actionable insights for prevention.
- Source: Kaggle - Cervical Cancer Risk Classification Dataset
- Size: Approximately 858 rows and 36 columns
- Description: The dataset includes demographic, lifestyle, and medical history data, with a target variable (`Biopsy`) indicating whether cervical cancer was diagnosed.
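As a rough illustration of loading the data, the sketch below reads a small in-memory stand-in for the CSV rather than the real file. It assumes that missing entries are written as `?`, as is common in public copies of this dataset. The sample rows and the subset of columns are made up for illustration, so check both against the file you actually download.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the CSV; the rows are invented for illustration.
sample_csv = io.StringIO(
    "Age,Smokes,Smokes (years),Biopsy\n"
    "18,0,0,0\n"
    "34,1,?,1\n"
    "52,?,?,0\n"
)

# Assumption: missing values are encoded as "?", so map them to NaN on load.
# Without na_values, the affected columns would be read as strings.
df = pd.read_csv(sample_csv, na_values="?")

print(df.shape)                     # rows and columns
print(df.isna().sum())              # missing values per column
print(df["Biopsy"].value_counts())  # class balance of the target
```

For the real file, replace `sample_csv` with the path to the downloaded CSV.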
- Perform data cleaning and preprocessing to handle missing or invalid values.
- Conduct exploratory data analysis (EDA) to identify patterns and trends.
- Evaluate correlations between variables and the target (`Biopsy`).
- Develop a predictive model using logistic regression to assess cancer risk.
- Provide actionable insights for healthcare professionals.
- Python Libraries:
  - `pandas`, `numpy`: Data manipulation and preprocessing
  - `matplotlib`, `seaborn`: Data visualization
  - `scikit-learn`: Statistical modeling and machine learning
- Power BI: For advanced data visualization
- Data Cleaning
  - Handle missing values and outliers.
  - Create new features, such as `Smoking Severity`.
- Exploratory Data Analysis
  - Analyze distributions of key variables.
  - Examine relationships between lifestyle factors, medical history, and cervical cancer.
- Correlation Analysis
  - Use correlation heatmaps to identify significant relationships between predictors and the target variable.
- Predictive Modeling
  - Train a logistic regression model to predict cervical cancer risk.
  - Evaluate the model using metrics such as accuracy, precision, and recall.
- Data Visualization in Power BI
  - Create visualizations like scatter plots, bar charts, and heatmaps for insights.
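The Python steps of the workflow above can be sketched end to end as follows. This is a minimal illustration, not the project's actual code. It runs on synthetic data generated inline, and several choices are assumptions: the column names, median imputation, the definition of `Smoking Severity` as years smoked times packs per year, feature scaling, and `class_weight="balanced"`. Outlier handling is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (NOT the project's data).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Age": rng.integers(15, 60, n).astype(float),
    "Smokes (years)": rng.choice([0.0, 2.0, 10.0, 20.0], n, p=[0.7, 0.1, 0.1, 0.1]),
    "Smokes (packs/year)": rng.choice([0.0, 0.5, 2.0, 5.0], n),
    "STDs (number)": rng.choice([0.0, 1.0, 2.0], n, p=[0.8, 0.15, 0.05]),
})
logit = -3.0 + 0.03 * df["Age"] + 0.08 * df["Smokes (years)"] + 0.9 * df["STDs (number)"]
probability = 1.0 / (1.0 + np.exp(-logit))
df["Biopsy"] = (probability > rng.random(n)).astype(int)
# Knock out some values to imitate missing entries.
df.loc[rng.choice(n, 30, replace=False), "Smokes (years)"] = np.nan

# 1. Data cleaning: fill missing values with the column median (one simple option).
raw_features = ["Age", "Smokes (years)", "Smokes (packs/year)", "STDs (number)"]
clean = df.copy()
clean[raw_features] = clean[raw_features].fillna(clean[raw_features].median())

# Hypothetical definition of the engineered feature; the project may define it differently.
clean["Smoking Severity"] = clean["Smokes (years)"] * clean["Smokes (packs/year)"]

# 2. Correlation analysis: correlation of each predictor with the target.
# For a heatmap, pass clean.corr() to seaborn.heatmap.
target_corr = clean.corr()["Biopsy"].drop("Biopsy").sort_values(ascending=False)
print(target_corr)

# 3. Predictive modeling: scaled logistic regression on a stratified split.
model_features = raw_features + ["Smoking Severity"]
X_train, X_test, y_train, y_test = train_test_split(
    clean[model_features], clean["Biopsy"],
    test_size=0.25, stratify=clean["Biopsy"], random_state=0,
)
model = make_pipeline(
    StandardScaler(),
    # class_weight="balanced" is an assumption for a target with few positives.
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# 4. Evaluation with the metrics named in the workflow.
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred, zero_division=0),
    "recall": recall_score(y_test, pred, zero_division=0),
}
print(metrics)
```

When positive cases are rare, accuracy alone can look high even if the model misses most of them, which is why precision and recall are reported alongside it.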
- Demographics: Age and smoking habits are critical factors influencing cancer risk.
- STD Impact: Certain STDs (Genital Herpes, HIV, Condylomatosis, Vulvo-Perineal Condylomatosis), particularly with early and recurrent diagnosis, significantly increase risk.
- Contraceptive Use: Limited correlations were observed between contraceptive use and cancer outcomes.
- High-Risk Profiles: Smoking severity and multiple STD diagnoses highlight individuals at higher risk.
- Correlations: Age, smoking, and sexual health variables are strongly linked to positive biopsy results.
- Key predictors of cervical cancer include smoking history, contraceptive use, and STDs.
- Strong correlations between lifestyle and medical history factors were identified.
- The logistic regression model demonstrated high predictive performance, highlighting its potential for healthcare applications.
- Clone this repository:
git clone https://github.com/yourusername/Analyzing-Factors-Affecting-Cervical-Cancer-Risk.git