An analysis that predicts individual health insurance costs charged by health insurance companies based on age, sex, BMI, children, smoking, and region using predictive modeling and machine learning.

jasonzelaya/Insurance-Forecast

Health Insurance Cost Forecast

-- Status: Completed

Purpose

The purpose of this analysis is to predict the individual costs charged by health insurance companies based on the beneficiary's age, sex, BMI, number of children, smoking status, and region.

Methods Used

  • Supervised Machine Learning
  • Inferential Statistics
  • Descriptive Statistics
  • Machine Learning
  • Data Visualization
  • Predictive Modeling
  • Regression Analysis
  • Factor Analysis
  • Random Forest

Technologies

  • Python
  • R
  • Jupyter Notebook
  • Pandas
  • NumPy
  • Matplotlib
  • Scikit-learn
  • Graphviz
  • Seaborn
  • Yellowbrick
  • Pydot

Needs of this project

  • Data exploration/descriptive statistics
  • Data processing/cleaning
  • Statistical modeling
  • Writeup/reporting

Data Source

Kaggle: https://www.kaggle.com/mirichoi0218/insurance

Data Content

  • Age: Age of the beneficiary in years.
  • Sex: Whether the beneficiary is male or female.
  • BMI: Body mass index, computed as weight in kilograms divided by the square of height in metres. A BMI from 18.5 to 24.9 is generally considered healthy.
  • Children: Number of dependents covered by health insurance.
  • Smoker: Whether or not the beneficiary smokes.
  • Region: The beneficiary's residential area in the US. The categories are northeast, southeast, southwest, and northwest.
  • Charges: The price the beneficiary pays the health insurance companies in USD.

Note: The individual paying for the health insurance is referred to as the "beneficiary" in the definitions above.
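
The BMI definition above can be made concrete with a small sketch. The data set already provides BMI as a column, so this only illustrates the standard formula (weight in kilograms divided by height in metres squared). The function names and the example person are hypothetical and are not part of the original analysis.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def in_healthy_range(value: float) -> bool:
    """True when the BMI falls in the 18.5 to 24.9 range quoted in the definitions."""
    return 18.5 <= value <= 24.9


# Hypothetical person: 70 kg, 1.75 m tall.
example = bmi(70.0, 1.75)
print(round(example, 1), in_healthy_range(example))  # 22.9 True
```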

Underlying Assumptions

For the model to be usable in practice, it should conform to the assumptions of linear regression. To confirm this, we examined the data set and checked that:

  • The regression model is linear in its parameters
  • The mean of the residuals is zero
  • The residuals are homoscedastic (have equal variance)
  • The residuals are normally distributed
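
As a rough illustration of how the residual checks in this list might be computed, the sketch below fits an ordinary least squares line with NumPy and summarizes the residuals. It runs on synthetic stand-in data, not the Kaggle data set, and the helper name `residual_report`, the split-half variance ratio, and the moment-based normality summaries are assumptions chosen for brevity rather than the checks used in the original notebooks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the Kaggle data): charges rise with age, plus noise.
n = 200
age = rng.uniform(18, 64, size=n)
charges = 3000.0 + 250.0 * age + rng.normal(0.0, 1500.0, size=n)

# Ordinary least squares with an intercept column; the model is linear in coef.
X = np.column_stack([np.ones(n), age])
coef, *_ = np.linalg.lstsq(X, charges, rcond=None)
fitted = X @ coef
residuals = charges - fitted


def residual_report(residuals, fitted):
    """Simple numeric summaries for three of the residual assumptions."""
    z = (residuals - residuals.mean()) / residuals.std()
    # Split residuals into low and high fitted-value halves to compare spread.
    order = np.argsort(fitted)
    half = len(order) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]
    return {
        "mean_residual": float(residuals.mean()),         # should be near 0
        "variance_ratio": float(high.var() / low.var()),  # near 1 if homoscedastic
        "skewness": float(np.mean(z ** 3)),               # near 0 if normal
        "excess_kurtosis": float(np.mean(z ** 4) - 3.0),  # near 0 if normal
    }


report = residual_report(residuals, fitted)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In practice these numbers are usually read alongside plots such as residuals against fitted values and a Q-Q plot; the numeric summaries only flag gross violations.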

ML Algorithm

  • Multiple linear regression (supervised learning)
  • Use pandas.crosstab on the categorical variables (sex, smoker, region) to confirm their values
  • Check for typos
  • Round dollar amounts to a consistent number of decimals
  • Check the range of age
  • Check for incorrect entries
  • Data validation covers both exploratory data analysis and cleaning the data
Other Contributing Members

Contact

Jason.Zelaya474@gmail.com

Contributors 3