Using Machine Learning algorithms to optimize an insurance company's marketing campaigns.
An insurance company wants to offer new auto insurance to its customers. However, its marketing budget is limited. The company asked its Data Science team to find out which customers should be given priority to receive new insurance offers.
After understanding the problem (optimizing the use of the marketing budget), the company's data team adopted a strategy of classification: customers will be classified according to the probability of purchasing the new insurance. This probability will be provided by a Machine Learning algorithm.
With the customer base sorted by purchase probability, the marketing team can focus its efforts on the customers at the top of the sorted dataset, that is, those with the highest purchase probability.
The Machine Learning model is therefore expected to concentrate the customers most likely to purchase at the top of the dataset, so that a small percentage of the dataset contains a large percentage of the likely buyers.
In data science lingo, this is a “Learning to Rank” solution.
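The ranking idea described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the `purchase_probability` field, the customer ids and the scores are all made up for the example.

```python
def rank_customers(customers, score_key="purchase_probability"):
    """Return customers sorted from most to least likely to buy, with a 1-based rank."""
    ordered = sorted(customers, key=lambda c: c[score_key], reverse=True)
    return [{**c, "rank": i} for i, c in enumerate(ordered, start=1)]


# Hypothetical customers and scores, for illustration only.
customers = [
    {"id": 101, "purchase_probability": 0.12},
    {"id": 102, "purchase_probability": 0.87},
    {"id": 103, "purchase_probability": 0.45},
    {"id": 104, "purchase_probability": 0.03},
]

ranked = rank_customers(customers)
top_ids = [c["id"] for c in ranked[:2]]  # the marketing team starts from the top
print(top_ids)  # [102, 103]
```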
The original dataset is available on Kaggle; for educational purposes, it was loaded into a SQL database on AWS.
2.1. Data Overview: head, dimensions, data types.
2.2. Null Value Check
2.3. Checking Attribute Types
3.1. Descriptive statistics
3.2. Separation of Attribute Classes: numeric and categorical.
3.3. Univariate Analysis
3.4. Bivariate Analysis: relationship between the response variable and several attributes, most of them categorical.
3.5. Multivariate Analysis:
3.5.1. Correlation between all numeric attributes: Pearson Correlation.
3.5.2. Verification of the degree of relationship between categorical attributes: Cramér's V.
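As a rough sketch of step 3.5.2, Cramér's V can be computed from the chi-squared statistic of a contingency table as V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The version below is the plain, uncorrected form written with the standard library only; whether the project used this form or a bias-corrected variant is not stated, and the sample columns are invented.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two categorical sequences (no bias correction)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))  # missing pairs count as 0
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or min_dim == 0:
        # Undefined with a single category; returning 0.0 is a convention chosen here.
        return 0.0
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * min_dim))


# Hypothetical columns: perfectly associated, then independent.
damage = ["yes", "yes", "no", "no"]
response = [1, 1, 0, 0]
print(cramers_v(damage, response))              # 1.0
print(cramers_v(["a", "a", "b", "b"], [1, 0, 1, 0]))  # 0.0
```

Values close to 1 indicate that two categorical attributes carry nearly the same information, which is useful when deciding which of them to keep.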
4.1. Dataset Split: training set and test set.
4.2. Standardization
4.3. Rescaling
4.4. Encoding
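The preparation steps in 4.1 to 4.4 can be illustrated with a small standard-library sketch. The key point is that every transformation is fitted on the training set only and then applied to the test set. The column values are invented, and frequency encoding is only one possible encoding; the encoders actually used in the project are not specified here.

```python
from collections import Counter
from statistics import mean, pstdev


def fit_standardizer(train):
    """Learn mean and population standard deviation from the training column."""
    return mean(train), pstdev(train)


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]  # assumes sigma > 0


def fit_min_max(train):
    """Learn the minimum and maximum from the training column."""
    return min(train), max(train)


def rescale(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo


def fit_frequency_encoder(train):
    """Map each category to its relative frequency in the training column."""
    counts = Counter(train)
    return {cat: count / len(train) for cat, count in counts.items()}


def frequency_encode(values, mapping):
    return [mapping.get(v, 0.0) for v in values]  # unseen categories map to 0.0


# Hypothetical training and test columns.
train_ages, test_ages = [20, 30, 40, 50], [35]
train_channels, test_channels = ["a", "a", "b", "c"], ["a", "z"]

mu, sigma = fit_standardizer(train_ages)
lo, hi = fit_min_max(train_ages)
mapping = fit_frequency_encoder(train_channels)

print(standardize(test_ages, mu, sigma))         # [0.0]
print(rescale(test_ages, lo, hi))                # [0.5]
print(frequency_encode(test_channels, mapping))  # [0.5, 0.0]
```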
The ExtraTrees classifier algorithm was used to select attributes.
Four models were chosen for evaluation: K-Nearest Neighbors, Linear Regression, ExtraTrees and Random Forest.
The trained models were evaluated through Holdout Validation and Cross-validation.
The evaluation metrics were Precision@K and Recall@K, the latter being the most important for the problem at hand.
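For reference, Precision@K is the share of the top K ranked customers who are actually interested, and Recall@K is the share of all interested customers found within the top K. A minimal sketch follows, with made-up scores and labels; the function name is illustrative, not taken from the project.

```python
def precision_recall_at_k(scores, labels, k):
    """Compute (precision@k, recall@k) for binary labels ranked by descending score."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of customers")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])  # interested customers in the top k
    total_positives = sum(labels)
    precision = hits / k
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Hypothetical model scores and true responses (1 = interested).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

k = 4  # top 40% of a 10-customer base
precision, recall = precision_recall_at_k(scores, labels, k)
print(f"precision@{k}={precision:.2f}, recall@{k}={recall:.2f}")
# precision@4=0.50, recall@4=0.67
```

Recall@K matters most here because it answers the business question directly: how many of the interested customers does the team reach if it only contacts the top K?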
The model was deployed as a containerized AWS Lambda Function via Docker.
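To give an idea of what a Lambda entry point for such a model might look like, here is a hedged sketch. Only the `lambda_handler(event, context)` signature follows the standard AWS Lambda convention for Python; the event shape (a JSON `body` with a `customers` list), the `vehicle_damage` field and the `predict_proba_stub` function are assumptions standing in for the real request format and the trained model.

```python
import json


def predict_proba_stub(customer):
    """Placeholder for the trained model; returns a made-up probability."""
    return 0.9 if customer.get("vehicle_damage") == "yes" else 0.1


def lambda_handler(event, context):
    """Score the customers in the request body and return them ranked."""
    try:
        customers = json.loads(event["body"])["customers"]
    except (KeyError, TypeError, json.JSONDecodeError):
        error = {"error": "expected a JSON body with a 'customers' list"}
        return {"statusCode": 400, "body": json.dumps(error)}
    scored = [{**c, "score": predict_proba_stub(c)} for c in customers]
    scored.sort(key=lambda c: c["score"], reverse=True)
    return {"statusCode": 200, "body": json.dumps({"customers": scored})}


# Local usage example with a hypothetical request.
event = {"body": json.dumps({"customers": [
    {"id": 1, "vehicle_damage": "no"},
    {"id": 2, "vehicle_damage": "yes"},
]})}
response = lambda_handler(event, None)
print(response["statusCode"])  # 200
```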
The “real” data (the test dataset) was hosted on AWS S3.
Access to the model is through a Streamlit application hosted in a virtual machine on AWS EC2.
In this application, a web page asks the user for the percentage of the customer base, ranked by purchase probability, that they want to retrieve (top %). The application then returns a table with the data of these customers and the option to download it as a .csv file.
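The core of that page, turning a "top %" into a slice of the ranked table and serializing it as CSV, could look roughly like the sketch below. It uses only the standard library; the function name, the field names and rounding the row count up are assumptions made for the example, not details taken from the application.

```python
import csv
import io
import math


def top_percent_csv(ranked_customers, top_percent):
    """Return the top `top_percent` of an already ranked list as CSV text."""
    if not ranked_customers:
        raise ValueError("ranked_customers must not be empty")
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in the interval (0, 100]")
    # Rounding up is a choice made here so that small percentages return a row.
    n_rows = math.ceil(len(ranked_customers) * top_percent / 100)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(ranked_customers[0].keys()))
    writer.writeheader()
    writer.writerows(ranked_customers[:n_rows])
    return buffer.getvalue()


# Hypothetical customers, already sorted by descending score.
ranked_customers = [
    {"id": 102, "score": 0.87},
    {"id": 103, "score": 0.45},
    {"id": 101, "score": 0.12},
    {"id": 104, "score": 0.03},
]
csv_text = top_percent_csv(ranked_customers, 50)
print(csv_text.splitlines())  # ['id,score', '102,0.87', '103,0.45']
```

In a Streamlit app, text like `csv_text` could then be handed to a download button so the user can save the table.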
Only 12.26% of customers are interested in a new insurance offer. Considering the size of the base (300k+ customers), it is necessary to develop a solution to find these few interested customers at a low cost.
The average age of interested customers is approximately 45 years.
Customers with a previously damaged vehicle are 45x more likely to wish to receive a new insurance offer.
After cross-validation, the model chosen was Random Forest, as it presented the best average result in the Recall@K metric, the most relevant in this problem.
Holdout Validation:
Model | Precision@K | Recall@K | k% |
---|---|---|---|
random_forest | 0.2718 | 0.8921 | 40.0 |
extra_trees | 0.2696 | 0.8848 | 40.0 |
linear_regression | 0.2659 | 0.8726 | 40.0 |
knn | 0.2616 | 0.8587 | 40.0 |
Cross-validation:
Model | k_folds | precision_avg | precision_std | recall_avg | recall_std | k% |
---|---|---|---|---|---|---|
random_forest | 5 | 0.2734 | 0.0007 | 0.8924 | 0.0020 | 40.0 |
extra_trees | 5 | 0.2697 | 0.0009 | 0.8800 | 0.0019 | 40.0 |
linear_regression | 5 | 0.2672 | 0.0016 | 0.8720 | 0.0029 | 40.0 |
knn | 5 | 0.2631 | 0.0008 | 0.8587 | 0.0016 | 40.0 |
With the adopted Machine Learning model, the marketing team ran call campaigns that reached ~90% of interested customers while contacting only 40% of the total customer base.
Assuming a cost of $10.00 per call and a total of 381,109 customers, not calling the remaining 60% of the base saved the company $2,286,660 in phone call costs (a 60% cost reduction).
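The arithmetic behind this figure can be reproduced as follows. The only assumption added here is that the number of customers called (40% of the base) is rounded down to a whole customer, which matches the reported total.

```python
TOTAL_CUSTOMERS = 381_109
COST_PER_CALL = 10.00  # assumed cost per call, in dollars
TOP_PERCENT = 40       # share of the base that is actually called

# Integer arithmetic; rounding down is an assumption that matches the reported figure.
customers_called = TOTAL_CUSTOMERS * TOP_PERCENT // 100  # 152,443
calls_avoided = TOTAL_CUSTOMERS - customers_called       # 228,666
savings = calls_avoided * COST_PER_CALL
reduction = calls_avoided / TOTAL_CUSTOMERS

print(f"${savings:,.2f} saved ({reduction:.0%} of the full-base cost)")
# $2,286,660.00 saved (60% of the full-base cost)
```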
- Implement access control to the application link.
- Allow user to upload customer data as a csv file.
- Display a dashboard on the application page with a brief exploratory analysis of the customer dataset used.
- Implement versioning of databases and ML models.
This project demonstrated how a data science team can optimize a company's marketing budget by allocating it to the people most likely to buy.
The CRISP-DM framework, adapted for Data Science, was used to organize the resolution of this problem. The main reason for adopting it is its ability to deliver solutions and business insights quickly and continuously throughout a Data Science project.
Finally, in this project I had the opportunity to use solutions from the largest cloud provider in the world (AWS) to make the resolution of this problem as close as possible to a real situation.
- Kaggle Dataset: https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction
- CRISP-DM for Data Scientists: https://towardsdatascience.com/crisp-dm-ready-for-machine-learning-projects-2aad9172056a
- Cost per Lead in the Insurance Market: https://insuranceleadsguide.com/buying-insurance-leads/
- Cramér's V correlation matrix: https://www.kaggle.com/code/chrisbss1/cramer-s-v-correlation-matrix/notebook
- Hold-out vs. Cross-validation in Machine Learning: https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f
- Precision and Recall at K: https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
- Serverless Deployment of Machine Learning Models on AWS Lambda: https://towardsdatascience.com/serverless-deployment-of-machine-learning-models-on-aws-lambda-5bd1ca9b5c42
- Deploying a Streamlit app on AWS EC2: https://towardsdatascience.com/how-to-deploy-a-streamlit-app-using-an-amazon-free-ec2-instance-416a41f69dc3
- Fundamentals of MLOps — Part 2 | Data & Model Management with DVC: https://medium.com/analytics-vidhya/fundamentals-of-mlops-part-2-data-model-management-with-dvc-6be2ad284ec4