layout | title | description |
---|---|---|
default | Customer Segmentation with KMeans Clustering | A short description of your project |
Jupyter notebook available here
- Overview
- Dataset Description
- Objective
- Theoretical Background
- Methodology
- Key Results
- Customer Segmentation Strategies
- Conclusion
- References
This project utilizes K-Means clustering, an unsupervised machine learning algorithm, to segment customers based on their purchasing behavior. By understanding customer profiles, businesses can tailor their strategies to boost customer satisfaction, loyalty, and revenue.
The dataset is available here. The Online Retail II dataset contains all transactions made by a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.
Feature | Description |
---|---|
InvoiceNo | Unique 6-digit identifier for each transaction. Cancellation codes start with 'C'. |
StockCode | Unique code assigned to each product. |
Description | Name of the product. |
Quantity | Number of units purchased in a transaction. |
InvoiceDate | Date and time of the transaction. |
UnitPrice | Price per unit in GBP (£). |
CustomerID | Unique identifier for each customer. |
Country | Country where the customer resides. |
The primary goals of this project are:
- To identify meaningful customer groups based on purchasing behavior.
- To understand the characteristics of each customer segment.
- To provide actionable insights for improving marketing and customer engagement strategies.
K-Means is chosen for this project because:
- Efficiency: K-Means works well with large datasets and provides quick clustering results.
- Scalability: It can handle a variety of data sizes and complexities.
- Interpretability: The results are easy to visualize and interpret, especially for customer segmentation tasks.
- Versatility: It works well with numerical data, which is predominant in this dataset.
1. Initialization:
   - Choose the number of clusters ($k$).
   - Randomly initialize cluster centroids.
2. Assignment Step:
   - Assign each data point to the nearest centroid using the Euclidean distance.
3. Update Step:
   - Recalculate the centroids by taking the mean of all points assigned to a cluster.
4. Convergence:
   - Repeat steps 2 and 3 until the centroids no longer change significantly or a predefined number of iterations is reached.
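The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's implementation: the function name `kmeans`, its parameters (`max_iter`, `tol`, `seed`), and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain Lloyd-style K-Means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the points assigned to each cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Toy data (hypothetical): two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
```

On this toy input the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization.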
K-Means minimizes the Within-Cluster Sum of Squares (WCSS):
$$ WCSS = \sum_{i=1}^{k} \sum_{x \in C_{i}} \lVert x - \mu_{i} \rVert^2 $$

Where:
- $k$: Number of clusters
- $C_{i}$: Cluster $i$
- $\mu_{i}$: Centroid of cluster $i$
- $x$: Data point
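As a small worked example of the WCSS objective, the sketch below sums the squared distance from each point to its assigned centroid. The helper name `wcss` and the three sample points are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    diffs = X - centroids[labels]  # row i minus the centroid of its cluster
    return float((diffs ** 2).sum())

# Cluster 0 holds (0, 0) and (0, 2) with centroid (0, 1);
# cluster 1 holds only (5, 5), which is its own centroid.
X = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[0.0, 1.0], [5.0, 5.0]])

total = wcss(X, labels, centroids)  # 1 + 1 + 0 = 2.0
```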
The Silhouette Score evaluates the quality of clustering by comparing intra-cluster and inter-cluster distances. It ranges from $-1$ to $1$:

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $$

Where:
- $a(i)$: Average distance between $i$ and other points in the same cluster.
- $b(i)$: Average distance between $i$ and points in the nearest cluster.
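To make the definitions of $a(i)$ and $b(i)$ concrete, here is a sketch that computes the silhouette value of a single point. The function name `silhouette_of_point` and the one-dimensional sample data are assumptions for illustration. Returning 0 for a point that is alone in its cluster follows the usual convention.

```python
import numpy as np

def silhouette_of_point(X, labels, i):
    """Silhouette value (b - a) / max(a, b) for the point at index i."""
    own = labels[i]
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == own
    same[i] = False  # exclude the point itself from its own cluster
    if not same.any():
        return 0.0  # convention for a single-point cluster
    a = dists[same].mean()
    # b: smallest mean distance to the points of any other cluster.
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != own)
    return float((b - a) / max(a, b))

# Two tight 1-D clusters: {0, 1} and {10, 11}.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the point at 0: a = 1, b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5.
s0 = silhouette_of_point(X, labels, 0)
```

A value close to 1, as here, means the point sits much closer to its own cluster than to the nearest other one. The overall Silhouette Score is the mean of these values over all points.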
EDA involved:
- Checking for null values and invalid data (e.g., negative prices).
- Analyzing the distribution of numerical features like `Quantity`, `UnitPrice`, and `CustomerID`.
- Identifying patterns in `InvoiceDate` and `Country` to uncover insights into purchasing trends.

Key observations:
- A significant number of transactions had missing `CustomerID` values.
- Negative or zero values in `Quantity` and `UnitPrice` indicated potential errors or special promotions.
- Removing Invalid Entries:
  - Dropped rows with missing `CustomerID`.
  - Excluded transactions with negative or zero `Quantity` or `UnitPrice`.
- Processing Categorical Variables:
  - Cleaned `StockCode` by removing non-product entries like "ADJUST" and "TEST".
- Created New Features:
  - `sales_line_total = Quantity × UnitPrice`: Total revenue per invoice line.
- Aggregated Data:
  - Grouped by `CustomerID` to calculate:
    - Monetary Value: Total spending.
    - Frequency: Number of unique transactions.
    - Recency: Days since the last purchase.
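The preprocessing steps above might look roughly like this in pandas. It is a sketch under stated assumptions rather than the notebook's code: the sample rows are made up, the output column names (`MonetaryValue`, `Frequency`, `Recency`, `LastInvoiceDate`) are illustrative, and the reference date for Recency (one day after the latest invoice in the cleaned data) is an assumption, since the text does not say which date was used.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the feature table.
raw = pd.DataFrame({
    "InvoiceNo": ["489434", "489434", "489435", "C489436",
                  "489437", "489438", "489439"],
    "StockCode": ["85048", "79323P", "22350", "22350",
                  "TEST", "21232", "21232"],
    "Quantity": [12, 6, 4, -4, 1, 3, 2],
    "InvoiceDate": pd.to_datetime([
        "2009-12-01 07:45", "2009-12-01 07:45", "2009-12-05 10:00",
        "2009-12-06 11:00", "2009-12-07 12:00", "2009-12-08 09:00",
        "2009-12-09 09:00",
    ]),
    "UnitPrice": [6.95, 6.75, 2.55, 2.55, 1.00, 1.25, 1.25],
    "CustomerID": [13085.0, 13085.0, 13085.0, 13085.0, 14000.0, 14000.0, None],
})

# Remove invalid entries: missing customers, non-positive quantity or price.
clean = raw.dropna(subset=["CustomerID"])
clean = clean[(clean["Quantity"] > 0) & (clean["UnitPrice"] > 0)]
# Drop non-product stock codes.
clean = clean[~clean["StockCode"].isin(["ADJUST", "TEST"])].copy()

# New feature: revenue of each invoice line.
clean["sales_line_total"] = clean["Quantity"] * clean["UnitPrice"]

# Assumed reference date for Recency: one day after the latest invoice.
snapshot = clean["InvoiceDate"].max() + pd.Timedelta(days=1)

# Aggregate per customer.
rfm = clean.groupby("CustomerID").agg(
    MonetaryValue=("sales_line_total", "sum"),
    Frequency=("InvoiceNo", "nunique"),
    LastInvoiceDate=("InvoiceDate", "max"),
)
rfm["Recency"] = (snapshot - rfm["LastInvoiceDate"]).dt.days
```

In this sample the cancellation line, the `TEST` line, and the row without a `CustomerID` are all dropped before aggregation, leaving one row per remaining customer.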
- Used StandardScaler to normalize features (`Monetary Value`, `Frequency`, `Recency`) with Z-score scaling:

  $$ Z = \frac{X - \mu}{\sigma} $$

  Where:
  - $Z$: The Z-score, representing the normalized value of the feature. It indicates how many standard deviations a particular value $X$ is from the mean $\mu$.
  - $X$: The original value of the feature being normalized.
  - $\mu$: The mean (average) value of the feature, calculated as: $$ \mu = \frac{\sum_{i=1}^{n} X_i}{n} $$ where:
    - $n$: Number of data points.
    - $X_i$: Individual feature values.
  - $\sigma$: The standard deviation of the feature, measuring the spread or variability of the data around the mean, calculated as: $$ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}} $$
- Optimal K Selection:
  - Used the Elbow Method and Silhouette Scores to determine $k = 4$.
- K-Means Execution:
  - Clustered scaled data into four segments.
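The scaling and model-selection steps can be sketched with scikit-learn as follows. The data here is synthetic and built to contain four groups, so it only illustrates the procedure and does not reproduce the project's result. The range of candidate `k` values, `n_init`, and `random_state` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three aggregated features: four groups,
# with columns deliberately placed on very different scales.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])
X = X * np.array([1000.0, 10.0, 30.0])

# Z-score scaling. StandardScaler divides by the population standard
# deviation (ddof=0), which matches the formula for sigma above.
X_scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Elbow Method input (inertia is the WCSS) and Silhouette Score per k.
inertias, silhouettes = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

best_k = max(silhouettes, key=silhouettes.get)

# Final clustering with the selected number of clusters.
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_scaled)
segments = final_model.labels_
```

Plotting `inertias` against `k` gives the elbow curve. The Silhouette Score offers a second opinion when the elbow is ambiguous. Scaling first matters because K-Means relies on Euclidean distance, so an unscaled feature with a large range would otherwise dominate the clustering.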
Cluster | Characteristics | Description |
---|---|---|
0 | High-value, frequent buyers | Regular buyers with high spending. |
1 | Infrequent, low-value customers | Customers who purchase sporadically. |
2 | New or low-engagement customers | Customers with low spending but recent activity. |
3 | Loyal, high-frequency, high-value buyers | The most valuable customers in terms of revenue and engagement. |
Cluster 0 (high-value, frequent buyers):
- Rationale: High-value customers who make frequent purchases.
- Actions:
  - Introduce loyalty programs.
  - Provide personalized recommendations.

Cluster 1 (infrequent, low-value customers):
- Rationale: Customers with sporadic and low spending patterns.
- Actions:
  - Send targeted email campaigns.
  - Offer exclusive discounts to incentivize purchases.

Cluster 2 (new or low-engagement customers):
- Rationale: Recent customers with potential for growth.
- Actions:
  - Offer onboarding incentives.
  - Provide exceptional customer service.

Cluster 3 (loyal, high-frequency, high-value buyers):
- Rationale: Top-performing customers in terms of revenue and frequency.
- Actions:
  - Develop a premium rewards program.
  - Provide exclusive early access to products or events.
K-Means clustering successfully segmented customers into actionable groups. These insights can drive personalized marketing and improve overall business strategy.
Possible next steps:
- Deploy the clustering model in a production environment.
- Integrate segmentation results into CRM systems.
- Experiment with alternative clustering algorithms (e.g., Hierarchical Clustering).