wholesale-clustering


Purpose

  • Cluster sales data described by the 8 given features and derive appropriate sales ideas from the results.
  • Use unsupervised learning – Hierarchical Clustering, DBSCAN, K-Means Clustering.

File Directory

wholesale-clustering
├── data
│   └── Wholesale customers data.csv               # annual customer sales data (CSV)
├── wholesale-kmeans.ipynb                         # 반소희, 이서린 : K-Means Clustering
├── wholesale_hierarchical.ipynb                   # 장아연 :  Hierarchical Clustering
└── wholesale_DBScan.ipynb                         # 이희원, 최지민 : DBSCAN

Instructions

  1. Open Colab by clicking the links below
  2. Set the hardware accelerator to GPU (Runtime - Change runtime type - Hardware accelerator: GPU)
  3. Run all cells (shortcut: Cmd/Ctrl + F9)

Methods

  1. Load data

  2. Check for missing values

    • As there were no missing values, this step was skipped (see the sketch below).
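
A minimal sketch of steps 1-2, assuming pandas and the CSV path shown in the file tree above:

import pandas as pd

# Load the annual wholesale customer sales data.
df = pd.read_csv("data/Wholesale customers data.csv")

# Count missing values per column; the dataset has none, so no imputation is needed.
print(df.isnull().sum())
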
  3. Preprocessing

    • numeric features
      • normalization
        • For numeric features, two well-known preprocessing techniques are standardization and normalization.
        • Our numeric features need to be normalized, so we used MinMaxScaler for min-max normalization.
      • outliers
        • After normalization, we checked the boxplot of each feature and found some outliers based on the IQR.
        • We replaced those outliers with the Q3 value (75th percentile) of each feature.
    • categorical features
      • Use one-hot encoding so the categorical features can be fed to the clustering models (a preprocessing sketch follows this step).
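
A minimal sketch of the preprocessing above, assuming the standard UCI Wholesale customers column names (Channel and Region as categorical, six annual spend columns as numeric):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
categorical_cols = ["Channel", "Region"]

# Min-max normalization of the numeric features.
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# Cap IQR-based outliers: values above the upper fence are replaced with Q3.
for col in numeric_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    df.loc[df[col] > upper_fence, col] = q3

# One-hot encode the categorical features.
df = pd.get_dummies(df, columns=categorical_cols)
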
  4. Find Optimal K

    • Elbow Method
    • Silhouette Coefficient
      • To find the optimal K, we used the two methods above.
      • For k in the range 2 to 15, we drew graphs to check the results visually.
      • Both methods pointed to the same optimal K = 6 (a K-search sketch follows this step).
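
A minimal sketch of the K search, assuming scikit-learn, matplotlib, and the preprocessed DataFrame df from the step above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 16)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(df)
    inertias.append(km.inertia_)                       # elbow method: within-cluster SSE
    silhouettes.append(silhouette_score(df, km.labels_))  # silhouette coefficient

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, inertias, marker="o")
ax1.set(title="Elbow method", xlabel="k", ylabel="inertia")
ax2.plot(ks, silhouettes, marker="o")
ax2.set(title="Silhouette coefficient", xlabel="k", ylabel="score")
plt.show()
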
  5. Clustering with 6 clusters

    • Use K-Means, DBSCAN, and Hierarchical clustering to cluster the data (see the sketch after this list).
    • Proceed with clustering at k = 6.
    • There are 2 nominal features, Region and Channel, which give 6 possible combinations.
    • Those 6 combinations correspond one-to-one to the 6 cluster labels.
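
A minimal sketch of running the three algorithms on the preprocessed DataFrame df; note that DBSCAN takes density parameters rather than a cluster count, so the eps and min_samples values here are illustrative assumptions only:

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# K-Means and hierarchical (agglomerative) clustering with the chosen k = 6.
kmeans_labels = KMeans(n_clusters=6, random_state=42, n_init=10).fit_predict(df)
hier_labels = AgglomerativeClustering(n_clusters=6).fit_predict(df)

# DBSCAN has no n_clusters parameter; eps and min_samples are placeholders.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(df)
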

Cluster Analysis

  • 6 clusters are created.

    [Figures: KMeans and Hierarchical clustering results; resulting clusters]
  • Result of each cluster.

    [Figure: Results of wholesale clustering]

Reasons for DBSCAN failure

  • DBSCAN is sensitive to the density of the data.
  • The data in each column are concentrated in a narrow range of sales amounts (7000-12000).

    [Figure: Density plot]
  • The data are packed at a high and even density.
  • It is hard to find an appropriate epsilon and MinPts (see the sketch below).
    • Because the data are packed so densely, a slight change in the hyperparameters causes a huge change in the clustering result.
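
A minimal sketch illustrating this sensitivity: rerunning DBSCAN on the preprocessed DataFrame df with small changes to eps (the values here are illustrative assumptions) swings the result between one dense cluster and mostly noise:

import numpy as np
from sklearn.cluster import DBSCAN

for eps in (0.05, 0.1, 0.2, 0.4):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(df)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude the noise label
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
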
