wholesale-clustering


Purpose

  • Classify sales data using the 8 given features and propose suitable sales strategies based on the results.
  • Use unsupervised learning: Hierarchical Clustering, DBSCAN, and K-Means Clustering.

File Directory

wholesale-clustering
├── data
│   └── Wholesale customers data.csv               # annual customer sales data (CSV)
├── wholesale-kmeans.ipynb                         # 반소희, 이서린: K-Means Clustering
├── wholesale_hierarchical.ipynb                   # 장아연: Hierarchical Clustering
└── wholesale_DBScan.ipynb                         # 이희원, 최지민: DBSCAN

Instructions

  1. Open a notebook in Colab by clicking the links below.
  2. Set the hardware accelerator to GPU (Runtime > Change runtime type > Hardware accelerator: GPU).
  3. Run all cells (shortcut: Cmd/Ctrl + F9).

Methods

  1. Load data

  2. Check missing values

    • The data contained no missing values, so no handling was needed.
  3. Preprocessing

    • Numeric features
      • Normalization
        • Two well-known preprocessing options for numeric features are standardization and normalization.
        • Our numeric features needed to be normalized, so we used a min-max scaler (min-max normalization).
      • Outliers
        • After normalization, we checked a boxplot of each feature and found some outliers based on the IQR.
        • We replaced those outliers with each feature's Q3 value (the 75th percentile).
    • Categorical features
      • Use one-hot encoding so the features can be fed to the machine learning models.
  4. Find Optimal K

    • Elbow method
    • Silhouette coefficient
      • We used the two methods above to find the optimal K.
      • For k from 2 to 15, we plotted both metrics to inspect them visually.
      • Both methods agreed on an optimal K of 6.
  5. Clustering with 6 clusters

    • Use KMeans, DBSCAN, and hierarchical clustering to cluster the data.
    • Proceed with k = 6.
    • There are 2 nominal features, Region and Channel, which together yield 6 combinations.
    • These 6 combinations correspond one-to-one to the 6 cluster labels.
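
The preprocessing and K-selection steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code. The column names follow the public Wholesale customers dataset but are not listed in this README, so treat them as assumptions. The 1.5 × IQR fence is the usual convention and is also assumed. `make_demo_frame` is a hypothetical helper that generates synthetic stand-in data so the sketch runs without the CSV.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Assumed column names (not stated in this README).
NUMERIC = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
NOMINAL = ["Channel", "Region"]


def make_demo_frame(n=300, seed=0):
    """Hypothetical synthetic stand-in for the CSV (not the real data)."""
    rng = np.random.default_rng(seed)
    data = {col: rng.lognormal(mean=8.0, sigma=1.0, size=n) for col in NUMERIC}
    data["Channel"] = rng.integers(1, 3, size=n)  # 2 levels
    data["Region"] = rng.integers(1, 4, size=n)   # 3 levels
    return pd.DataFrame(data)


def preprocess(df):
    out = df.copy()
    # Min-max normalization of the numeric features to [0, 1].
    out[NUMERIC] = MinMaxScaler().fit_transform(out[NUMERIC])
    # IQR rule (1.5 * IQR fence is an assumption): replace outliers with Q3.
    for col in NUMERIC:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (out[col] < q1 - 1.5 * iqr) | (out[col] > q3 + 1.5 * iqr)
        out.loc[mask, col] = q3
    # One-hot encode the nominal features.
    return pd.get_dummies(out, columns=NOMINAL, dtype=float)


def score_k(X, k_values):
    """Inertia (for the elbow plot) and silhouette coefficient for each k."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        rows.append({
            "k": k,
            "inertia": model.inertia_,
            "silhouette": silhouette_score(X, model.labels_),
        })
    return pd.DataFrame(rows)


features = preprocess(make_demo_frame())
scores = score_k(features.to_numpy(), range(2, 16))  # k = 2 .. 15
print(scores)
```

Plotting `inertia` and `silhouette` against `k` gives the two curves used to pick K. On the synthetic data the preferred k will not necessarily be 6.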

Cluster Analysis

  • 6 clusters are created.

    [Images: KMeans and Hierarchical clustering results]
    [Image: Clusters]
  • Result of each cluster.

    [Images: Results of wholesale clustering]
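
One way to check how cluster labels line up with the Region and Channel combinations, and how closely KMeans and hierarchical clustering agree, is a contingency table plus the adjusted Rand index. The sketch below uses hypothetical synthetic data. `combo_table` and `is_one_to_one` are illustrative helpers, not functions from the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score


def combo_table(labels, channel, region):
    """Contingency table: rows are (Channel, Region) pairs, columns are cluster labels."""
    combos = pd.Series([f"C{c}-R{r}" for c, r in zip(channel, region)], name="combo")
    return pd.crosstab(combos, pd.Series(labels, name="cluster"))


def is_one_to_one(table):
    """True when every row and every column has exactly one non-zero cell."""
    nonzero = table.to_numpy() > 0
    return bool((nonzero.sum(axis=0) == 1).all() and (nonzero.sum(axis=1) == 1).all())


# Hypothetical demo data: 6 scaled numeric features plus one-hot Channel/Region.
rng = np.random.default_rng(0)
n = 240
channel = rng.integers(1, 3, size=n)
region = rng.integers(1, 4, size=n)
numeric = rng.uniform(0.0, 0.2, size=(n, 6))
onehot = pd.get_dummies(
    pd.DataFrame({"Channel": channel, "Region": region}),
    columns=["Channel", "Region"],
    dtype=float,
).to_numpy()
X = np.hstack([numeric, onehot])

km_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=6).fit_predict(X)

km_table = combo_table(km_labels, channel, region)
print(km_table)
print("KMeans one-to-one:", is_one_to_one(km_table))
print("KMeans vs hierarchical ARI:", adjusted_rand_score(km_labels, hc_labels))
```

An adjusted Rand index near 1 means the two methods produce nearly the same partition. A one-to-one table means each cluster holds exactly one Region/Channel combination.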

Reasons for DBSCAN failure

  • DBSCAN is sensitive to the density of the data.
  • The values in each column are concentrated in a narrow range of sales amounts (7000-12000).
    [Image: Density plot]
  • The data are packed at a high and uniform density.
  • This makes it hard to find an appropriate epsilon and MinPts.
    • Because the data are so dense, a slight change in either hyperparameter causes a large change in the clustering result.
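
The sensitivity described above can be explored with a small epsilon sweep. The sketch below is only an illustration on hypothetical, evenly dense synthetic data. The sorted k-distance curve is a common heuristic for choosing epsilon; this README does not say the notebooks used it. `dbscan_summary` and `k_distances` are illustrative helper names.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def dbscan_summary(X, eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor (excluding itself)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)  # column 0 is the point itself at distance 0
    return np.sort(dist[:, -1])


# Hypothetical evenly dense data standing in for the scaled features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 2))

for eps in (0.02, 0.04, 0.06, 0.08):
    n_clusters, n_noise = dbscan_summary(X, eps, min_samples=5)
    print(f"eps={eps:.2f} clusters={n_clusters} noise={n_noise}")

print("median 5-NN distance:", np.median(k_distances(X, 5)))
```

On evenly dense data the cluster count usually moves from many small fragments to a single large cluster over a narrow epsilon range, which is the behavior that makes tuning difficult.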