Web-Scale K-Means Clustering

Management and analysis of physical dataset project

Implement and benchmark variants of common clustering algorithms in a Spark environment, without relying on the clustering functions that Spark already provides.

The project therefore focuses on implementing these algorithms efficiently in a distributed system.

Main topics:

Mini-batch k-means, k-means++, k-means||
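
To make two of these topics concrete, here is a minimal single-machine sketch in pure Python of k-means++ seeding followed by mini-batch k-means updates. It is not the code from the notebooks and it does not use Spark; the function names (`sq_dist`, `kmeanspp_init`, `minibatch_kmeans`) and all parameters are hypothetical and chosen only for illustration. The per-center learning rate of 1/count is the usual mini-batch k-means choice and is assumed here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """k-means++ seeding (illustrative sketch).

    The first center is drawn uniformly; each later center is drawn with
    probability proportional to its squared distance to the nearest center
    chosen so far.
    """
    centers = [rng.choice(points)]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) > 0:
            new = rng.choices(points, weights=d2, k=1)[0]
        else:
            # Every point already coincides with a center: fall back to uniform.
            new = rng.choice(points)
        centers.append(new)
        # Keep, for each point, the distance to its nearest center.
        d2 = [min(d, sq_dist(p, new)) for p, d in zip(points, d2)]
    return [list(c) for c in centers]


def minibatch_kmeans(points, k, batch_size, iterations, seed=0):
    """Mini-batch k-means (illustrative sketch, single machine)."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch using the centers as they are at the
        # start of the iteration, then apply the updates.
        assigned = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in batch
        ]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0)] * 50 + [(10.0, 10.0)] * 50
    print(sorted(minibatch_kmeans(data, k=2, batch_size=10, iterations=20)))
```

In a distributed setting the assignment step would typically be computed per partition and the partial results merged, but how the notebooks do this is not stated here.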

Repository contents:

12partitions.jpeg
24partitions.jpeg
Amdahls_Law.png
K-means parallel.ipynb
K-means.ipynb
Plots.ipynb
README.md
Report_KMean_MiniBatch_withSpark.ipynb
amdahls_law_ourP.png
amdahls_law_vs_our_with_Ebars.png
efficiency_updated.png
gustafsons_law.png
gustafsons_law_notcut.png
minibatch_exp_681626_k10_t1_34
partitions_exp_647721_b1000_k10_t1_121
partitions_exp_661681_b1000_k10_t1_62
partitions_exp_678382_b1000_k10_t1_34
vm_stats.jpeg