Commit

renaming
fderyckel committed Apr 20, 2024
1 parent af91100 commit b891a7b
Showing 162 changed files with 4,134 additions and 922 deletions.

"markdown": "---\ntitle: \"Intro to Kmeans\"\nauthor: \"Francois de Ryckel\"\ndate: '2022-10-31'\ncategories: [Kmeans, ML]\ndescritpion: 'An introduction on the Kmeans algorithms.'\neditor: source\ndate-modified: '2022-10-31'\n---\n\n\nThe purpose of this article is to get a deeper dive into Kmeans as an unsupervised machine learning algorithms. To see the algorithm at work on some financial data, you can use [the post](../../kmeans-regime-change/index.qmd) on kmeans and regime change. \n\nAs mentioned, Kmeans is an unsupervised learning algorithm, aiming to group “observations” based on their distances to a fixed number of “centroids” (aka: the fixed number is K). Each centroids is defined as the mean (in each dimension) of all the points belonging to that centroid. All the points belonging to a centroid makes a cluster: all the observations belonging to the k-centroid makes the k-cluster. \n\nThe objective is, for a given number of centroids (i.e. k), to **minimize the total within-cluster variation** or intra-cluster variation (distance from the observation to the centroid). \n\nThe standard Kmeans algorithm aims to minimize the total sum of the square distances (Euclidean distance) between observations and centroids. \n\nFirst, we calculate the within-centroid sum of square: \n$$W(C_k) = \\sum_{x_i \\in C_k} (x_i - \\mu_k)^2$$ {#eq-sum-of-square-for-a-cluster}\n\n* $x_i$ is the ith-observation in our dataset\n* $k$ is the number of centroids and $C_k$ is the $k^{th}$ centroid. \n* $\\mu_k$ is the mean value of all the points assigned to the $C_k$ cluster. You get the mean of the centroid is a p-dimensional vector comprising the mean of each of the variables. \n\nThe objective is to minimize the total within cluster variation, that is the total sum of the distance of each observations to its centroid and that for each of the k-centroids: \n$$tot.withinss = \\sum_{k=1}^k W(C_k) = \\sum_{k=1}^k \\sum_{x_i \\in C_k} (x_i - \\mu_k)^2$$ {#eq-total-within-cluster-sum-of-square}\n\n# The Kmeans algorithms \n\nAlthough there are 3 different ways to achieve the objective function, we'll only explain the *Hartingan-Wonk algorithm* (1979). \nIt is noted that none of the algorithms guarantees a global minima. They all can be stuck into a local minima. \n\n\n* **Step 1** Choose the number of centroids \n* **Step 2: Cluster assignment step** All observations are randomly assigned a cluster (the first k centroids). Another way to start is to randomly pick k observations that will be used as first centroids in the given hyperspace. \n* **Step 3: Calculate the centroid location** Find the centroid location by calculating the mean value for each of the variables. \n* **Step 4: Find distance to each centroids** Find the distance between each observation and each centroid. Assign the centroid with minimum distance to that observation.\n* **Step 5: Centroids update** Recalculate the centroid location if there was a change of centroid mean. In other words: calculate a new mean for each centroids using all the $x_i \\in C_k$.\n* **step 6: Reaching convergence** Repeat step 4 and 5. Once the new mean is established, we calculate again, for each observations its closest centroid. Then calculate again a new mean. We reiterate this step this until each given centroids are not changing anymore (or the change is below a given threshold). \n\nBecause of the initial random allocation of points to a cluster, results might differ. 
# The Kmeans algorithm 

Although there are a few different algorithms that pursue this objective (base R's *kmeans()* offers Hartigan-Wong, Lloyd/Forgy and MacQueen), we'll only explain the *Hartigan-Wong algorithm* (1979). 
Note that none of these algorithms guarantees a global minimum; they can all get stuck in a local minimum. 

* **Step 1: Choose the number of centroids (k).** 
* **Step 2: Cluster assignment step.** All observations are randomly assigned to one of the k clusters (this gives the first k centroids). Another way to start is to randomly pick k observations and use them as the initial centroids in the given hyperspace. 
* **Step 3: Calculate the centroid locations.** Find each centroid's location by calculating the mean value of each of the variables over the observations assigned to it. 
* **Step 4: Find the distance to each centroid.** Find the distance between each observation and each centroid. Assign each observation to the centroid with the minimum distance.
* **Step 5: Centroid update.** Recalculate the centroid locations whenever an observation changes centroid. In other words: calculate a new mean for each centroid using all the $x_i \in C_k$.
* **Step 6: Reaching convergence.** Repeat steps 4 and 5. Once the new means are established, we again find, for each observation, its closest centroid, then recalculate the means. We iterate until the centroid assignments no longer change (or the change falls below a given threshold). 

Because of the initial random allocation of points to a cluster, results might differ between runs. This is why the base R *kmeans()* function lets you specify the *nstart* parameter: the algorithm is run with several random starts and only the allocation providing the minimum total within-cluster distance is kept. 

## Kmeans - Hartigan-Wong in practice 

::: {.cell}

```{.r .cell-code}
# setting up the problem 
library(tibble)
library(dplyr)
library(ggplot2)
library(glue)

num_obs = 50
set.seed(1234)
df <- tibble(x = runif(num_obs, -10, 10), y = runif(num_obs, -10, 10))

# Step 1: choose the number of centroids
k = 3

set.seed(441)
# Step 2: randomly assign each point to a cluster
df <- df |> mutate(centroid = sample(1:k, n(), replace = TRUE))
```
:::

We are going to define a couple of functions that we will use more than once. 

::: {.cell}

```{.r .cell-code}
# mean location (in each dimension) of each centroid
calculate_centroid_mean <- function(df) {
  yo <- df |> group_by(centroid) |> 
    summarize(mean_x = mean(x), mean_y = mean(y)) 
  #print(yo)
  return(yo)
}

# total within-cluster distance, using the centroid_loc table from the global environment
total_within_cluster_distance <- function(df) {
  yo <- df |> 
    mutate(distance = (x - centroid_loc$mean_x[centroid])^2 + 
             (y - centroid_loc$mean_y[centroid])^2)
  return(sum(yo$distance))
}

# initialize an empty vector for the distances from an observation to each centroid
dist_centroid <- c()
```
:::

::: {.cell}

```{.r .cell-code}
# Starting the iteration process 

# Step 3: 
## a: calculate the mean of each centroid
centroid_loc <- calculate_centroid_mean(df)

## b: calculate the sum of the distances between each observation and its assigned cluster 
print(glue('Initial sum of within-cluster distance is: ', 
           round(total_within_cluster_distance(df), 2)))
```

::: {.cell-output .cell-output-stdout}
```
Initial sum of within-cluster distance is: 3014.94
```
:::

```{.r .cell-code}
rd = 1 # to keep track of how many rounds/loops we are doing
i = 0  # to keep track of how many observations have not changed their centroid

# we keep running the loop until there is no more change of centroid 
while (i < num_obs) { 
  
  i = 0 # we keep going until no observation changes its centroid anymore
  
  # Step 4: for each data point 
  for (obs in 1:num_obs) { 
    
    # for each centroid 
    for (centroid in 1:k) { 
      # find the distance from the observation to the centroid 
      dist_centroid[centroid] = sqrt((df$x[obs] - centroid_loc$mean_x[centroid])^2 + 
                                       (df$y[obs] - centroid_loc$mean_y[centroid])^2) 
      # print(glue(' The distance from point ', obs, ' to centroid ', centroid, ' is ', round(dist_centroid[centroid], 2))) 
    } 
    
    # assign the observation to its new centroid (based on the minimum distance) 
    prev_centroid = df$centroid[obs] 
    post_centroid = which.min(dist_centroid) 
    df$centroid[obs] = post_centroid # assign the new centroid 
    
    if (prev_centroid != post_centroid) { 
      # Step 5: recalculate the centroid locations
      centroid_loc <- calculate_centroid_mean(df) 
      print(glue(' The initial centroid for point ', obs, ' was ', 
                 prev_centroid, '. It is now ', post_centroid)) 
      # print(centroid_loc) 
    } else {
      i = i + 1
      #print(' No change in centroid')
    }
  }
  rd = rd + 1
  print(glue('Round ', rd, ' The new sum of within-cluster distance is: ', 
             round(total_within_cluster_distance(df), 2)))
}
```
::: {.cell-output .cell-output-stdout}
```
 The initial centroid for point 3 was 3. It is now 1
 The initial centroid for point 5 was 2. It is now 3
 The initial centroid for point 6 was 3. It is now 1
 The initial centroid for point 8 was 1. It is now 2
 The initial centroid for point 10 was 1. It is now 2
 The initial centroid for point 11 was 1. It is now 2
 The initial centroid for point 12 was 2. It is now 3
 The initial centroid for point 13 was 2. It is now 3
 The initial centroid for point 14 was 1. It is now 3
 The initial centroid for point 15 was 2. It is now 3
 The initial centroid for point 16 was 1. It is now 2
 The initial centroid for point 19 was 1. It is now 3
 The initial centroid for point 20 was 3. It is now 2
 The initial centroid for point 21 was 1. It is now 3
 The initial centroid for point 23 was 1. It is now 3
 The initial centroid for point 24 was 1. It is now 2
 The initial centroid for point 25 was 1. It is now 3
 The initial centroid for point 26 was 2. It is now 1
 The initial centroid for point 28 was 1. It is now 3
 The initial centroid for point 30 was 3. It is now 2
 The initial centroid for point 31 was 1. It is now 2
 The initial centroid for point 32 was 1. It is now 2
 The initial centroid for point 33 was 2. It is now 3
 The initial centroid for point 34 was 1. It is now 2
 The initial centroid for point 35 was 2. It is now 3
 The initial centroid for point 37 was 3. It is now 2
 The initial centroid for point 38 was 2. It is now 3
 The initial centroid for point 39 was 3. It is now 1
 The initial centroid for point 44 was 2. It is now 3
 The initial centroid for point 45 was 2. It is now 3
 The initial centroid for point 48 was 2. It is now 3
 The initial centroid for point 49 was 2. It is now 3
Round 2 The new sum of within-cluster distance is: 1594.37
 The initial centroid for point 2 was 3. It is now 1
 The initial centroid for point 3 was 1. It is now 2
 The initial centroid for point 5 was 3. It is now 1
 The initial centroid for point 9 was 3. It is now 1
 The initial centroid for point 14 was 3. It is now 1
 The initial centroid for point 17 was 1. It is now 3
 The initial centroid for point 28 was 3. It is now 1
 The initial centroid for point 37 was 2. It is now 3
 The initial centroid for point 41 was 3. It is now 1
 The initial centroid for point 44 was 3. It is now 1
Round 3 The new sum of within-cluster distance is: 1146.17
 The initial centroid for point 7 was 2. It is now 3
 The initial centroid for point 12 was 3. It is now 1
 The initial centroid for point 32 was 2. It is now 3
Round 4 The new sum of within-cluster distance is: 1081.9
 The initial centroid for point 18 was 2. It is now 3
Round 5 The new sum of within-cluster distance is: 1071.19
Round 6 The new sum of within-cluster distance is: 1071.19
```
:::

```{.r .cell-code}
print(centroid_loc)
```

::: {.cell-output .cell-output-stdout}
```
# A tibble: 3 × 3
  centroid  mean_x mean_y
     <int>   <dbl>  <dbl>
1        1  4.41    -4.99
2        2 -0.0601   5.27
3        3 -5.04    -5.47
```
:::

```{.r .cell-code}
ggplot(df, aes(x, y)) + 
  geom_point(aes(color = as.factor(centroid))) + 
  geom_point(data = centroid_loc, aes(x = mean_x, y = mean_y)) + 
  theme(legend.position = 'none')
```

::: {.cell-output-display}
![](index_files/figure-html/unnamed-chunk-3-1.png){width=672}
:::
:::
::: {.cell}

```{.r .cell-code}
# comparing with the base R implementation (Hartigan-Wong, 10 random starts)
yoo <- kmeans(df[, 1:2], centers = 3, 
              algorithm = 'Hartigan-Wong', nstart = 10)
yoo$tot.withinss
```

::: {.cell-output .cell-output-stdout}
```
[1] 1017.417
```
:::

```{.r .cell-code}
library(broom)
df2 <- df
augment(yoo, df2) |> 
  ggplot(aes(x, y)) + 
  geom_point(aes(color = .cluster)) + 
  geom_point(data = as_tibble(yoo$centers), aes(x, y)) + 
  theme(legend.position = 'none')
```

::: {.cell-output-display}
![](index_files/figure-html/unnamed-chunk-4-1.png){width=672}
:::
:::

Note that with 10 random starts, the base R implementation reaches a slightly lower total within-cluster sum of squares (1017.4) than our single-start hand-rolled loop (1071.2). 

# The parameters 

While using K-means, there are 3 main parameters one can tune: 

* the number of clusters: this is the main one to tune, to avoid both under-fitting (too low a K) and over-fitting (too high a K). If K equals the number of observations, the total within-cluster variation is of course 0 and trivially minimized! 
* the number of iterations 
* the number of starts

## Number of clusters 

### The elbow method 

The idea is to identify where the drop in the total within-cluster sum of squares starts to slow down. Of course, the total within-cluster sum of squares decreases as the number of centroids increases. If we have n centroids (that is $k = n$, as many centroids as observations), the total within-cluster sum of squares will be 0. And if we have only one centroid, the total within-cluster sum of squares is the sum of squared deviations of the observations from the mean of each of the variables. So the question becomes: at which point does adding a centroid no longer significantly reduce the total within-cluster sum of squares? That point is the "elbow".
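To illustrate, here is a minimal sketch (not part of the original post) that reuses the toy data frame `df` and the packages loaded above, computes `tot.withinss` for k = 1 to 10 with base R's `kmeans()`, and plots the curve whose bend suggests a reasonable k:

```r
# Elbow method sketch: tot.withinss as a function of k, on the toy data above
elbow <- tibble(k = 1:10) |> 
  mutate(tot_withinss = sapply(k, function(kk) {
    kmeans(df[, c('x', 'y')], centers = kk, nstart = 10)$tot.withinss
  }))

ggplot(elbow, aes(k, tot_withinss)) + 
  geom_line() + 
  geom_point() + 
  scale_x_continuous(breaks = 1:10) + 
  labs(y = 'Total within-cluster sum of squares')
```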
"supporting": [
"index_files"
## Number of starts 
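As used in the comparison above, the `nstart` argument reruns the whole procedure from several random initial allocations and keeps the run with the lowest `tot.withinss`, which mitigates the local-minimum issue mentioned earlier:

```r
# running kmeans from 25 random starts and keeping the best allocation
kmeans(df[, c('x', 'y')], centers = 3, nstart = 25)
```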
"filters": [
"rmarkdown/pagebreak.lua"