K-Means

k-means algorithm is commonly used in clustering. Clustering is a technique which is used in identifying the same subgroups of data points with the help of Euclidean-based distance or correlation-based distance.

Cluster is used in market segmentation, image segmentations etc.  

K-means algorithm is used an iterative approach that tries to divide the dataset into K pre-defined distinct non-overlapping subgroups where each data point belongs to only one group. It will try to make the intra-cluster data points as similar as possible while also keeping the clusters as different as possible.

however, It assigns dataset’s points to a cluster such that the sum of the squared distance between the dataset’s points and the clusters is at the minimum distance. The less variation we have within clusters, the more homogeneous the dataset’s points are within the same subgroups(cluster).

steps in the K-Means algorithm:

We have K number of clusters, Initialize centroids by first shuffling the dataset(data points) and then randomly selecting K data points for the centroids without replacement. Keep updating until there is no change to the centroids. So that assignment of data points to (subgroup) clusters isn’t changing.

  • Calculate the sum of the squared distance between data points and centroids.
  • Assign every data point to the nearest cluster (centroids).
  • Calculate the centroids for the (subgroups) clusters by taking the average of all data points that belong to each cluster.

There are two approaches k-means follows to solve the problem refer to asExpectation-Maximization. The E-step is assigning the data points to the nearest (subgroup) cluster. The M-step is calculating the centroids of each (subgroup) cluster. 

The objective function is:

where wik=1 for data point in the datasets(data points)  xi if it belongs to (subgroups)cluster k; otherwise, wik=0.

μk:  is the centroids of xi’s cluster.

It is a minimization problem of two parts. First, we have to minimize J w.r.t. wik and treat μk fixed. After that,  we will minimize J w.r.t. μk and treat wik fixed. we differentiate J w.r.t. wik first and update cluster assignments (E-step). Then we differentiate J w.r.t. μk and recompute the centroids after the cluster assignments from the previous step (M-step). Therefore, E-step is:

 

Written By: Amit Gupta

Reviewed By: Savya Sachi

If you are Interested In Machine Learning You Can Check Machine Learning Internship Program
Also Check Other Technical And Non Technical Internship Programs

Leave a Comment

Your email address will not be published. Required fields are marked *