What is K-Means Clustering?
![K-Means clustering illustration](https://i.stack.imgur.com/KPjMy.png)
K-Means Clustering is the partitioning of a set of observations into subsets (called clusters) such that observations in the same cluster are similar to one another. Clustering is a form of unsupervised learning used in many fields and a popular methodology for computational data analysis.
K-means Clustering
K-Means Clustering is an algorithm for classifying or grouping objects into K groups based on their attributes/features, where K is a positive integer. Given these features, the algorithm iteratively assigns each data point to one of the K groups, so that data points are clustered on the basis of feature similarity.
![K-Means clustering example](https://www.gatevidyalay.com/wp-content/uploads/2020/01/K-Means-Clustering.png)
The classification is achieved by minimizing the sum of squared distances between the data points and the corresponding cluster centroid. K-means clustering is thus also intended to characterize the structure of the data.
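Concretely, with clusters $C_1, \dots, C_K$ and centroids $\mu_1, \dots, \mu_K$ (notation introduced here for illustration), the objective being minimized is the within-cluster sum of squares:

$$J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$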
K-Means Clustering can be used to:
- Find the centroids of K clusters, which can be used to label new data.
- Assign labels to these clusters, so that each data point in the dataset can be given a label.
Each cluster centroid is a set of feature values that characterizes the resulting group. Examining the centroid's feature weights can be used to interpret qualitatively what kind of group each cluster represents.
Algorithm of K-means Clustering
The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm's inputs are the number of clusters K and the data set, which is a collection of feature values for each data point. The algorithm begins with initial estimates of the K centroids, which can be either generated randomly or chosen randomly from the data set.
These are the steps to be followed:
- Start by deciding on the value of k, that is, the number of clusters.
- Choose any initial partition that classifies the data into k clusters. You may assign the training samples randomly or systematically as follows:
- Take the first k training samples as single-element clusters.
- Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
- In sequence, take each sample and measure its distance from the centroid of each cluster. If a sample is not already in the cluster with the nearest centroid, move it to that cluster and update the centroids of both the cluster gaining the sample and the cluster losing it.
- Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
- If the number of data points is less than the number of clusters, each data point is allocated as a cluster centroid, and each centroid is given a cluster number.
- If the number of data points is greater than the number of clusters, we compute the distance from each data point to every centroid and take the minimum; the data point then belongs to the cluster whose centroid is nearest.
The algorithm for this method can be defined as follows (a runnable sketch is given after the list):
Iterate until stable:
- Determine the centroid coordinates.
- Determine the distance of each object to the centroids.
- Group the objects based on minimum distance.
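To make the loop above concrete, here is a minimal NumPy sketch of the iterate-until-stable procedure (the function and variable names are our own, not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initial centroid estimates: k points chosen randomly from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Determine the distance of each object to the centroids.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Group each object with its nearest centroid.
        labels = distances.argmin(axis=1)
        # Recompute the centroid coordinates from each group
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once a full pass causes no centroid movement (stable).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs should be recovered as two clusters.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```

Choosing the initial centroids by sampling from the data, as done here, is one common option; generating purely random coordinates is another, as noted above.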
Choosing the Value of K
For a particular pre-chosen K, the algorithm described above finds the clusters and data-set labels. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no formula for calculating the exact value of K, but a reliable estimate can be obtained using the following methods.
The mean distance between data points and their cluster centroid is one of the metrics most widely used to compare outcomes across different K values. Since increasing the number of clusters always decreases the distance to the data points, increasing K always decreases this metric, reaching zero when K equals the number of data points. Therefore, this metric cannot be used as the sole goal. Instead, the mean distance to the centroid is plotted as a function of K, and the “elbow point,” where the rate of decrease changes abruptly, can be used to estimate K roughly.
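As a sketch of the elbow method, assuming scikit-learn and matplotlib are available (scikit-learn's KMeans exposes the sum of squared distances to the nearest centroid as inertia_):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data whose true K (4) is unknown to the algorithm.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)
# inertia_ always decreases as K grows, so we look for the bend
# in the curve rather than its minimum.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow method: look for the bend")
plt.show()
```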
For validating K, there are a variety of other methods, including cross-validation, information criteria, the information-theoretic jump method, the silhouette method, and the G-means algorithm. Furthermore, tracking the distribution of data points across clusters gives insight into how the algorithm splits the data for each K. One of these, the silhouette method, is sketched below.
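A minimal sketch of the silhouette method, assuming scikit-learn: silhouette_score measures how well each point fits its assigned cluster relative to the next-nearest one.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# The silhouette is defined only for 2 or more clusters.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette ranges from -1 to 1; pick the K that maximizes it.
    print(k, silhouette_score(X, labels))
```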
Advantages and Disadvantages of K-Means Clustering
Advantages of K-means clustering include:
- It guarantees convergence.
- The K-means algorithm can quickly adapt to changes: if the data is modified, the cluster assignments can be updated quickly.
- K-means is simple to implement, making it easy to identify unknown groups in complex data sets, and its results are easy to interpret.
- This mode of clustering works well with spherical clusters, because it implicitly assumes that within each cluster the features have similar variance and are distributed independently of one another.
Disadvantages of K-means clustering include:
- K-means does not determine the optimal number of clusters on its own; you must settle on K beforehand for successful results.
- Different runs of the algorithm can give different results: a random initial choice of cluster centroids produces different clustering outcomes, which leads to inconsistency.
- Modifying or rescaling the dataset by either normalization or standardization will completely alter the final results (see the sketch below).
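A minimal sketch of this sensitivity, assuming scikit-learn: clustering the same data with and without standardization generally produces different partitions, which an adjusted Rand index below 1 makes visible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales: the second one dominates
# Euclidean distance unless the data is rescaled.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# An adjusted Rand index near 1 would mean the two partitions agree;
# lower values show that rescaling altered the clustering.
print(adjusted_rand_score(raw, scaled))
```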
Written By: Chaitanya Virmani
Reviewed By: Krishna Heroor