Unsupervised Learning Methods

Unsupervised learning is a class of machine learning techniques used to discover hidden patterns in a dataset that has no predefined labels, with minimal human supervision.

Unlike supervised learning, it does not require labelled training data or guidance for the machine. Instead, the machine tries to find the hidden structure in the data on its own.

In unsupervised learning, we group cases according to the data itself. Groups are often formed using density estimation, which builds clusters of the dataset based on the distribution of the data.

In this article, we will give an overview of the different techniques and methods used in unsupervised learning, and look at their applications and use cases.

Some of the most common classes of algorithms used in unsupervised learning include:

1. Clustering: 

The most common unsupervised learning technique is clustering. In this technique, you group the data into different clusters so that similar data points end up in the same cluster. Its main goal is to put all similar data points into a single cluster and, in doing so, label an otherwise unlabelled dataset.

It’s commonly used in exploratory data analysis to learn about the structure of the data.

There are different ways to find the best clusters for different types of data.

Types of Clustering:

1. Centroid Based Clustering: 

As the name suggests, this approach depends on centroids, i.e., cluster centres. Data points are assigned to clusters so that the distance from each point to the centre of its cluster (the centroid) is minimal. The algorithm minimizes a dissimilarity (distance) function.

K-means is the most commonly used centroid-based clustering algorithm.

It’s an effective, efficient and simple clustering algorithm, but it is sensitive to outliers and to the initial conditions.

Centroid-based clustering is used in exploratory data analysis and image preprocessing.
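As a sketch, K-means with scikit-learn might look like the following; the blob data here is synthetic, generated purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate three well-separated, synthetic clusters of 2-D points.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-means with k=3; each point is assigned to its nearest centroid.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # (3, 2): one 2-D centroid per cluster
print(len(set(labels)))               # 3 distinct cluster labels
```

Note the sensitivity to initialization mentioned above: scikit-learn mitigates it by running `n_init` random restarts and keeping the best result.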

2. Distribution Based Clustering:

In distribution-based clustering, we assume the data follows a distribution such as the normal (Gaussian) distribution. Data points have maximum density at the distribution centre, and as the distance from the centre increases, the density of data points decreases. So the probability that a data point belongs to a cluster decreases as its distance from the distribution centre increases.

Unlike K-means, distribution-based clustering does not require all clusters to be round in shape.

It also overcomes another limitation of K-means: in K-means, each data point is assigned to exactly one cluster, and clusters cannot overlap. Distribution-based clustering uses soft clustering, so a data point can belong to more than one cluster and clusters can overlap.

The Gaussian mixture model (GMM) is one of the most popular algorithms based on the distribution-based clustering technique.
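A minimal GMM sketch on synthetic data, showing the soft assignments described above (each point gets a probability of membership in every component rather than a single hard label):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data: three Gaussian-like blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft clustering: each row gives the membership probability per component.
probs = gmm.predict_proba(X)
print(probs.shape)                           # (300, 3)
print(np.allclose(probs.sum(axis=1), 1.0))   # True: probabilities sum to 1 per point
```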

3. Density Based Clustering:

In the density-based clustering technique, we divide data points based on the different density regions present in the data space. Clusters can take different shapes depending on the density in each part of the space: the algorithm treats regions of high density that are separated from one another by regions of low density as separate clusters.

The performance of these models degrades on high-dimensional data and when the data density varies.

These algorithms are robust to outliers, i.e., they don’t assign outliers to any cluster. They can also easily find clusters of arbitrary shape and size.

DBSCAN is the most popular clustering algorithm that works on the principle of density-based clustering, and it is widely used on large, multidimensional databases.
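A short DBSCAN sketch on the classic two-moons toy dataset, which illustrates both properties above: arbitrary (non-spherical) cluster shapes, and noise points left outside every cluster (label `-1`). The `eps` and `min_samples` values are illustrative choices for this particular data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that K-means would split badly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; DBSCAN does not force them into a cluster.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)  # 2
```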

4. Hierarchical Clustering:

This algorithm is named hierarchical clustering because it eventually builds a hierarchy of clusters. It creates a tree-like structure and assigns every data point to a cluster, and it has the advantage that we don’t need to specify the number of clusters in advance: the number can be chosen afterwards by cutting the tree at the right level.

One of the drawbacks of hierarchical clustering is that it’s too slow for large datasets.

It’s categorized into 2 types:

  1. Agglomerative Hierarchical Clustering: it takes a bottom-up approach.
  2. Divisive Hierarchical Clustering: it takes a top-down approach.
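The bottom-up (agglomerative) variant can be sketched as follows on synthetic data; here the number of clusters is fixed up front, but scikit-learn can equivalently cut the tree by a `distance_threshold` instead:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with three groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Bottom-up clustering: repeatedly merge the closest pair of clusters
# (Ward linkage) until 3 clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

print(len(set(labels)))  # 3
```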

2. Dimensionality Reduction:

Dimensionality is defined as the number of characteristics, columns or features present in the dataset. Dimensionality reduction means reducing the number of features to decrease the complexity of the model. It’s a data preprocessing technique applied to the data prior to modelling.

It is broadly divided into 2 categories: feature selection and feature extraction. In feature selection, we keep the most relevant features and drop the rest. In feature extraction, we derive new features from the pre-existing ones, for example by combining 2 or more features into a new one.

When we have a huge amount of data with a large number of dimensions, models tend to overfit and hence perform poorly; this is known as the Curse of Dimensionality.

There are a few ways to reduce the dimensionality of the data. For example, when two columns are highly correlated, we can drop one of them. Domain knowledge is very important in reducing the dimensionality of the data.

In unsupervised learning, many models have a transform function that reduces the dimensionality of the data. PCA (Principal Component Analysis) and ICA (Independent Component Analysis) are two component-analysis methods for dimensionality reduction.
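A PCA sketch using the transform-style API mentioned above. The data is synthetic and deliberately constructed so that 10 correlated features carry only 3 underlying signals, which is exactly the situation where dimensionality reduction pays off:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 samples, 10 features, but only 3 independent underlying signals
# (plus a little noise) - so the data is effectively 3-dimensional.
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

# Reduce 10 features to 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                              # (200, 3)
print(pca.explained_variance_ratio_.sum() > 0.99)   # True: 3 components keep ~all variance
```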

3. Anomaly detection: 

An anomaly is simply an outlier, irregularity, noise or uncertainty in the data. Anomaly detection refers to detecting abnormal data points in the dataset.

Anomaly detection in unsupervised learning does not require training data with manual labels. These methods are usually based on the statistical assumption that most of the incoming data is normal and only a small percentage is malicious, so anomalous data stands out as different from normal data.

There are also use cases where we train a model to detect or identify anomalies or any unusual behaviour in the data, for example to detect fraud and suspicious activity. Since anomalous behaviour often represents a problem, anomaly detection is used in credit card fraud detection, medical diagnosis, cyber-attack detection, etc.

Some of the types of anomaly detection are:

  • Time-series anomaly detection
  • Video-level anomaly detection
  • Image-level anomaly detection 

4. Association Rule-Mining:

The association rule algorithm is a rule-based learning technique that learns hidden structure on its own, since the data is not labelled. It’s a descriptive method used to discover hidden information and interesting facts in large datasets. The information is presented in the form of rules.

Association rule mining helps to find frequently occurring items, or frequently occurring combinations of items, in a huge set of items. There are 3 main terms used frequently in association rules:

  • Support: the probability that a specific itemset occurs in a transaction.
  • Confidence: the conditional probability that the consequent itemset occurs given that the antecedent itemset occurs.
  • Lift: the ratio of a rule’s confidence to the support of its consequent; a lift above 1 indicates a positive association.

So, it is one of the ways to find patterns in data, i.e., to find features (dimensions) that occur together and features (dimensions) that correlate.
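The three measures can be computed by hand on a toy set of transactions (the items and transactions below are made up for illustration), here for the rule bread → milk:

```python
# Toy transactions: each is the set of items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule: bread -> milk
sup_bread = support({"bread"})         # 4/5 = 0.8
sup_milk = support({"milk"})           # 4/5 = 0.8
sup_both = support({"bread", "milk"})  # 3/5 = 0.6

confidence = sup_both / sup_bread      # P(milk | bread) = 0.75
lift = confidence / sup_milk           # 0.75 / 0.8 = 0.9375

print(confidence, lift)  # 0.75 0.9375
```

Here the lift is below 1, so buying bread actually makes milk slightly *less* likely than its baseline rate in this toy data.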

Two common approaches in association rule mining are:
  • Apriori Algorithm: used to find frequently occurring combinations of itemsets.
  • Market Basket Analysis (MBA): used to find combinations of fast-moving and slow-moving itemsets (products) to sell together as a combo.

Its main applications are basket data analysis, cross-marketing, catalog design, etc.

written by: Sachin Yadav

reviewed by: Vikas Bhardwaj
