The Importance Of Cluster Analysis

1. Introduction
Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar, in some sense, to each other than to objects in other groups. Cluster analysis is an unsupervised form of learning, which means that it does not use class labels. This distinguishes it from methods such as discriminant analysis, which use class labels and fall under supervised learning. K-means, one of the simplest and most popular clustering algorithms, was first published in 1955, more than 50 years ago.
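To make the idea of K-means concrete, here is a minimal sketch of the usual assign-then-update iteration in plain Python. The function names, the toy points and the choice to let the caller supply the initial centroids are assumptions made for illustration; they are not taken from the original publication.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centroids, max_iter=100):
    """Return (centroids, labels) once the assignments stop changing."""
    centroids = list(initial_centroids)  # assumption: caller picks the start
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, labels


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = k_means(points, initial_centroids=[points[0], points[3]])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Note that no class labels are passed in: the grouping comes only from the distances between points, which is what makes the method unsupervised.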
Advances in technology have produced many high-volume, high-dimensional data sets. These huge data sets provide an opportunity for automatic data analysis, classification …

Learning means that, given a training data set, we want to predict the class labels of a test data set. Apart from supervised learning, in which class labels are known, and unsupervised learning, in which they are unknown, there is a third, hybrid type called semi-supervised learning. Here class labels are available for only a portion of the training set. Rather than discarding the large unlabelled portion, the learning process uses it as well. Instead of class labels, pair-wise constraints can be used: a must-link constraint states that two objects should be assigned to the same cluster, while a cannot-link constraint states that the cluster labels of two objects should be different.
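The two constraint types can be expressed as pairs of object indices. The sketch below checks a candidate cluster assignment against them; the function name and the toy data are hypothetical and serve only to illustrate the definitions.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the constraints that a cluster assignment breaks.

    labels[i] is the cluster id of object i; each constraint is a pair
    of object indices.
    """
    broken = []
    for i, j in must_link:
        if labels[i] != labels[j]:  # the pair should share a cluster
            broken.append(("must-link", i, j))
    for i, j in cannot_link:
        if labels[i] == labels[j]:  # the pair should be separated
            broken.append(("cannot-link", i, j))
    return broken


labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 3), (2, 3)]
print(violated_constraints(labels, must_link, cannot_link))
# prints [('must-link', 1, 2), ('cannot-link', 2, 3)]
```

A constrained clustering algorithm would try to produce an assignment for which this list is empty, or as short as possible.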

2. Data Clustering
"The goal of cluster analysis is to discover natural grouping of a set of patterns, points or objects."
Clustering can be defined on the basis of similarity, such that the intraclass variation is low while the interclass variation is high. Clusters differ in shape, size and density. Noise in the data makes clusters even more difficult to detect. "An ideal cluster can be defined as a set of points that is compact and isolated." In reality, interpreting a cluster requires domain knowledge. Even though humans can spot clusters in two and three dimensions, algorithms are required for high-dimensional data. In addition to this, the number …
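One way to quantify "low intraclass, high interclass variation" is to compare sums of squared distances within and between clusters. The sketch below assumes squared Euclidean distance and cluster means as centres, which is one common choice rather than the only possible one; the function names and data are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_and_between(points, labels):
    """Return (within-cluster, between-cluster) sums of squared distances."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    overall = mean_point(points)
    within = 0.0
    between = 0.0
    for members in clusters.values():
        center = mean_point(members)
        # Spread of the members around their own cluster centre.
        within += sum(squared_distance(p, center) for p in members)
        # Distance of the cluster centre from the overall mean, weighted by size.
        between += len(members) * squared_distance(center, overall)
    return within, between


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(within_and_between(points, [0, 0, 1, 1]))  # prints (4.0, 100.0)
print(within_and_between(points, [0, 1, 0, 1]))  # prints (100.0, 4.0)
```

The first labelling groups nearby points and gives compact, well-separated clusters; the second pairs distant points and reverses the two quantities.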

Examples: images, text, audio, video, etc. Such data do not follow any specific format. Structured data, on the other hand, has semantic relationships within objects. Most clustering approaches use a vector-based feature representation rather than the structure within the objects.
Clustering ensembles
This approach, used earlier in supervised learning, is now also applied to unsupervised learning. Multiple partitions, called a clustering ensemble, are obtained by taking multiple looks at the same data. Combining these partitions can give a good partitioning result even if the individual partitions were not good enough. The partitions can be generated in various ways: by applying different clustering algorithms, by applying the same algorithm with different parameters, or by combining different feature representations and clustering algorithms.
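The text does not say how the partitions are combined. One common option, assumed here purely for illustration, is to count how often each pair of objects shares a cluster (a co-association matrix) and then group objects that co-occur in a majority of partitions. The function names, the threshold and the toy partitions are hypothetical.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
        for i in range(n)
    ]


def consensus_labels(partitions, threshold=0.5):
    """Link pairs that co-occur in more than `threshold` of the partitions,
    then label the connected components of the resulting graph."""
    co = co_association(partitions)
    n = len(co)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first walk over strongly co-occurring objects
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three imperfect partitions of six objects (cluster ids are arbitrary).
partitions = [
    [0, 0, 1, 2, 2, 2],  # splits object 2 off on its own
    [0, 0, 0, 1, 1, 2],  # splits object 5 off on its own
    [0, 0, 0, 0, 1, 1],  # puts object 3 in the wrong group
]
print(consensus_labels(partitions))  # prints [0, 0, 0, 1, 1, 1]
```

None of the three input partitions separates objects 0 to 2 from objects 3 to 5 exactly, yet the combined result does, because each individual error is outvoted by the other two partitions.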
Semi-supervised learning
Any extra information supplied along with the n x d pattern matrix or the n x n similarity matrix helps in finding a good clustering. An algorithm that uses this extra information is said to operate in a semi-supervised mode of learning. The side information can be specified as constraints such as must-link and cannot-link, or as seeding, where a small amount of labelled data is given along with a large amount of unlabelled data.
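As a simplified illustration of seeding, the sketch below builds one centroid per class from the few labelled points and then gives every point the label of its nearest centroid. It performs a single assignment pass only; a fuller seeded method would typically keep iterating as K-means does. All names and data are assumptions made for the example.

```python
def seed_centroids(points, seed_labels):
    """Average the labelled points of each class to get initial centroids.

    seed_labels maps a point index to its known class label.
    """
    groups = {}
    for idx, lab in seed_labels.items():
        groups.setdefault(lab, []).append(points[idx])
    return {
        lab: tuple(sum(c) / len(members) for c in zip(*members))
        for lab, members in groups.items()
    }


def assign_to_nearest(points, centroids):
    """Give every point the label of its nearest centroid."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        min(centroids, key=lambda lab: sq_dist(p, centroids[lab]))
        for p in points
    ]


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
seed_labels = {0: "a", 3: "b"}  # only two of the six points are labelled
centroids = seed_centroids(points, seed_labels)
print(assign_to_nearest(points, centroids))
# prints ['a', 'a', 'a', 'b', 'b', 'b']
```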
