A Summary of Clustering Methods
This post is a summary of the clustering methods I learned from the Data Mining course at UMN.
Basically, clustering methods can be divided into two types: hierarchical and partitional.
Types of Clusters
Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
Center-based clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster.
The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster.
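To make the distinction concrete, here is a small NumPy sketch (the points are made up for illustration):

```python
import numpy as np

# The centroid is the mean of the points, while the medoid is the
# actual point that minimizes the total distance to all other points.
cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 3.0], [10.0, 10.0]])

centroid = cluster.mean(axis=0)  # may not coincide with any data point

# Pairwise distances; the medoid is the row with the smallest sum.
dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
medoid = cluster[dists.sum(axis=1).argmin()]

print("centroid:", centroid)  # pulled toward the outlier (10, 10)
print("medoid:  ", medoid)    # an actual point, more robust here
```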
Contiguity-based clusters
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
Density-based clusters
A cluster is a dense region of points that is separated from other high-density regions by regions of low density. This definition is used when the clusters are irregular or intertwined, and when noise and outliers are present.
Clustering Algorithms
More about K-means and hierarchical clustering can be found in another post.
K-means Clustering
The Sum of Squared Error (SSE) is usually used to evaluate K-means:
\[SSE=\sum_{i=1}^K\sum_{x\in C_i}\text{dist}^2(m_i,x)\]
where \(m_i\) is the centroid of cluster \(C_i\).
K-means has problems when clusters have differing sizes, differing densities, or non-globular shapes.
K-means has problems when the data contains outliers.
Choosing good initial centroids is very important for K-means; a common remedy is to run it several times with different random initializations and keep the run with the lowest SSE, as sketched below.
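A minimal scikit-learn sketch of these points (the data is synthetic and the parameter values are illustrative, not prescriptive):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# n_init runs K-means from several random initializations and keeps
# the run with the lowest SSE, which mitigates the sensitivity to
# the initial centroids noted above.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# inertia_ is the SSE from the formula above: the sum of squared
# distances of each point to its closest centroid.
print("SSE:", km.inertia_)
```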
Hierarchical Clustering
Strengths of Hierarchical Clustering:
- It does not require committing to a particular number of clusters in advance.
- The resulting dendrograms may correspond to meaningful taxonomies.
The key operation is the computation of the proximity of two clusters.
There are several ways to define the inter-cluster distance: MIN, MAX, Group Average, and Distance Between Centroids.
MIN: Single link; it can handle non-elliptical shapes.
However, single link is sensitive to noise and outliers.
MAX: Complete link, less susceptible to noise and outliers.
The limitations of MAX are:
- Tends to break large clusters
- Biased towards globular clusters
Group Average: a compromise between single and complete link.
Ward’s Method: Similarity of two clusters is based on the increase in squared error when two clusters are merged.
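These linkage criteria map directly onto SciPy's agglomerative clustering. A small sketch on toy data (the data, method list, and cut level are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # toy data

# single = MIN, complete = MAX, average = Group Average,
# centroid = Distance Between Centroids, ward = Ward's Method.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # bottom-up merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
```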
DBSCAN
DBSCAN is a density-based algorithm.
- Density = number of points within a specified radius (Eps)
- A point is a core point if it has at least a specified number of points (MinPts) within Eps
- A border point is not a core point, but is in the neighborhood of a core point
- A noise point is any point that is not a core point or a border point
The DBSCAN algorithm is resistant to noise and can handle clusters of different shapes and sizes.
DBSCAN can fail when the clusters have varying densities or when the data is high-dimensional.
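A minimal sketch of this behavior using scikit-learn's DBSCAN on the classic two-moons toy data, a non-globular case where K-means struggles (the Eps and MinPts values are illustrative and usually need tuning per dataset):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: irregular, intertwined clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps corresponds to Eps, min_samples to MinPts in the text above.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Points labeled -1 are noise; the other labels are cluster ids.
print("clusters found:", len(set(db.labels_) - {-1}))
print("noise points:", (db.labels_ == -1).sum())
```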
Cluster Validity
Entropy (an external measure, which requires ground-truth class labels) and SSE (an internal measure) can be used to evaluate the performance of clustering.
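A sketch of both measures, assuming ground-truth labels are available (entropy needs them; SSE does not):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

sse = km.inertia_  # internal measure: lower means tighter clusters

# Entropy: for each cluster, measure how mixed its true class labels
# are, then take the weighted average over clusters (lower is better).
total_entropy = 0.0
for c in set(km.labels_):
    members = y_true[km.labels_ == c]
    p = np.bincount(members) / len(members)
    p = p[p > 0]  # skip empty classes to avoid log(0)
    total_entropy += len(members) / len(y_true) * -(p * np.log2(p)).sum()

print("SSE:", sse, "entropy:", total_entropy)
```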