# A Summary of Clustering Methods

This post a summary of clustering methods I learned from Data Mining course at UMN.

Basically, clustering methods can be divided into two types: hierarchical and partitional.

## Type of Clusters

Well-Separated Clusters

A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. Center-based clusters

A cluster is a set of objects such that an object in a cluster is closer (moresimilar) to the “center” of a cluster, than to the center of any other cluster.

The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster. Contiguity-based clusters A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Density-based clusters A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present.

## Clusters Algorithms

More about K-meanas and Hierarchical clustering can be found in another post.

### K-means Clustering

Sum of Squared Error (SSE) is usually used to evaluate K-means.

$SSE=\sum_{i=1}^K\sum_{x\in C_i}\text{dist}^2(m_i,x)$

K-means has problems when clusters are of differing sizes, densities and non-globular shapes.

K-means has problems when the data contains outliers.

Choosing initial centroids is very important for K-means.

### Hierarchical Clustering

Strengths of Hierarchical Clustering:

• Do not have to assume any particular number of clusters.
• They may correspond to meaningful taxonomies.

Key operation is the computation of the proximity of two clusters.

We have the following ways to define the inter-cluster distances: MIN, MAX, Group Average, Distance Between Centroids

MIN: Single link, can handle non-elliptical shapes. However, single link is sensitive to noise and outliers MAX: Complete link, less susceptible to noise and outliers. The limitations of MAX is:

• Tends to break large clusters
• Biased towards globular clusters Group Average: is a compromis between single and complete Link.

Ward’s Method: Similarity of two clusters is based on the increase in squared error when two clusters are merged.

### DBSCAN

lDBSCAN is a density-based algorithm.

• Density = number of points within a specified radius (Eps)
• A point is a core point if it has at least a specified number of points (MinPts) within Eps
• A border point is not a core point, but is in the neighborhood of a core point
• A noise point is any point that is not a core point or a border point The DBSCAN algorithm is resisitant to noise, and can handle clusters of different shapes and sizes. DBSCAN can fail when the data has varying densities and high-dimensional data.

## Cluster Validity

Entropy and SSE can be used to evaluate the performance of clustering.