Monday, May 18, 2015

Cluster analysis

Cluster analysis, a popular unsupervised learning technique, refers not to a single method but to a family of methods for grouping similar objects together. The aim is to create clusters such that objects in the same cluster are more similar to each other (high intra-cluster homogeneity) than to objects in other clusters (high inter-cluster heterogeneity).

Cluster analysis is tricky. I have heard some people, including some professors, say that they don't like cluster analysis, and I can understand why, but I have come to realize that it may be the most interesting thing I have come across so far. Done right, it can help you discover behavioral patterns that you never knew existed.

In this post, I will explore some basic cluster analysis methods. 

Data set: 

The data set I used is the Wholesale customers data set from the UCI Machine Learning Repository. It contains the annual spending of the clients of a wholesale distributor on different product categories. I read somewhere that it's perhaps one of the best data sets to try for cluster analysis. Refer to the link for more details on the data set: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers

Explore data: 

  • Spending varies more across channels than across regions (see the example graph above). Similar patterns emerge for the other variables.
  • Strong correlations appear between Grocery & Detergents_Paper (0.92), Milk & Grocery (0.73) and Milk & Detergents_Paper (0.66).
  • Graphs such as the one above suggest roughly 3 clusters.
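The correlations listed above are pairwise correlation coefficients between spending columns. As a rough illustration of how such a number is computed, here is a minimal dependency-free sketch of the Pearson coefficient. The post does not say which tool or which coefficient was used, so Pearson is an assumption, and the toy spending figures are made up for illustration, not taken from the real data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical annual spending values (illustration only)
grocery = [100, 200, 300, 400, 500]
detergents_paper = [12, 25, 31, 44, 52]
print(round(pearson(grocery, detergents_paper), 2))
```

A value close to 1 means the two categories tend to rise together, which is what the Grocery & Detergents_Paper pair shows in the real data.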



Hierarchical clustering:




  • The dendrograms from different hierarchical clustering methods roughly split the data into 3 or 4 groups.
  • Some observations are consistently separated from the rest (e.g. 86 and 87, even though it's hard to read here).
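To make the idea behind the dendrograms concrete, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering: every observation starts as its own cluster, and the two closest clusters are merged until the requested number remains. The post does not state which linkage or distance was used, so Euclidean distance with complete linkage is an assumption here, and the points are made-up toy data.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters, linkage=max):
    """Merge the two closest clusters until n_clusters remain.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    Returns a list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # safe because b > a
    return clusters

# Toy data: two tight groups plus one far-away observation
toy = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
print(agglomerative(toy, n_clusters=3))
```

In this toy run the far-away point stays in a cluster of its own, which is the same kind of behavior as the observations that keep separating out in the dendrograms.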

K-means: 

I ran K-means with 3 clusters. I wanted to capture more of the variability in the data, and found that running K-means on Grocery, Detergents_Paper and Milk captures 97% of the variability.
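Here is a minimal dependency-free sketch of K-means (Lloyd's algorithm) together with one common way to measure "variability captured": the ratio of between-cluster sum of squares to total sum of squares. That the 97% figure refers to this ratio is my assumption, since the post does not define it. The function names, the seed and the toy rows are all hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean_point(members)
    return labels, centers

def explained_ratio(points, labels, centers):
    """between_SS / total_SS: share of total variation explained by the clusters."""
    grand_mean = mean_point(points)
    total_ss = sum(sq_dist(p, grand_mean) for p in points)
    within_ss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return (total_ss - within_ss) / total_ss

# Hypothetical (Grocery, Detergents_Paper, Milk) rows, not from the real data
customers = [
    (2000, 300, 1500), (2500, 450, 1800), (1800, 250, 1200),
    (9000, 4000, 6000), (9500, 4200, 6500), (8800, 3900, 5800),
    (25000, 12000, 15000), (27000, 13000, 16000),
]
labels, centers = kmeans(customers, k=3)
print(labels)
print(f"between_SS / total_SS = {explained_ratio(customers, labels, centers):.1%}")
```

K-means depends on its starting centers and can settle in a poor local optimum, so in practice it is worth running it with several seeds and keeping the run with the highest ratio.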

Finally, here's my interpretation of the clusters: 
What do you think? Please let me know. Happy learning!

References: 

https://archive.ics.uci.edu/ml/datasets/Wholesale+customers 
