Cluster analysis, a popular unsupervised learning technique, refers to not one but a variety of methods for grouping similar objects together. The aim is to create clusters (groups) such that objects in the same cluster are more similar to each other (high intra-cluster homogeneity) than to objects in other clusters (high inter-cluster heterogeneity).
Cluster analysis is tricky. I have heard some people, including some professors, say that they don't like cluster analysis, and I can understand why they feel that way, but I have come to realize that it may be the most interesting thing I have come across so far. If you know how to do it right, it can help you discover behavioral patterns that you never knew existed.
In this post, I will explore some basic cluster analysis methods.
Data set:
The data set I have used is the Wholesale customers data set from the UCI Machine Learning Repository. It contains the annual spending of the clients of a wholesale distributor on different product categories. I read somewhere that it's perhaps one of the best data sets to try for cluster analysis. Refer to the link for more details on the data set: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
Explore data:
- Channels contain more variability than the regions (Example graph above). Similar patterns emerge for other variables.
- Strong correlations are found between Grocery and Detergents_Paper (0.92), Milk and Grocery (0.73), and Milk and Detergents_Paper (0.66).
- Looking at various graphs such as the one above, we can roughly estimate that there are 3 clusters.
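For anyone who wants to reproduce correlation figures like the ones listed above, here is a minimal sketch of the Pearson correlation in plain Python. The helper names `pearson` and `correlation_pairs` are my own, and the numbers in `toy` are made up for illustration; they are not rows from the Wholesale customers data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_pairs(columns):
    """Return {(name_a, name_b): r} for every unordered pair of columns."""
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy values for illustration only, NOT taken from the real data set.
toy = {
    "Grocery": [10.0, 20.0, 30.0, 40.0],
    "Detergents_Paper": [1.0, 2.0, 3.5, 4.0],
    "Milk": [5.0, 3.0, 8.0, 6.0],
}
pairs = correlation_pairs(toy)
for (a, b), r in pairs.items():
    print(f"{a} & {b}: {r:.2f}")
```

With the real data you would pass one list per spending column instead of `toy`. Highly correlated pairs are worth noticing before clustering, because they carry overlapping information.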
Hierarchical clustering:
- The dendrograms from different hierarchical clusterings roughly split the data into 3 or 4 groups.
- Some observations seem to be separated out consistently (e.g. 86 and 87, even though it's hard to read here).
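To make the idea behind those dendrograms concrete, here is a rough sketch of agglomerative (bottom-up) clustering in plain Python. It is not the code used for the plots in this post. The function name `agglomerate`, the choice of Euclidean distance, and the toy points are all assumptions for illustration. Swapping the `linkage` argument between `max` (complete linkage) and `min` (single linkage) is one way in which "different hierarchical clusterings" can arise.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def agglomerate(points, n_clusters, linkage=max):
    """Naive agglomerative clustering.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    Returns a list of clusters, each a sorted list of row indices.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a, b in combinations(range(len(clusters)), 2):
            d = linkage(
                euclidean(points[i], points[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a (a < b, so index a stays valid).
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy points for illustration only: two tight pairs and one isolated point.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
groups = agglomerate(toy_points, n_clusters=3)
print(groups)
```

In this toy example the isolated point ends up in a cluster of its own under both linkages, which is the same kind of behavior as the observations that keep separating out in the dendrograms. This naive version is far too slow for large data, so a library routine is the better choice for real work.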
K-means:
I ran K-means with 3 clusters. I wanted to capture more variability in the data and found that running K-means on Grocery, Detergents_Paper and Milk captures 97% of the variability in the data.
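Here is a minimal sketch of what K-means does and of one common way to put a number on "variability captured". I am assuming the percentage means the ratio of between-cluster to total sum of squares; the post does not spell this out, so treat it as an assumption. The names `kmeans`, `explained_ratio` and `init_centers` are hypothetical, and the points are toy values, not the real data.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, k, init_centers=None, n_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centers)."""
    if init_centers is None:
        # Assumption: start from k randomly chosen observations.
        centers = random.Random(seed).sample(points, k)
    else:
        centers = list(init_centers)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = mean_point(members)
    return labels, centers


def explained_ratio(points, labels, centers):
    """Between-cluster sum of squares divided by total sum of squares."""
    grand = mean_point(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    if total_ss == 0:
        raise ValueError("ratio is undefined when all points are identical")
    within_ss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return (total_ss - within_ss) / total_ss


# Toy points for illustration only: three well-separated blobs.
toy_points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
labels, centers = kmeans(
    toy_points, k=3, init_centers=[(0, 0), (10, 10), (20, 0)]
)
print(labels, round(explained_ratio(toy_points, labels, centers), 3))
```

A ratio close to 1 means most of the spread in the data lies between clusters rather than inside them. K-means can settle into a poor local optimum depending on the starting centers, so it is worth trying several starts and keeping the best run.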
Finally, here's my interpretation of the clusters:
What do you think? Please let me know. Happy learning!