Saturday, October 10, 2015

A LOT to be explored in Big Data Analytics for AML

It is not my intention to digress from Analytics here, but between long work hours, deadlines and dealing with all sorts of other crap at work, I am finding it hard to make time for hands-on practice these days. So, I am mostly in reading mode, exploring more applications of Analytics and how Big Data Analytics is making its way into the thought process of every problem there is in this world.
Having worked in Compliance technologies for some time, I find Anti-Money Laundering (AML) quite interesting. So, I got to exploring more about it and came across some interesting articles.

Here's one that I found interesting. All credit for it goes to the author of the original article.

http://www.cio.com/article/2871684/big-data/how-big-data-analytics-can-help-track-money-laundering.html

Trade-Based Money Laundering (TBML) is a concept that has deep roots in India. I mean, just look around: we still live in a world of cash and kacha receipts, and for good reason, but it still poses the question, how do we combat the threat of terrorist financing in a system that is so manual?

Anyway, back to Analytics. What was interesting to me as a learner of Analytics, as I read more on this topic and thought about it, was the following:

  • AML is an excellent example where outlier detection and analysis is very helpful (see the sketch after this list)
  • TBML at a micro-economic level poses many additional complications related to data gathering as well as Text Analytics, because informal receipts don't have a fixed format, can be in any language and, worse, can be non-existent
  • Even for formalized trades monitored via customs, where data ingestion may not be such a big issue, data sharing and transparency across borders, as well as agreed international standards and calculations, are needed to achieve confidence in the accuracy of any type of predictive model
  • There is potential for IoT in trading. E.g. in the case of trades shipped via sea or air, the goods carrier could be smart enough to understand what goods it is carrying and detect anomalies
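
To make the outlier detection point a bit more concrete, here is a minimal sketch using scikit-learn's Isolation Forest on made-up transaction records; the column names, values and contamination level are purely illustrative assumptions, not a real AML workflow.

    # A minimal outlier-detection sketch on made-up trade records; column names
    # and values are illustrative assumptions only, not a real AML data model.
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    trades = pd.DataFrame({
        "invoice_amount": [1200, 950, 1100, 980, 45000, 1020, 870, 1150],
        "unit_price":     [12.0, 9.5, 11.0, 9.8, 450.0, 10.2, 8.7, 11.5],
    })

    # Isolation Forest flags records that look unusual relative to the rest
    model = IsolationForest(contamination=0.1, random_state=42)
    trades["flag"] = model.fit_predict(trades[["invoice_amount", "unit_price"]])

    # -1 marks suspected outliers, e.g. the over-invoiced record above
    print(trades[trades["flag"] == -1])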

I would love to hear other opinions about this. 

Saturday, October 3, 2015

Cheatsheet from Analytics Vidhya

I love cheat sheets! I make and use them all the time at work. So I couldn't help sharing this one when I came across it today.

Go to this blog, log in and download the cheat sheet. They even have a PDF version.
http://www.analyticsvidhya.com/blog/2015/09/full-cheatsheet-machine-learning-algorithms/

PS: If you don't already, follow Analytics Vidhya. They are amazing. :) 

Monday, May 18, 2015

Decision Tree - Bank Marketing data set (UCI Machine learning repository)

In the last post, we briefly explored the bank marketing data set from UCI Machine learning repository:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

In this post, let's explore some simple classification techniques.


Naive Rule: 

As we mentioned in the previous post, the response rate in this data set is 11.7%. The naive rule here would be to classify all customers as non-subscribers, since 88.3% of customers in the training set were non-subscribers. Obviously, we are not going to use this for modeling, but we may use it later for evaluating the performance of other classifiers.
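
As a rough sketch of what that baseline looks like in code (assuming the UCI bank-full.csv file, semicolon-separated, is available locally; the file path is an assumption):

    import pandas as pd

    bank = pd.read_csv("bank-full.csv", sep=";")

    # Naive rule: predict the majority class ("no") for every customer
    majority_class = bank["y"].value_counts().idxmax()
    baseline_accuracy = (bank["y"] == majority_class).mean()
    print(majority_class, round(baseline_accuracy, 3))   # roughly "no", 0.883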


My take on performance considerations: 

Before we move on to building more models, I want to talk a little bit more about the issue of unequal classes. Here we are dealing with asymmetric classes, i.e. it is more important to correctly predict a potential subscriber than a non-subscriber. Why? Because the cost of making a call is likely much lower than the cost of misclassifying a potential subscriber as a non-subscriber. In other words, we want high sensitivity and a low false negative rate (equivalently, a high negative predictive value).



Naive Bayes:

Naive Bayes uses Bayes' theorem to compute the (conditional) probability that a record belongs to an output class, given a set of predictor variables.
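
A minimal sketch of how this could look in scikit-learn, assuming the same bank-full.csv file; using only the one-hot-encoded categorical predictors with a Bernoulli Naive Bayes model is my simplification, not necessarily what produced the results below.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB

    bank = pd.read_csv("bank-full.csv", sep=";")

    # Keep only the categorical predictors and one-hot encode them
    cats = bank.select_dtypes(include="object").drop(columns=["y"])
    X = pd.get_dummies(cats)
    y = (bank["y"] == "yes").astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1, stratify=y)

    nb = BernoulliNB().fit(X_train, y_train)
    print("Test accuracy:", round(nb.score(X_test, y_test), 3))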


Decision Tree:

We used CART with a cost matrix. We assumed a FN:FP cost ratio of 10:1 in the final model, though we also tried a few other ratios. As we suspected, the tree produced using a cost matrix was clearly much better than the one built without it.
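
The post used CART with an explicit cost matrix; the closest thing I can sketch in scikit-learn is a decision tree with class weights, which is a rough approximation of the 10:1 FN:FP cost ratio rather than the exact same procedure.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix

    bank = pd.read_csv("bank-full.csv", sep=";")
    X = pd.get_dummies(bank.drop(columns=["y"]))
    y = (bank["y"] == "yes").astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1, stratify=y)

    # Weight the positive (subscriber) class ~10x to mimic a 10:1 FN:FP cost ratio
    tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, max_depth=5, random_state=1)
    tree.fit(X_train, y_train)
    print(confusion_matrix(y_test, tree.predict(X_test)))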


Performance Results:



Gains Chart: 

Using the CART decision tree model, we got lifts of 2.9 and 2.3 in the first and second deciles respectively. The lift curve falls off beyond the 5th decile though; clearly, there is a lot of room for improvement.
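
For reference, decile lifts like these can be computed from predicted probabilities along the following lines; tree, X_test and y_test are assumed to come from the decision-tree sketch above.

    import pandas as pd

    scores = pd.DataFrame({
        "actual": y_test.values,
        "prob": tree.predict_proba(X_test)[:, 1],
    })

    # Decile 1 = the 10% of customers with the highest predicted probability
    scores["decile"] = pd.qcut(
        scores["prob"].rank(method="first", ascending=False), 10, labels=range(1, 11))

    overall_rate = scores["actual"].mean()
    lift = scores.groupby("decile")["actual"].mean() / overall_rate
    print(lift.round(2))   # lift per decile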




So, this was my attempt at the bank marketing data set. I hope you are able to take away something positive from it.

Please share your thoughts or critique...have a discussion if you will. Just don't forget to leave a comment. :)

Exploring Bank Marketing data set (UCI Machine Learning repository)

The bank marketing data set is a well-known data set, available at:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). We will work with the older version of the data set (which does not contain the economic parameters).


Clearly, this is a classification problem and we are interested in identifying the clients that are likely to be subscribers (the yes class).


In this post, we are going to focus on Data Exploration. The next post will discuss some basic classification techniques.

First things first, we looked at the data distribution, both overall and within the subscriber population. The response rate (% of subscribers) is 11.70%, and like most marketing data sets, this one too is unbalanced.

Some other observations include the following (a small sketch after this list shows how such figures can be computed):
  • 52% of the clients who subscribed were married
  • 99.02% of the clients who subscribed did not have any credit default
  • 63% of subscribers did not have housing loans and 93% did not have a personal loan (low-risk investors)
  • 82.61% of subscribers were contacted via cellular
  • 71.33% of contacts made were between May and Aug. Within the subscribers, 52.68% of contacts were made between these months.
  • 63.98% of subscribers were not contacted in past campaigns, so poutcome is unknown for these subscribers
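
Here is a minimal sketch of how observations like these can be derived with value counts and crosstabs, assuming the bank-full.csv file (the older version, without the economic indicators) is available locally.

    import pandas as pd

    bank = pd.read_csv("bank-full.csv", sep=";")
    subs = bank[bank["y"] == "yes"]

    print(bank["y"].value_counts(normalize=True))          # response rate ~11.7%
    print(subs["marital"].value_counts(normalize=True))    # share of married subscribers
    print(subs["contact"].value_counts(normalize=True))    # cellular vs. telephone vs. unknown
    print(pd.crosstab(bank["month"], bank["y"], normalize="columns"))  # contacts by month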

The data is ordered from May 2008 to Nov 2010, but there is no year variable in the data set, so we added it. We can see that the conversion rate in 2010 was much higher than in previous years, although the number of calls went down.
We also derived a day-of-the-week variable from the day, month and year variables and added it.
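
One possible way to derive these two variables (my own approach; the original may have differed): since the records are in chronological order, the year can be incremented every time the month number drops, and the day of the week then follows from day, month and the derived year.

    import pandas as pd

    bank = pd.read_csv("bank-full.csv", sep=";")
    month_map = {m: i + 1 for i, m in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"])}
    month_num = bank["month"].map(month_map)

    # Records run from May 2008 to Nov 2010; a drop in month number means a new year
    bank["year"] = 2008 + (month_num.diff() < 0).cumsum()

    # Day of week from day, month and the derived year
    dates = pd.to_datetime(pd.DataFrame({
        "year": bank["year"], "month": month_num, "day": bank["day"]}))
    bank["day_of_week"] = dates.dt.day_name()
    print(bank[["day", "month", "year", "day_of_week"]].head())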


Duration:

When duration = 0, y = no; there are 3 records where duration = 0.
The median duration of the last contact is higher for subscribers than for non-subscribers. However, keep in mind that duration is not known until the call is over, so it is not really available at prediction time.


Null values: 

There are no null values, but some categorical variables have a category called 'unknown'. We will treat these as a valid category for now.


Reference: 

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing 

Cluster analysis

Cluster analysis, a popular unsupervised learning technique, refers not to one but to various methods of grouping similar objects together. The aim is to create clusters/groups such that objects in the same group are more similar to each other (i.e. intra-cluster homogeneity is high) than to objects in other clusters (i.e. inter-cluster heterogeneity is high).

Cluster analysis is tricky. I have heard some people, including some professors, say that they don't like cluster analysis, and I can understand why they feel that way, but I have come to realize that it may be the most interesting thing I have come across so far. If you know how to do it right, it can help you discover behavioral patterns that you never knew existed.

In this post, I will explore some basic cluster analysis methods. 

Data set: 

The data set I have used is called the Wholesale customers data set, from the UCI Machine Learning repository. It contains the annual spending of clients of a wholesale distributor on different product categories. I read somewhere that it's perhaps one of the best data sets to try for cluster analysis. Refer to the link for more details on the data set: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers

Explore data: 

  • The channels show more variability than the regions (example graph above). Similar patterns emerge for other variables.
  • Strong correlations are found between Grocery & Detergents_Paper (0.92), Milk & Detergents_Paper (0.66) and Milk & Grocery (0.73)
  • Looking at various graphs such as the one above, we can roughly estimate 3 clusters (a quick exploration sketch follows this list)
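
A quick sketch of this exploration step, assuming the UCI file "Wholesale customers data.csv" is available locally:

    import pandas as pd

    wholesale = pd.read_csv("Wholesale customers data.csv")
    spend = wholesale.drop(columns=["Channel", "Region"])

    # Pairwise correlations between the spending categories
    print(spend.corr().round(2))   # e.g. Grocery vs. Detergents_Paper around 0.92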



Hierarchical clustering:




  • Dendrograms from different hierarchical clusterings roughly categorize the data into 3 or 4 groups (a sketch follows this list)
  • Some observations seem to be separated out consistently (e.g. 86 and 87, even though it's hard to read here)
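
A minimal hierarchical clustering sketch with SciPy; Ward linkage is my choice here, and the original analysis may well have compared several linkage methods.

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    wholesale = pd.read_csv("Wholesale customers data.csv")
    spend = wholesale.drop(columns=["Channel", "Region"])

    Z = linkage(spend, method="ward")
    dendrogram(Z)          # visually suggests roughly 3-4 groups
    plt.show()

    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(pd.Series(labels).value_counts())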

K-means: 

Running K-means with k = 3:
I wanted to capture more variability in the data and found that if we run K-means on Grocery, Detergents_Paper and Milk, we can capture about 97% of the variability in the data.
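
Here is a sketch of that K-means run; the "variability captured" figure is computed as the between-cluster sum of squares over the total sum of squares (the ratio R's kmeans reports), which is my interpretation of what the original figure refers to.

    import pandas as pd
    from sklearn.cluster import KMeans

    wholesale = pd.read_csv("Wholesale customers data.csv")
    X = wholesale[["Grocery", "Detergents_Paper", "Milk"]]

    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

    total_ss = ((X - X.mean()) ** 2).sum().sum()
    between_ss = total_ss - km.inertia_      # inertia_ is the within-cluster sum of squares
    print("Variability captured:", round(between_ss / total_ss, 3))
    print(pd.Series(km.labels_).value_counts())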

Finally, here's my interpretation of the clusters: 
What do you think? Please let me know. Happy learning!

References: 

https://archive.ics.uci.edu/ml/datasets/Wholesale+customers 

Sunday, April 5, 2015

Why analytics for me?

Why am I excited about data analytics? First of all, who isn't? It has opened up such a promising era of opportunities that no one wants to miss. For some, it promises an exciting career where their work would hold the power to make a real difference. For some, it's a train that they are afraid to miss because they don't want to be left behind. At some level, many people don't have a choice. Then why am I pondering this? It's because, well, mine is a third reason, and if you graduated from an Indian university with a comp. sci. engineering degree around the time I did, it might resonate with you.

Back in my engineering days, around the 7th or 8th semester, I came across the term 'data mining'. I think it was in advanced databases. Anyway, there was a whole... paragraph... about it, and it was very interesting, but there was very little information. Unable to find anything about it in the library, I went to the newly opened internet facility at the college and applied my best Google skills of the time, but couldn't find much. Google was already the answer to everything by then, but you see, back in 2004, internet speed and access in India were far behind the western world. Even today, it's worse than it should be for a country that prides itself as an IT hub, but back then, public access to the Internet was not so widespread. And the speed...oh my...was quite terrible. Anyway, I tried this for a couple of days, but then the realities of student life took over and eventually, it became one of the things we hoped to see more of in the 'near future'.

Now, I am not saying that the information was not there but I simply couldn't get my hands on it. And even if I had, I am not sure what I would have done with it at the time. But I was certainly intrigued by that one paragraph of information.

Fast forward to 2012. I had been working for almost a decade, grown, traveled and seriously started to contemplate what I want to do more of. I like coding, but not as the end, rather as a means to an end. At the same time, I don't want to distance myself from the technology field; I am still a computer science lover at the core. So, I am really interested in the application of technology (yawwn...so cliched, I know!) but with a slight bent towards computation (wait...what? :)).

Basically, for a long time, I have tried to find the middle ground between the extreme choices that I thought I faced. Finally, I have my answer: the field of Analytics is for me, whether as a career or as a hobby.

Thursday, March 19, 2015

What's with statisticians and matrices starting with the letter C?

In preliminary or even exploratory data analysis, you know, when you are just trying to understand the data at hand, you find the covariance when you want to check for a (linear) relationship between two variables. No wait, wasn't that correlation? Where's my middle school stats book?

Ok, so you need to know your C's if you want to get anywhere with stats.


Covariance vs. Correlation:

Both covariance and correlation tell us whether variables are positively or negatively related. A positive value indicates that the variables move together in the same direction (i.e. both increase or decrease together, or the slope is +ve). A negative value indicates that the variables move in opposite directions (i.e. when one increases, the other decreases, or the slope is -ve).
Correlation also tells you the strength of this relationship, i.e. it gives you the degree to which the two variables move together. It always takes a value between -1 and 1, with zero meaning no (linear) relationship.

So, how?


1) Variance: a measure of dispersion within a variable (this is the same variance that we study with SD, or Standard Deviation, where Variance = SD²)


Variance = Var(x) = σ² = 1/n∑(x-xₘ)²         where x = values in the variable, xₘ = mean of x values


2) Covariance: how two variables x and y are related (x and y have the same number of values, because each is a vector, i.e. a set of observations)


Covariance = Cov(x,y) = σxy = 1/n∑(x-xₘ)(y-yₘ)  


3) Correlation: how two variables are related and to what degree (between -1 and 1)


Pearson's Correlation coefficient = Cov(x,y) / sqrt(Var(x) · Var(y))

   
When there are more dimensions, you can create matrices for covariance and correlation. 

Covariance vs. Correlation matrices:



Covariance matrix:

    Var(x1)        Cov(x1,x2)   ……   Cov(x1,xc)
    Cov(x2,x1)     Var(x2)      ……   Cov(x2,xc)
    ……             ……           ……   ……
    Cov(xc,x1)     Cov(xc,x2)   ……   Var(xc)

Correlation matrix:

    1              Corr(x1,x2)  ……   Corr(x1,xc)
    Corr(x2,x1)    1            ……   Corr(x2,xc)
    ……             ……           ……   ……
    Corr(xc,x1)    Corr(xc,x2)  ……   1
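
A quick numerical illustration with NumPy and made-up data (note that NumPy divides by n-1 by default, whereas the formulas above divide by n):

    import numpy as np

    x1 = np.array([2.0, 4.0, 6.0, 8.0])
    x2 = np.array([1.0, 3.0, 2.0, 5.0])

    print(np.cov(x1, x2))        # 2x2 covariance matrix: variances on the diagonal
    print(np.corrcoef(x1, x2))   # 2x2 correlation matrix: 1s on the diagonal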

Confusion matrix: While we are on C's and matrices, let's also mention the "Confusion matrix" or "Classification matrix", which is basically a table that contains counts of actual vs. predicted classifications. E.g. if you have n observations (records) that you classify into, say, 2 classes, 1 and 0, the confusion matrix would look like the table below.

This one is not used for finding relationships between variables but rather to check the accuracy of models in classification problems.

1 = Success      Predicted 1            Predicted 0            Total
Actual 1         True Positive = a      False Negative = b     a + b
Actual 0         False Positive = c     True Negative = d      c + d
Total            a + c                  b + d                  a + b + c + d = n
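
A small sketch of how the a, b, c, d counts and the usual rates fall out of scikit-learn's confusion_matrix, with made-up labels:

    from sklearn.metrics import confusion_matrix

    actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
    predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

    # With labels=[1, 0] the layout matches the table above: rows = actual, columns = predicted
    (a, b), (c, d) = confusion_matrix(actual, predicted, labels=[1, 0])

    sensitivity = a / (a + b)              # true positive rate
    specificity = d / (c + d)              # true negative rate
    accuracy    = (a + d) / (a + b + c + d)
    print(a, b, c, d, round(sensitivity, 2), round(specificity, 2), round(accuracy, 2))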

I also saw a different version of the confusion matrix in hypothesis testing scenarios.

Decision \ Population    Ho True              Ha True
Accept Ho                Correct decision     Type II Error
Reject Ho                Type I Error         Correct decision