Thursday, March 19, 2015

What's with statisticians and matrices starting with letter C?

In preliminary or even in exploratory data analysis, you know, when you are just trying to understand the data at hand, you find the covariance when you want to check for a (linear) relationship between two variables. No wait, wasn't that correlation? Where's my middle school stats book?

Ok, so you need to know your C's if you want to get anywhere with stats.


Covariance vs. Correlation:

Both covariance and correlation tell us whether two variables are positively or negatively related. A positive value indicates that the variables move together in the same direction (i.e. both increase or decrease together, or the slope is +ve). A negative value indicates that the variables move in opposite directions (i.e. when one increases, the other decreases, or the slope is -ve).
Correlation also tells you the strength of this relationship, i.e. it gives you the degree to which the two variables move together. It always takes a value between -1 and 1, with zero meaning no (linear) relation.

So, how?


1) Variance: a measure of dispersion within a single variable (this is the same variance we study alongside SD, or Standard Deviation, where Variance = SD²)


Variance = Var(x) = σ² = (1/n) ∑(x − xₘ)²         where x = values in the variable, xₘ = mean of the x values
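The formula above can be sketched in a few lines of plain Python (the data values here are just an illustration):

```python
def variance(x):
    """Population variance: Var(x) = (1/n) * sum((x - xm)^2)."""
    n = len(x)
    xm = sum(x) / n                              # mean of x
    return sum((xi - xm) ** 2 for xi in x) / n   # average squared deviation

print(variance([2, 4, 4, 4, 5, 5, 7, 9]))  # mean is 5, variance is 4.0
```

Note this is the population variance (dividing by n); sample variance divides by n − 1 instead.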


2) Covariance: how two variables x and y are related (x and y must each have the same number of values, since each is a vector, or set, of observations)


Covariance = Cov(x,y) = σxy = (1/n) ∑(x − xₘ)(y − yₘ)
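The covariance formula translates just as directly; a minimal sketch with made-up vectors:

```python
def covariance(x, y):
    """Population covariance: Cov(x, y) = (1/n) * sum((x - xm) * (y - ym))."""
    n = len(x)                                   # x and y must be the same length
    xm, ym = sum(x) / n, sum(y) / n              # means of x and y
    return sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / n

# y moves in the same direction as x, so the covariance is positive:
print(covariance([1, 2, 3], [2, 4, 6]))
```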


3) Correlation: how two variables are related and to what degree (between -1 and 1)


Pearson's correlation coefficient = Cov(x,y) / sqrt(Var(x) · Var(y))
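Putting the three formulas together, here is a small self-contained sketch of Pearson's coefficient (again with illustrative data):

```python
import math

def correlation(x, y):
    """Pearson's r = Cov(x, y) / sqrt(Var(x) * Var(y))."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    cov  = sum((a - xm) * (b - ym) for a, b in zip(x, y)) / n
    varx = sum((a - xm) ** 2 for a in x) / n
    vary = sum((b - ym) ** 2 for b in y) / n
    return cov / math.sqrt(varx * vary)

print(correlation([1, 2, 3], [2, 4, 6]))   # perfectly linear -> 1.0
print(correlation([1, 2, 3], [6, 4, 2]))   # perfectly opposite -> -1.0
```

Because the covariance is divided by the product of the standard deviations, the scale of the variables cancels out, which is why r is always confined to [-1, 1].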

   
When there are more dimensions, you can create matrices for covariance and correlation. 

Covariance vs. Correlation matrices:



Covariance matrix:

Var(x1)        Cov(x1,x2)   ...   Cov(x1,xc)
Cov(x2,x1)     Var(x2)      ...   Cov(x2,xc)
...
Cov(xc,x1)     Cov(xc,x2)   ...   Var(xc)

Correlation matrix:

1              Corr(x1,x2)  ...   Corr(x1,xc)
Corr(x2,x1)    1            ...   Corr(x2,xc)
...
Corr(xc,x1)    Corr(xc,x2)  ...   1
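If numpy is available, both matrices come for free; a quick sketch with made-up data (note that np.cov divides by n − 1 by default, the sample version, unlike the 1/n formula above):

```python
import numpy as np

# Three variables, five observations each (rows = variables, numpy's default).
data = np.array([
    [2.1,  2.5,  3.6,  4.0,  4.4],   # x1
    [8.0, 10.0, 12.0, 14.0, 16.0],   # x2
    [1.0,  0.8,  0.9,  0.5,  0.3],   # x3
])

cov  = np.cov(data)       # covariance matrix: variances on the diagonal
corr = np.corrcoef(data)  # correlation matrix: 1's on the diagonal

print(np.round(corr, 2))
```

Both matrices are symmetric, since Cov(x, y) = Cov(y, x).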

Confusion matrix: While we are on C's and matrices, let's also mention the "Confusion matrix" (or "Classification matrix"), which is basically a table containing counts of actual vs. predicted classifications. E.g. if you have n observations (records) that you classify into, say, 2 classes, 1 and 0, the confusion matrix would look like this.

This is not used for finding a relationship between variables but rather to check the accuracy of models in classification problems.

1 = Success      Predicted 1           Predicted 0           Total
Actual 1         True Positive = a     False Negative = b    a + b
Actual 0         False Positive = c    True Negative = d     c + d
Total            a + c                 b + d                 a + b + c + d = n
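The counts a, b, c, d above are just tallies over the actual and predicted labels; a minimal sketch with toy labels:

```python
# Toy actual vs. predicted labels (1 = success), purely for illustration.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))
a = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
b = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
c = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
d = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (a + d) / (a + b + c + d)               # correct / n
print(a, b, c, d, accuracy)
```

Other common metrics fall out of the same four counts, e.g. precision = a / (a + c) and recall = a / (a + b).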

I also saw a different version of the confusion matrix in hypothesis testing scenarios.

Decision \ Population    Ho True             Ha True
Accept Ho                Correct decision    Type II Error
Reject Ho                Type I Error        Correct decision
