Thursday, March 19, 2015

What's with statisticians and matrices starting with letter C?

In preliminary or even in exploratory data analysis, you know, when you are just trying to understand the data at hand, you find the covariance when you want to check for a (linear) relationship between two variables. No wait, wasn't that correlation? Where's my middle school stats book?

Ok, so you need to know your C's if you want to get anywhere with stats.


Covariance vs. Correlation:

Both covariance and correlation tell us if variables are positively or negatively related. A positive value indicates that the variables move together in the same direction (i.e. both increase or decrease together, or the slope is positive). A negative value indicates that the variables move in opposite directions (i.e. when one increases, the other decreases, or the slope is negative).
Correlation also tells you the strength of this relationship, i.e. it gives you the degree to which the two variables move together. It always takes a value between -1 and 1, with zero meaning no linear relationship.

So, how?


1) Variance: measure of dispersion within a variable (This is the same variance that we study with SD or Standard Deviation where Variance = Square(SD))


Variance = Var(x) = σ² = (1/n) ∑ (x - xₘ)²         where x = values in the variable, xₘ = mean of x values, n = number of values
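To make the formula concrete, here is a minimal sketch in plain Python. The function name `variance` and the sample values are made up for illustration. It divides by n exactly as the formula above does; many textbooks and libraries divide by n - 1 instead when estimating variance from a sample.

```python
def variance(x):
    """Population variance: (1/n) * sum of squared deviations from the mean."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / n


# Illustrative data (not from any real dataset); the mean is 5.
x = [2, 4, 4, 4, 5, 5, 7, 9]

var_x = variance(x)   # squared deviations sum to 32, and 32 / 8 = 4.0
sd_x = var_x ** 0.5   # the standard deviation is the square root of the variance

print("Var(x) =", var_x)
print("SD(x)  =", sd_x)
```

So Variance = Square(SD) holds here: the SD is 2.0 and the variance is 4.0.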


2) Covariance: how two variables x and y are related (x and y each hold the same number of values, because each is a vector or set of observations taken on the same records)


Covariance = Cov(x,y) = σxy = (1/n) ∑ (x - xₘ)(y - yₘ)         where xₘ, yₘ = means of the x and y values
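A small sketch of the covariance formula, again in plain Python with made-up data. The helper names `mean` and `covariance` are my own. It shows the sign flipping when y moves against x, and that the covariance of a variable with itself is just its variance.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Population covariance: (1/n) * sum((x_i - mean_x) * (y_i - mean_y))."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises as x rises
y_down = [10, 8, 6, 4, 2]   # falls as x rises

cov_up = covariance(x, y_up)       # positive: the variables move together
cov_down = covariance(x, y_down)   # negative: they move in opposite directions
var_x = covariance(x, x)           # Cov(x, x) is the same thing as Var(x)

print(cov_up, cov_down, var_x)
```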


3) Correlation: how two variables are related and to what degree (between -1 and 1)


Pearson's Correlation coefficient = Corr(x,y) = Cov(x,y) / sqrt(Var(x) · Var(y))
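Here is a rough sketch of why correlation is the one that tells you about strength. The function names and the numbers are illustrative assumptions. Changing the units of x (multiplying by 100) blows up the covariance but leaves the correlation untouched, because the division by sqrt(Var(x) · Var(y)) cancels the scale.

```python
from math import sqrt


def covariance(x, y):
    """Population covariance, dividing by n as in the formulas above."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n


def correlation(x, y):
    """Pearson's r = Cov(x, y) / sqrt(Var(x) * Var(y)).

    Note: this divides by zero if either variable is constant (zero variance).
    """
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_scaled = [100 * xi for xi in x]   # same variable in different units

cov_xy = covariance(x, y)
cov_scaled = covariance(x_scaled, y)   # 100 times larger than cov_xy
r_xy = correlation(x, y)
r_scaled = correlation(x_scaled, y)    # same as r_xy, up to rounding

print(cov_xy, cov_scaled)
print(r_xy, r_scaled)
```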

   
When there are more than two variables (dimensions), you can collect all the pairwise covariances and correlations into matrices.

Covariance vs. Correlation matrices:



Covariance matrix:

Var(x1)        Cov(x1,x2)     ...    Cov(x1,xc)
Cov(x2,x1)     Var(x2)        ...    Cov(x2,xc)
...
Cov(xc,x1)     Cov(xc,x2)     ...    Var(xc)

Correlation matrix:

1              Corr(x1,x2)    ...    Corr(x1,xc)
Corr(x2,x1)    1              ...    Corr(x2,xc)
...
Corr(xc,x1)    Corr(xc,x2)    ...    1
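The two matrices can be built by looping over every pair of variables. Below is a minimal sketch with three made-up variables. The function names are my own, and it uses the 1/n formulas from above. Libraries such as NumPy offer `numpy.cov` and `numpy.corrcoef` for this; note that `numpy.cov` divides by n - 1 by default, so its numbers would differ slightly from these.

```python
from math import sqrt


def covariance(x, y):
    """Population covariance, dividing by n."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n


def covariance_matrix(columns):
    """Entry [i][j] is Cov(x_i, x_j); the diagonal holds the variances."""
    return [[covariance(a, b) for b in columns] for a in columns]


def correlation_matrix(columns):
    """Entry [i][j] is Cov(x_i, x_j) / (SD(x_i) * SD(x_j))."""
    cov = covariance_matrix(columns)
    size = len(columns)
    sd = [sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)]
            for i in range(size)]


# Three illustrative variables with five observations each.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
x3 = [5, 4, 3, 2, 1]   # exact mirror image of x1

data = [x1, x2, x3]
cov = covariance_matrix(data)
corr = correlation_matrix(data)

for row in cov:
    print([round(value, 3) for value in row])
for row in corr:
    print([round(value, 3) for value in row])
```

Both matrices are symmetric, since Cov(x1,x2) = Cov(x2,x1), and the correlation matrix has 1 all along the diagonal.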

Confusion matrix: While we are on C's and matrices, let's also mention the "Confusion matrix" or "Classification matrix", which is basically a table that contains counts of actual vs. predicted classifications. It is not used for finding relationships between variables but rather to check the accuracy of models in classification problems.

E.g. if you have n observations (records) that you classify into 2 classes, say 1 and 0, the confusion matrix would be:

(1 = Success)    Predicted 1           Predicted 0           Total
Actual 1         True Positive = a     False Negative = b    a + b
Actual 0         False Positive = c    True Negative = d     c + d
Total            a + c                 b + d                 a + b + c + d = n
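As a quick sketch of how the four cells get filled, the snippet below counts a, b, c and d from two lists of 0/1 labels and then computes accuracy as (a + d) / n. The function name and the labels are invented for illustration, and it assumes the only classes are 1 and 0.

```python
def confusion_counts(actual, predicted):
    """Return (a, b, c, d) = (TP, FN, FP, TN) for labels coded as 1 and 0."""
    a = b = c = d = 0
    for act, pred in zip(actual, predicted):
        if act == 1 and pred == 1:
            a += 1   # true positive
        elif act == 1 and pred == 0:
            b += 1   # false negative
        elif act == 0 and pred == 1:
            c += 1   # false positive
        elif act == 0 and pred == 0:
            d += 1   # true negative
        else:
            raise ValueError("labels must be 0 or 1")
    return a, b, c, d


# Illustrative labels only, not output from a real model.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

a, b, c, d = confusion_counts(actual, predicted)
n = a + b + c + d
accuracy = (a + d) / n   # share of records on the correct diagonal

print("a, b, c, d =", a, b, c, d)
print("accuracy   =", accuracy)
```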

I also saw a different version of the confusion matrix in hypothesis testing scenarios.

Decision \ Population    Ho True             Ha True
Accept Ho                Correct decision    Type II Error
Reject Ho                Type I Error        Correct decision
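The same table can be written as a tiny lookup, which I find easier to remember. This is only an illustrative sketch; the function name and the boolean parameters are my own.

```python
def decision_outcome(h0_is_true, reject_h0):
    """Map the state of the population and the decision to a cell of the table."""
    if h0_is_true and reject_h0:
        return "Type I Error"    # rejected Ho although Ho was true
    if not h0_is_true and not reject_h0:
        return "Type II Error"   # kept Ho although Ha was true
    return "Correct decision"


# Walk through all four cells of the table.
for h0_is_true in (True, False):
    for reject_h0 in (False, True):
        print(h0_is_true, reject_h0, decision_outcome(h0_is_true, reject_h0))
```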

Wednesday, March 18, 2015

Mis-statistics

Original article: http://tvtropes.org/pmwiki/pmwiki.php/Main/LiesDamnedLiesAndStatistics

We are surrounded by stats. Advertisers and Marketers have used, abused and misused statistics in mischievously creative ways and mostly gotten away with it. I suddenly had this urge to look up more well known examples and though not surprised, I am definitely more amused. Next time anyone throws a number at me, they better be well-prepared and telling the truth. :)

The original article talks about many examples that reiterate that statistics are just numbers that mean nothing without context; I am just picking two that I found amusing. So, read on...

This well-known saying is part of a phrase often attributed to Benjamin Disraeli and popularized in the U.S. by Mark Twain:

"There are three kinds of falsehoods: lies, damned lies, and statistics."

1) "You are more likely to die on the toilet than be eaten by a shark." When you compare how much time you spend around sharks versus how much time you spend around toilets ... really, the toilet has time to plan out its move in advance. 

  • Same deal with most accidents occurring in the home. Considering that you spend the majority of your time in your home, this should come as no surprise to anyone.
  • The same goes for the claim that most vehicular accidents occur near the home (some say "within 25 miles from your home"). This is because most people do most of their driving near their homes, not because the home or the surrounding area is more dangerous than areas distant from the home.
  • At some Reform Judaism synagogues, a popular "joke" to lead into the sermon is, "x% of deaths occur in a hospital, x% of deaths occur in a car, x% of deaths happen in the home...[continues on for a while] while there have been only three deaths in a synagogue, and no deaths ever reported while studying Torah! Clearly, the safest passion, therefore, is studying Torah."
2) Nine out of Ten Doctors Agree that the phrase "Nine out of Ten Doctors Agree" has been practically a stock phrase in advertising since the early 20th century. 
  • "Nine out of ten dentists recommend Trident for their patients who chew gum." The tenth dentist was insistent that his patients never chew gum at all, but surprisingly, Trident didn't want you to know about that.
  • One interesting case happened in Portugal, where two ads were being broadcast on national TV during the same period (and sometimes even in the same commercial break) claiming, respectively, that '90% of dentists use toothpaste X' and '8 out of 10 dentists recommend toothpaste Y to their family'. Together, if you stop to think about it, they imply something is not quite right about those professionals' concern over their own family...
  • Or that an awful lot of dentists are unmarried orphans, hence can't recommend it to a family they haven't got.
  • In a similar vein, a commercial for Five Hour Energy states that 4 out of 5 doctors wish for their patients who use energy supplements to use low-calorie energy supplements. Think about that: They specify patients who already use energy supplements, meaning they didn't count any doctors who recommend that their patients not use energy supplements at all.