Correlation Tutorial

4223 days ago by comphy

Correlation

Correlation is a measure of the linear relationship between two random variables.  I find that it is helpful to understand correlation as a sort of angle between two vectors of data.  

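Notice that the sample covariance is, up to a factor of $1/(n-1)$, the dot product of the centered data vectors.  By the same analogy, the sample standard deviation is the norm of the centered sample, scaled by $1/\sqrt{n-1}$.  (It is sometimes loosely called the standard error, although that term properly refers to the standard deviation of an estimator.)  The scaling factors cancel in the correlation.

$$\mbox{Let }\{X_i\}, \{Y_i\}\mbox{ be two samples of }n\mbox{ observations with sample means }\bar{X}, \bar{Y}\mbox{, and let } \vec{X}=\langle X_1-\bar{X}, \dots, X_n-\bar{X}\rangle, \  \vec{Y}=\langle Y_1-\bar{Y}, \dots, Y_n-\bar{Y}\rangle$$

$$Cov(X, Y)=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y}) = \frac{\vec{X}\cdot\vec{Y}}{n-1}$$

$$Var(X)=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2 = \frac{\|\vec{X}\|^2}{n-1}\mbox{ and }\sigma_X=\frac{\|\vec{X}\|}{\sqrt{n-1}}$$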

$$Corr(X,Y)=\frac{Cov(X,Y)}{\sigma_X\sigma_Y}=\frac{\vec{X}\cdot\vec{Y}}{\|\vec{X}\|\cdot\|\vec{Y}\|}=\frac{\|\vec{X}\|\cdot\|\vec{Y}\|\cdot\cos\theta}{\|\vec{X}\|\cdot\|\vec{Y}\|}=\cos\theta$$

Here θ is the angle between the two centered data vectors.
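The angle interpretation can be checked numerically. The following is a minimal sketch in plain NumPy (outside the Sage demo); the seed, the sample size of 500, and the slope of 0.8 are arbitrary choices made for illustration, not values from the demo.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(size=500)  # illustrative linear dependence plus noise

# Center each sample; the centered samples play the role of the data vectors.
xc = x - x.mean()
yc = y - y.mean()

# Cosine of the angle between the centered vectors.
cos_theta = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Sample correlation as computed by NumPy.
r = np.corrcoef(x, y)[0, 1]

theta_deg = np.degrees(np.arccos(cos_theta))

print("cosine of angle :", cos_theta)
print("numpy.corrcoef  :", r)
print("angle in degrees:", theta_deg)
```

The first two printed values should agree to within floating-point rounding. A correlation near 1 corresponds to an angle near 0 degrees, a correlation of 0 to a right angle, and a correlation near -1 to an angle near 180 degrees.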

Instructions for demo:

The demo is divided into three interactive boxes so that it is possible to change the parameters without resampling.  These are recognized by their gray backgrounds, in contrast to the white backgrounds that hold the input commands.

First box: The explanatory variable
This is where x, the explanatory variable, is set up. You may choose its distribution.  Koop suggests a uniform distribution, but the normal is included here as well.  Changing the distribution automatically refreshes the data, but it is also possible to refresh it by pressing the X button. Note: Any changes made in boxes one or two will not be reflected until the graph is redrawn. Changing the sample size requires that the noise be refreshed as well.

Second box: The noise, or error term
This button resamples the variable ep (short for epsilon).  This variable is standard normal, but is scaled in the third box, below.

Third Box: The plot thickens!
This is where the dynamic experimentation begins.  Keeping the sample as it is, you are able to manipulate the slope of the dependence ($\beta$), its intercept ($\alpha$), and the standard deviation of the distribution of the noise.  The square of the latter (the noise variance) is the reciprocal of what Koop terms "precision".

$$y=\alpha + \beta x +(noise)\epsilon$$

$$\epsilon \sim N(0,1)$$
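For reference, the population correlation implied by this model can be written in closed form. This is a short derivation sketch, not part of the original demo. It assumes that x and ε are independent, writes $\sigma_x$ for the standard deviation of x, and uses $s$ (notation introduced here) for the value of the noise slider.

$$Cov(x,y)=\beta\,\sigma_x^2, \qquad Var(y)=\beta^2\sigma_x^2+s^2$$

$$Corr(x,y)=\frac{Cov(x,y)}{\sigma_x\sqrt{Var(y)}}=\frac{\beta\,\sigma_x}{\sqrt{\beta^2\sigma_x^2+s^2}}$$

The intercept $\alpha$ drops out entirely, and with $s=0$ (and $\beta\neq 0$) the correlation is exactly $\pm 1$. The sample correlation printed by the demo fluctuates around this value.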

Some things to notice: 

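  • Correlation varies with the ratio of beta to noise, but not linearly.
  • A different noise or X sample will change the correlation.  Does this effect lessen as sample size increases?
  • Changing X from uniform to normal will tend to change the correlation significantly, because the standard normal has a larger spread (standard deviation 1) than the uniform on [0, 1) (standard deviation $1/\sqrt{12}\approx 0.29$).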
  • Changing X's distribution will also change the shape of the scattering about the regression line, with the normal producing more of the classic elliptical pattern.  The latter may be truer to sample observations, while the uniform (looking more like a road) may better represent controlled observations.
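The observations above can also be explored outside the interactive demo. The sketch below uses plain NumPy and assumes x is independent of the noise, in which case the population correlation is $\beta\sigma_x/\sqrt{\beta^2\sigma_x^2+noise^2}$. The helper names `theoretical_corr` and `sample_corr`, the seed, and the choice of 200 resamples are illustrative assumptions, not part of the original worksheet.

```python
import numpy as np

def theoretical_corr(beta, noise, sd_x):
    """Population correlation of x and y = alpha + beta*x + noise*eps,
    assuming x and eps are independent and beta, noise are not both zero."""
    return beta * sd_x / np.sqrt(beta**2 * sd_x**2 + noise**2)

def sample_corr(rng, dist, n, beta=1.0, noise=1.0, alpha=0.0):
    """Draw one sample the way the demo does and return its correlation."""
    x = rng.random(n) if dist == "uniform" else rng.standard_normal(n)
    eps = rng.standard_normal(n)
    y = alpha + beta * x + noise * eps
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

# Standard deviations of the two choices for x offered by the demo.
sd = {"uniform": 1 / np.sqrt(12), "normal": 1.0}

# Same beta and noise, different distribution for x.
for dist in ("uniform", "normal"):
    print(dist, theoretical_corr(1.0, 1.0, sd[dist]))

# Doubling beta does not double the correlation.
print(theoretical_corr(1.0, 1.0, 1.0), theoretical_corr(2.0, 1.0, 1.0))

# Spread of the sample correlation over 200 resamples, for each sample size.
spread = {n: np.std([sample_corr(rng, "normal", n) for _ in range(200)])
          for n in (100, 1000, 10000)}
print(spread)
```

With beta and noise both equal to 1, the formula gives $1/\sqrt{13}\approx 0.277$ for uniform x and $1/\sqrt{2}\approx 0.707$ for standard normal x. The last dictionary gives a rough answer to the question about sample size: the spread of the sample correlation across resamples should shrink as the sample size grows.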

NOTE: The code can be interactively edited beyond the parameters established by the interactive controls.  The code may be re-evaluated by pressing shift+enter.

```python
import numpy

@interact
def _A(dist=selector(['normal', 'uniform'], label='Distribution for explanatory variable:'),
       s=selector(['X'], label='Refresh Explanatory Variable', buttons=True),
       Sample_Size=slider(100, 10000, 100, 1000)):
    # x, N and ep are globals shared with the two cells below
    global x
    global N
    global ep
    N = Sample_Size
    if dist == 'uniform':
        x = numpy.random.rand(N)
    if dist == 'normal':
        x = numpy.random.randn(N)
    # resample the noise so that its length matches the new sample size
    ep = numpy.random.randn(N)
```
       


```python
@interact
def _B(s=selector(['Refresh Noise'], label='Refresh Noise', buttons=True)):
    global ep
    ep = numpy.random.randn(N)
```
       


```python
@interact
def _C(noise=slider(0, 2, .05), beta=slider(-3, 3, .05, 1), alpha=slider(-2, 2, .1, 0),
       s=selector(['Redraw'], buttons=True, label='')):
    y = beta*x + noise*ep + alpha
    C = numpy.cov(x, y)       # 2x2 sample covariance matrix
    c = numpy.corrcoef(x, y)  # 2x2 sample correlation matrix
    M = x.max()
    m = x.min()
    # list(...) so that the points are a real list under Python 3 as well
    fig = scatter_plot(list(zip(x, y)), markersize=5, marker='x', edgecolor='green')
    # the noise-free line y = alpha + beta*x, extended slightly past the data
    fig += line([(m-.2, beta*(m-.2)+alpha), (M+.2, beta*(M+.2)+alpha)], color='yellow', thickness=2)
    print('Correlation is %s' % c[0][1])
    print('Covariance is %s' % C[0][1])
    print('Standard deviation of x: %s' % sqrt(C[0][0]))
    print('Standard deviation of y: %s' % sqrt(C[1][1]))
    show(fig)
```
       
