Correlation Coefficient – Explanation and Examples

The correlation coefficient of a set of data is a number between $-1$ and $1$ that shows how strongly, and in which direction, two variables in the data are linearly associated.

A number closer to $0$ indicates little or no linear association. A number closer to $1$ indicates a positive correlation, while a number closer to $-1$ indicates a negative correlation.

Correlation coefficients are widely used in statistical analysis. The higher the absolute value of the correlation coefficient, the stronger the association between two variables.

This section covers:

  • What is a Correlation Coefficient?
  • Correlation Coefficient Definition
  • How to find Correlation Coefficient

What is a Correlation Coefficient?

A correlation coefficient is a number that shows how strongly two variables are associated. The closer the absolute value of the coefficient is to 1, the stronger the association between the two variables.

Specifically, values closer to $1$ indicate a strong positive association, while values closer to $-1$ indicate a strong negative association. That is, when the value is closer to $1$, the dependent variable tends to increase as the independent variable increases. When the value is closer to $-1$, the dependent variable tends to decrease as the independent variable increases.

When the correlation coefficient is closer to $0$, it indicates little or no linear association between the two variables.

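A correlation coefficient greater than $0.8$ or less than $-0.8$ is generally considered strong, although the exact cutoff varies by field. Strength is not the same as statistical significance, which also depends on the number of data points.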

Correlation Coefficient Definition

A correlation coefficient is a number between $1$ and $-1$ that shows how associated two variables are. Usually this number is denoted as $r$.

Random data will have values closer to $0$, data that follow an increasing linear trend will have values closer to $1$, and data that follow a decreasing linear trend will have values closer to $-1$.

How to Find Correlation Coefficient

The correlation coefficient is meaningful for bivariate quantitative data, that is, data in which each observation consists of two numerical values. For example, height and shoe size or temperature and humidity are bivariate quantitative data.

The correlation coefficient shows how closely the data follow a linear relationship. Data with a strong but non-linear pattern can still have a coefficient near $0$.
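To illustrate that the coefficient tracks linear association specifically, here is a minimal Python sketch. The helper name `pearson_r` and the data are made up for illustration, and the helper uses the $z$-score method described later in this section. The points lie exactly on the parabola $y=x^2$, a perfect but non-linear relationship, yet the coefficient comes out at essentially $0$.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation coefficient from z-scores (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Illustrative data: every point lies on the curve y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(abs(r) < 1e-9)  # True: no linear association despite a perfect curve
```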

Calculating this number by hand is certainly possible, but it does take a significant amount of time, especially as the number of data points increases.

To calculate $r$ for bivariate data with independent variable $x$ and dependent variable $y$:

  1. Calculate the mean of all the $x$ values, $\bar{x}$.
    If there are $n$ data points, the mean is $\bar{x}=\frac{\sum\limits_{k=1}^n x_k}{n}$. That is, the sum of all the $x$ terms divided by the total number of terms.
  2. Calculate the mean of all the $y$ values, $\bar{y}$.
    If there are $n$ data points, the mean is $\bar{y}=\frac{\sum\limits_{k=1}^n y_k}{n}$. That is, the sum of all the $y$ terms divided by the total number of terms.
  3. Calculate the standard deviation of all the $x$ terms, $s_x$.
    The standard deviation is $s_x=\sqrt{\frac{\sum\limits_{k=1}^n (x_k-\bar{x})^2}{n-1}}$. This is a complicated-looking formula, but it simply finds how much the typical data point deviates from the mean.
  4. Calculate the standard deviation of all the $y$ terms, $s_y$.
    The standard deviation is $s_y=\sqrt{\frac{\sum\limits_{k=1}^n (y_k-\bar{y})^2}{n-1}}$. Again, this is a complicated-looking formula, but remember it just finds how much the typical data point deviates from the mean.
  5. Calculate the $z$-score for each of the $x$ terms. The $z$-score (also known as the standard score) is equal to $\frac{x-\bar{x}}{s_x}$. This number makes data from different samples easy to compare.
  6. Similarly, calculate the $z$-score for the $y$ terms. This is equal to $\frac{y-\bar{y}}{s_y}$.
  7. Finally, calculate the correlation coefficient $r$ as the sum of the products of corresponding $z$-scores divided by $n-1$. That is, multiply the $z$-score of each $x$ value by the $z$-score of the corresponding $y$ value. Then, add together those products and divide by $1$ less than the total number of terms.
    Put another way, $r=\frac{\sum\limits_{k=1}^n z_{x_k}z_{y_k}}{n-1}$.
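The seven steps above can be sketched in Python as follows. The function name `correlation_coefficient` and the height and shoe-size numbers are illustrative assumptions, not data from this lesson.

```python
import math

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r via the 7-step z-score method."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")

    # Steps 1 and 2: the means of the x and y values.
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Steps 3 and 4: the sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when one variable is constant")

    # Steps 5 and 6: the z-score of every x and y value.
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]

    # Step 7: sum the products of paired z-scores and divide by n - 1.
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Made-up data for illustration only.
heights = [150, 160, 170, 180, 190]
shoe_sizes = [36, 38, 41, 43, 46]

r = correlation_coefficient(heights, shoe_sizes)
print(round(r, 3))
```

The function rejects a constant variable because its standard deviation is $0$, which would make the $z$-scores undefined.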

Linear Regressions

Correlation coefficients are related to linear regressions, which are sometimes called “lines of best fit.” A regression line is the line that best approximates the data.

The correlation coefficient shows how well the regression line fits the data. A larger absolute value of the correlation coefficient indicates a better fit.

In fact, the correlation coefficient for data that perfectly fits the line of best fit will be $1$ or $-1$ (depending on whether the line has a positive or negative slope). A large sample of truly random data, on the other hand, will have a value very close to $0$.
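These three cases can be checked with a short Python sketch. The helper name `pearson_r`, the two lines, and the seeded random sample are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation coefficient from z-scores (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

xs = list(range(10))
rising = [2 * x + 1 for x in xs]      # every point on a line with positive slope
falling = [-0.5 * x + 4 for x in xs]  # every point on a line with negative slope

# A large sample in which x and y are generated independently of each other.
rng = random.Random(0)
noise_x = [rng.random() for _ in range(10_000)]
noise_y = [rng.random() for _ in range(10_000)]

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_noise = pearson_r(noise_x, noise_y)

print(round(r_rising, 6), round(r_falling, 6))  # 1.0 -1.0
print(abs(r_noise) < 0.05)                      # True: very close to 0
```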

History of Correlation Coefficient

Examples

This section covers common problems involving correlation coefficients and their step-by-step solutions.

Example 1

Use the 7-step method above to prove that the set of data shown (for which every point falls on the positive linear regression line) has a correlation coefficient of $1$.

Solution

The points shown all fall on the line. It is required to show that the correlation coefficient, $r$, for this data set is $1$.

First, determine the $x$ and $y$ values for each point. Notice that since the points fall on the line, the $y$ value is easy to calculate given the $x$ value.

Thus, the actual first step in this case requires finding the equation for the line. It passes through the point $(0, 1)$, which is the $y$ intercept. It also passes through the point $(6, 6)$.

Therefore, the slope is $m=\frac{y_2-y_1}{x_2-x_1}$. In this case, $m= \frac{6-1}{6-0} = \frac{5}{6}$.

Then the equation of the line in slope-intercept form is $y=mx+b$, which is $y=\frac{5}{6}x+1$.

The given points are $(0, 1), (3, 3.5), (6, 6), (7, \frac{41}{6}),$ and $(10, \frac{56}{6})$.

Therefore, the mean of the $x$-values is $\frac{ 0+3+6+7+10 }{5} = \frac{26}{5}$.

Similarly, the mean of the $y$-values is $\frac{ 1+3.5+6+\frac{41}{6}+\frac{56}{6} }{5} = \frac{10.5+ \frac{97}{6}}{5}$. This simplifies to $\frac{ \frac{ 63 }{6} + \frac{ 97 }{6} }{5} = \frac{160}{6} \times \frac{1}{5}$. Finally, this further simplifies to $\frac{ 32 }{6}$ or $\frac{16}{3}$.
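As a quick check of the arithmetic so far, the following Python sketch rebuilds the $y$ values from the line $y=\frac{5}{6}x+1$ and computes both means exactly with the standard `fractions` module. The variable names are illustrative.

```python
from fractions import Fraction

slope, intercept = Fraction(5, 6), Fraction(1)

xs = [0, 3, 6, 7, 10]
ys = [slope * x + intercept for x in xs]  # y-values on the line y = (5/6)x + 1

x_bar = Fraction(sum(xs), len(xs))
y_bar = sum(ys) / len(ys)

print([str(y) for y in ys])  # ['1', '7/2', '6', '41/6', '28/3']
print(x_bar, y_bar)          # 26/5 16/3
```

Note that $\frac{28}{3}$ is $\frac{56}{6}$ in lowest terms.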

Standard Deviation of $x$

Now, it is required to find the standard deviation of the $x$ values, $s_x$. That requires finding the squared difference between each of the $x$ terms and $\bar{x}$.

$(0-\frac{26}{5})^2 = \frac{ 676 }{25}$.

$(3-\frac{26}{5})^2 = \frac{ 121 }{25}$.

$(6-\frac{26}{5})^2 = \frac{ 16 }{25}$.

$(7-\frac{26}{5})^2 = \frac{ 81 }{25}$.

$(10-\frac{26}{5})^2 = \frac{ 576 }{25}$.

Now, recall $s_x=\sqrt{\frac{\sum\limits_{k=1}^n (x_k-\bar{x})^2}{n-1}}$. In this case, that is:

$s_x=\sqrt{\frac{\frac{676}{25}+\frac{121}{25}+\frac{16}{25}+\frac{81}{25}+\frac{576}{25}}{4}}$.

This simplifies to $\sqrt{\frac{ 147 }{10}} \approx 3.83$.
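The simplification can be verified with exact fractions. This is a minimal Python sketch with illustrative variable names.

```python
import math
from fractions import Fraction

xs = [0, 3, 6, 7, 10]
n = len(xs)
x_bar = Fraction(sum(xs), n)  # 26/5

# Squared deviation of each x value from the mean, kept as exact fractions.
squared_devs = [(x - x_bar) ** 2 for x in xs]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(squared_devs) / (n - 1)
s_x = math.sqrt(variance)

print([str(d) for d in squared_devs])  # ['676/25', '121/25', '16/25', '81/25', '576/25']
print(variance)                        # 147/10
print(round(s_x, 2))                   # 3.83
```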

The $z$-scores for the $x$ values are therefore:

$\frac{ 0- \frac{26}{5} }{\sqrt{\frac{147}{10}}} \approx -1.36$.

$\frac{ 3- \frac{26}{5} }{\sqrt{\frac{147}{10}}} \approx -0.57$.

$\frac{ 6- \frac{26}{5} }{\sqrt{\frac{147}{10}}} \approx 0.21$.

$\frac{ 7- \frac{26}{5} }{\sqrt{\frac{147}{10}}} \approx 0.47$.

$\frac{ 10- \frac{26}{5} }{\sqrt{\frac{147}{10}}} \approx 1.25$.
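The $z$-scores of the $x$ values can be double-checked numerically. In this Python sketch, `statistics.stdev` is the sample standard deviation (it divides by $n-1$), matching the formula for $s_x$ used above. A useful sanity check is that the $z$-scores of any sample sum to $0$, so they can never all be negative.

```python
from statistics import mean, stdev

xs = [0, 3, 6, 7, 10]

x_bar = mean(xs)  # 5.2
s_x = stdev(xs)   # sample standard deviation, about 3.83

z_x = [(x - x_bar) / s_x for x in xs]

for x, z in zip(xs, z_x):
    print(f"x = {x:>2}: z = {z:+.2f}")

print(abs(sum(z_x)) < 1e-9)  # True: the z-scores sum to zero
```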

Standard Deviation of $y$

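Similarly, calculate the standard deviation of the $y$ values, $s_y$, starting with the squared difference between each of the $y$ terms and $\bar{y}=\frac{16}{3}$.

$(1-\frac{16}{3})^2 = \frac{169}{9}$.

$(3.5-\frac{16}{3})^2 = \frac{121}{36}$.

$(6-\frac{16}{3})^2 = \frac{4}{9}$.

$(\frac{41}{6}-\frac{16}{3})^2 = \frac{9}{4}$.

$(\frac{56}{6}-\frac{16}{3})^2 = 16$.

The sum of these is $\frac{1470}{36}=\frac{245}{6}$, so $s_y=\sqrt{\frac{245}{6} \times \frac{1}{4}}=\sqrt{\frac{245}{24}} \approx 3.20$.

Because every point lies on the line, each $y$ term deviates from $\bar{y}$ by exactly $\frac{5}{6}$ of the corresponding $x$ deviation, and $s_y=\frac{5}{6}s_x$. The $z$-scores for $y$ are therefore identical to the $z$-scores for $x$, so each product $z_{x_k}z_{y_k}$ equals $z_{x_k}^2$. The sum of the squared $z$-scores of a sample is always $n-1$, which here is $4$, so $r=\frac{4}{4}=1$, as required.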
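The final result of Example 1 can also be checked numerically. This Python sketch runs the remaining steps (the $z$-scores for $y$ and the sum of products) on the five points from the example. The variable names are illustrative, and floating-point arithmetic means the result is $1$ only up to rounding.

```python
from statistics import mean, stdev

xs = [0, 3, 6, 7, 10]
ys = [5 / 6 * x + 1 for x in xs]  # points on the line y = (5/6)x + 1
n = len(xs)

# z-scores of x and y, using sample standard deviations (n - 1).
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
z_x = [(x - x_bar) / s_x for x in xs]
z_y = [(y - y_bar) / s_y for y in ys]

# Sum of the products of paired z-scores, divided by n - 1.
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

print(round(s_y, 2))  # 3.2
print(round(r, 6))    # 1.0
```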

Example 2

Find the z-scores for the following data points.

Solution

Example 3

Find the correlation coefficient for this set of data.

Solution

Example 4

Interpret a correlation coefficient in context.

Solution

Example 5

Use a calculator to find the correlation coefficient. Then, interpret the correlation coefficient in context.

Solution

Practice Problems

Answer Key
