How to Calculate Correlation in Python

One way to quantify the relationship between two variables is to use the Pearson correlation coefficient, which measures the linear association between two variables. It always takes on a value between -1 and 1, where:

  • -1 indicates a perfectly negative linear correlation between two variables
  • 0 indicates no linear correlation between two variables
  • 1 indicates a perfectly positive linear correlation between two variables

The further away the correlation coefficient is from zero, the stronger the linear relationship between the two variables.
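
To make the definition concrete, here is a minimal sketch that computes the coefficient directly from its standard formula: the sum of the products of the deviations from each mean, divided by the square root of the product of the summed squared deviations. The helper name pearson_r and the small sample lists are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from the textbook definition (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical sample data chosen to hit the three reference cases
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in lockstep with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])    # y falls in lockstep with x
r_zero = pearson_r(x, [1, -1, 0, -1, 1])  # no linear trend in y

print(r_pos, r_neg, r_zero)
```

Up to floating-point rounding, the three values should come out as 1, -1, and 0, matching the three cases listed above.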

This tutorial explains how to calculate the correlation between variables in Python.

How to Calculate Correlation in Python

To calculate the correlation between two variables in Python, we can use the NumPy corrcoef() function.

import numpy as np

np.random.seed(100)

#create array of 50 random integers between 0 and 9 (the upper bound 10 is exclusive)
var1 = np.random.randint(0, 10, 50)

#create a positively correlated array with some random noise
var2 = var1 + np.random.normal(0, 10, 50)

#calculate the correlation between the two arrays
np.corrcoef(var1, var2)

[[ 1. 0.335]
[ 0.335 1. ]]

We can see that the correlation coefficient between these two variables is 0.335, which indicates a fairly weak positive correlation.

By default, this function produces a matrix of correlation coefficients. If we only wanted to return the correlation coefficient between the two variables, we could use the following syntax:

np.corrcoef(var1, var2)[0,1]

0.335
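
The same matrix layout extends to more than two variables. The sketch below is an illustration under assumptions: the variable names x, y, z and the seed are arbitrary and not from the original tutorial. It stacks three arrays and reads one pairwise coefficient out of the result. Entry [i, j] holds the correlation between variable i and variable j, so the matrix is symmetric with ones on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

x = rng.normal(size=100)
y = x + rng.normal(size=100)    # built from x, so it should correlate with x
z = rng.normal(size=100)        # generated independently of x and y

# Each row passed to corrcoef is treated as one variable (the default, rowvar=True)
corr = np.corrcoef([x, y, z])

print(corr.shape)               # one row and one column per variable
r_xy = corr[0, 1]               # correlation between x (index 0) and y (index 1)
print(r_xy)
```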

To test whether this correlation is statistically significant, we can calculate the p-value associated with the Pearson correlation coefficient by using the SciPy pearsonr() function, which returns the Pearson correlation coefficient along with the two-tailed p-value.

from scipy.stats import pearsonr

pearsonr(var1, var2)

(0.335, 0.017398)

The correlation coefficient is 0.335 and the two-tailed p-value is 0.017. Since this p-value is less than 0.05, we would conclude that the correlation between the two variables is statistically significant at the 0.05 level.
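
For readers curious where the p-value comes from, the following sketch reproduces it by hand under the usual assumption of a two-sided test with n - 2 degrees of freedom. It uses the standard t-statistic for a correlation coefficient, t = r * sqrt((n - 2) / (1 - r^2)). The data, the seed, and the names t_stat and p_manual are illustrative assumptions. SciPy may compute the value through a different but equivalent route, so this is a cross-check rather than a description of its internals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed, used only for reproducibility

# Hypothetical data built the same way as in the tutorial: integers plus noise
x = rng.integers(0, 10, size=50)
y = x + rng.normal(0, 10, size=50)

# Coefficient and two-tailed p-value as reported by SciPy
r, p = stats.pearsonr(x, y)

# Manual two-tailed p-value from the t distribution with n - 2 degrees of freedom
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(r, p, p_manual)
```

The two p-values should agree to within floating-point error.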

If you’re interested in calculating the correlation between several variables in a pandas DataFrame, you can simply use the .corr() method.

import pandas as pd

data = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=['A', 'B', 'C'])
data

  A B C
0 8 0 9
1 4 0 7
2 9 6 8
3 1 8 1
4 8 0 8

#calculate correlation coefficients for all pairwise combinations
data.corr()

          A         B         C
A  1.000000 -0.415999  0.895854
B -0.415999  1.000000 -0.727279
C  0.895854 -0.727279  1.000000

And if you’re only interested in calculating the correlation between two specific variables in the DataFrame, you can specify the variables:

data['A'].corr(data['B'])

-0.415999
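
Because the DataFrame above is filled with random integers, a rerun may not reproduce the same table. The sketch below is an illustration, not the tutorial's own code. It types in the five rows shown earlier by hand so the result is reproducible, then cross-checks the pandas result against NumPy. The names corr, r_ab, and np_corr are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# The five rows displayed in the table above, entered explicitly
data = pd.DataFrame({
    "A": [8, 4, 9, 1, 8],
    "B": [0, 0, 6, 8, 0],
    "C": [9, 7, 8, 1, 8],
})

corr = data.corr()         # pairwise Pearson correlations (the default method)
r_ab = corr.loc["A", "B"]  # pick one pair out of the matrix by label

# Same computation in NumPy: rowvar=False treats each column as a variable
np_corr = np.corrcoef(data.to_numpy(), rowvar=False)

print(corr)
print(r_ab)
```

Reading a pair out of the matrix with .loc gives the same number as calling .corr() on the two columns directly. The second form avoids computing the full matrix when only one pair is needed.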
