Friday, July 5, 2024
How to Perform a Box-Cox Transformation in Python

A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.

The basic idea behind this method is to find the value of λ for which the transformed data is as close to normally distributed as possible, using the following formula:

  • y(λ) = (y^λ - 1) / λ  if λ ≠ 0
  • y(λ) = log(y)  if λ = 0
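As a rough illustration, the formula can be written directly in NumPy. The helper name boxcox_manual and the sample values below are hypothetical, chosen only for this sketch. Note that the branch is selected by λ, not by y, and that the power branch approaches log(y) as λ shrinks toward 0, which is why the log is used in that case.

```python
import numpy as np

def boxcox_manual(y, lam):
    """Hypothetical helper: apply the Box-Cox formula for a fixed lambda."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        # limiting case of the power branch as lambda -> 0
        return np.log(y)
    return (y**lam - 1) / lam

# illustrative positive values (not from the article's dataset)
y = np.array([0.5, 1.0, 2.0, 4.0])

# lambda = 1 only shifts the data by 1
shifted = boxcox_manual(y, 1.0)

# a very small lambda gives nearly the same result as the log branch
near_zero = boxcox_manual(y, 1e-6)
log_branch = boxcox_manual(y, 0)
print(np.allclose(near_zero, log_branch, atol=1e-4))
```

For any λ, a value of y = 1 maps to 0, which is a quick sanity check on an implementation like this one.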

We can perform a Box-Cox transformation in Python by using the scipy.stats.boxcox() function, which requires all input values to be strictly positive.
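Because the transformation is only defined for positive values, scipy.stats.boxcox() raises a ValueError when the input contains zeros or negative numbers. The sketch below shows that behavior and one common workaround, adding a constant offset before transforming. The dataset and the choice of offset are assumptions made for illustration; they are not part of the example that follows.

```python
import numpy as np
from scipy.stats import boxcox

np.random.seed(0)

# illustrative data that includes negative values
values = np.random.exponential(size=100) - 0.5

rejected = False
try:
    boxcox(values)
except ValueError as err:
    # SciPy refuses non-positive input
    rejected = True
    print("boxcox rejected the input:", err)

# assumed workaround: shift the data so its minimum becomes 1
shift = 1 - values.min()
shifted_transformed, shifted_lambda = boxcox(values + shift)
print(shifted_lambda)
```

Keep in mind that the offset changes which λ is optimal, so the same offset has to be recorded and reused when transforming new data or inverting the transformation.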

The following example shows how to use this function in practice.

Example: Box-Cox Transformation in Python

Suppose we generate a random set of 1,000 values that come from an exponential distribution:

#load necessary packages
import numpy as np 
from scipy.stats import boxcox 
import seaborn as sns 

#make this example reproducible
np.random.seed(0)

#generate dataset
data = np.random.exponential(size=1000)

#plot the distribution of data values
sns.kdeplot(data)

We can see that the distribution is strongly right-skewed and does not appear to be normal.

We can use the boxcox() function to find an optimal value of lambda that produces a more normal distribution:

#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data) 

#plot the distribution of the transformed data values
sns.kdeplot(transformed_data)

[Figure: density plot of the Box-Cox transformed data]

We can see that the transformed data is much closer to a normal distribution.
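The visual impression can be backed up with numbers. The sketch below regenerates the same dataset and compares the sample skewness and the Shapiro-Wilk statistic before and after the transformation. Using scipy.stats.skew and scipy.stats.shapiro here is our own choice of diagnostics, not something the plots depend on.

```python
import numpy as np
from scipy.stats import boxcox, shapiro, skew

# same setup as the example above
np.random.seed(0)
data = np.random.exponential(size=1000)
transformed_data, best_lambda = boxcox(data)

# skewness near 0 indicates a symmetric distribution
skew_before = skew(data)
skew_after = skew(transformed_data)
print("skewness before:", skew_before)
print("skewness after: ", skew_after)

# Shapiro-Wilk W statistic: values closer to 1 are more consistent with normality
w_before, p_before = shapiro(data)
w_after, p_after = shapiro(transformed_data)
print("Shapiro-Wilk W before:", w_before)
print("Shapiro-Wilk W after: ", w_after)
```

A skewness much closer to 0 and a W statistic closer to 1 after the transformation both point in the same direction as the density plot.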

We can also find the exact lambda value used to perform the Box-Cox transformation:

#display optimal lambda value
print(best_lambda)

0.2420131978174143

The optimal lambda was found to be roughly 0.242.

Thus, each data value was transformed using the following equation:

New = (old^0.242 - 1) / 0.242

We can confirm this by looking at the values from the original data compared to the transformed data:

#view first five values of original dataset
data[0:5]

array([0.79587451, 1.25593076, 0.92322315, 0.78720115, 0.55104849])

#view first five values of transformed dataset
transformed_data[0:5]

array([-0.22212062,  0.23427768, -0.07911706, -0.23247555, -0.55495228])

The first value in the original dataset was 0.79587. Thus, we applied the following formula to transform this value:

New = (0.79587^0.242 - 1) / 0.242 = -0.222

We can confirm that the first value in the transformed dataset is indeed -0.222.
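The same check can be run over the whole array instead of a single value. The sketch below regenerates the dataset, applies the formula by hand with the fitted lambda, and compares the result to the output of boxcox(). It also maps the transformed values back to the original scale with scipy.special.inv_boxcox, which is an extra step we add for illustration.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# same setup as the example above
np.random.seed(0)
data = np.random.exponential(size=1000)
transformed_data, best_lambda = boxcox(data)

# apply the Box-Cox formula by hand using the fitted lambda
manual = (data**best_lambda - 1) / best_lambda
print(np.allclose(manual, transformed_data))

# undo the transformation to recover the original values
recovered = inv_boxcox(transformed_data, best_lambda)
print(np.allclose(recovered, data))
```

Being able to invert the transformation is useful when a model is fit on the transformed scale but results need to be reported in the original units.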

Additional Resources

How to Create & Interpret a Q-Q Plot in Python
How to Perform a Shapiro-Wilk Test for Normality in Python
