In statistics, multicollinearity occurs when two or more predictor variables are highly correlated with each other, such that they do not provide unique or independent information in the regression model.
If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the regression model.
The most extreme case of multicollinearity is known as perfect multicollinearity. This occurs when at least two predictor variables have an exact linear relationship between them.
For example, suppose we have a dataset in which the values of predictor variable x2 are simply the values of x1 multiplied by 2.
This is an example of perfect multicollinearity.
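A minimal sketch of such a dataset in R (the specific values here are hypothetical) might look like this:

#hypothetical dataset in which x2 is always exactly twice x1
df <- data.frame(y  = c(11, 15, 18, 22, 27),
                 x1 = c(2, 4, 5, 7, 9),
                 x2 = c(4, 8, 10, 14, 18))

#the correlation between x1 and x2 is exactly 1
cor(df$x1, df$x2)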
The Problem with Perfect Multicollinearity
When perfect multicollinearity is present in a dataset, the method of ordinary least squares is unable to produce estimates for regression coefficients.
This is because it’s not possible to estimate the marginal effect of one predictor variable (x1) on the response variable (y) while holding another predictor variable (x2) constant, since x2 changes by an exactly proportional amount whenever x1 changes.
In short, perfect multicollinearity makes it impossible to estimate a value for every coefficient in a regression model.
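To see why, note that if x2 = 2*x1, then b1*x1 + b2*x2 = (b1 + 2*b2)*x1, so infinitely many pairs of coefficients produce exactly the same fitted values. Equivalently, the design matrix is rank-deficient, which a short check on the hypothetical data frame sketched above makes visible:

#design matrix with an intercept column (df is the hypothetical data frame from above)
X <- cbind(1, df$x1, df$x2)

#X has 3 columns but only rank 2, so X'X is singular and cannot be inverted
qr(X)$rank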
How to Handle Perfect Multicollinearity
The simplest way to handle perfect multicollinearity is to drop one of the variables that has an exact linear relationship with another variable.
For example, in our previous dataset we could simply drop x2 as a predictor variable.
We would then fit a regression model using x1 as a predictor variable and y as the response variable.
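Continuing the hypothetical sketch from above, this amounts to nothing more than:

#drop x2 and fit the model using x1 as the only predictor
model <- lm(y ~ x1, data = df)
summary(model)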
Examples of Perfect Multicollinearity
The following examples show the three most common scenarios of perfect multicollinearity in practice.
1. One Predictor Variable is a Multiple of Another
Suppose we want to use “height in centimeters” and “height in meters” to predict the weight of a certain species of dolphin.
Suppose our dataset records each dolphin’s weight along with its height in both meters and centimeters. Notice that the value for “height in centimeters” is simply the value of “height in meters” multiplied by 100. This is a case of perfect multicollinearity.
If we attempt to fit a multiple linear regression model in R using this dataset, we won’t be able to produce a coefficient estimate for the “meters” predictor variable:
#define data
df <- data.frame(weight=c(400, 460, 470, 475, 490, 440, 430, 490, 500, 540),
                 m=c(1.3, .7, .6, 1.3, 1.2, 1.5, 1.2, 1.6, 1.1, 1.4),
                 cm=c(130, 70, 60, 130, 120, 150, 120, 160, 110, 140))

#fit multiple linear regression model
model <- lm(weight ~ m + cm, data=df)

#view summary of model
summary(model)

Call:
lm(formula = weight ~ m + cm, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.501 -25.501   5.183  19.499  68.590 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  458.676     53.403   8.589 2.61e-05 ***
m              9.096     43.473   0.209    0.839    
cm                NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.9 on 8 degrees of freedom
Multiple R-squared:  0.005442,  Adjusted R-squared:  -0.1189 
F-statistic: 0.04378 on 1 and 8 DF,  p-value: 0.8395
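To confirm which term is redundant and then refit without it, one possibility (reusing the df and model objects defined above) is:

#show which coefficient is aliased, i.e. an exact linear combination of the others
alias(model)

#drop the redundant "cm" predictor and refit using meters only
model2 <- lm(weight ~ m, data=df)
summary(model2)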
2. One Predictor Variable is a Transformed Version of Another
Suppose we want to use “points” and “scaled points” to predict the rating of basketball players.
Let’s assume that the variable “scaled points” is calculated as:
Scaled points = (points – μ) / σ, where μ and σ are the mean and standard deviation of “points.”
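In R, this standardization can be written out manually or done with the built-in scale() function; a quick sketch with hypothetical values:

#hypothetical points values
pts <- c(17, 19, 24, 29, 33)

#manual standardization
scaled_pts <- (pts - mean(pts)) / sd(pts)

#equivalent result using scale(), which centers and scales by default
scaled_pts <- as.numeric(scale(pts))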
Suppose our dataset contains each player’s rating, points, and scaled points. Notice that each value for “scaled points” is simply a standardized version of “points.” This is a case of perfect multicollinearity.
If we attempt to fit a multiple linear regression model in R using this dataset, we won’t be able to produce a coefficient estimate for the “scaled points” predictor variable:
#define data
df <- data.frame(rating=c(88, 83, 90, 94, 96, 78, 79, 91, 90, 82),
                 pts=c(17, 19, 24, 29, 33, 15, 14, 29, 25, 22))

#create the standardized "scaled points" variable
df$scaled_pts <- (df$pts - mean(df$pts)) / sd(df$pts)

#fit multiple linear regression model
model <- lm(rating ~ pts + scaled_pts, data=df)

#view summary of model
summary(model)

Call:
lm(formula = rating ~ pts + scaled_pts, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4932 -1.3941 -0.2935  1.3055  5.8412 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  67.4218     3.5896  18.783 6.67e-08 ***
pts           0.8669     0.1527   5.678 0.000466 ***
scaled_pts        NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.953 on 8 degrees of freedom
Multiple R-squared:  0.8012,  Adjusted R-squared:  0.7763 
F-statistic: 32.23 on 1 and 8 DF,  p-value: 0.0004663
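As in the first example, the simplest fix is to drop the redundant term and refit; a sketch reusing the df defined above:

#drop the redundant "scaled_pts" predictor and refit using raw points only
model2 <- lm(rating ~ pts, data=df)
summary(model2)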
3. The Dummy Variable Trap
Another scenario where perfect multicollinearity can occur is known as the dummy variable trap. This is when we want to use a categorical variable in a regression model and convert it into “dummy variables” that take on values of 0 or 1.
For example, suppose we would like to use predictor variables “age” and “marital status” to predict income.
To use “marital status” as a predictor variable, we first need to convert it into dummy variables.
To do so, we can let “Single” be our baseline value, since it occurs most often, and create 0/1 indicator variables for “Married” and “Divorced.”
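In R, one way to sketch this encoding (variable names here are hypothetical) is to create indicators only for the non-baseline categories, or to store marital status as a factor and let lm() build the dummies itself:

#hypothetical marital status values
status <- c("Single", "Married", "Single", "Divorced", "Married")

#0/1 indicators for the non-baseline categories only ("Single" is the baseline)
married  <- as.numeric(status == "Married")
divorced <- as.numeric(status == "Divorced")

#alternatively, lm() creates the k-1 dummies automatically when given a factor, e.g.
#lm(income ~ age + factor(status), data = df)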
A mistake would be to instead create three new dummy variables, one for each category: “Single,” “Married,” and “Divorced.”
In this case, the three dummy variables are redundant: together with the model’s intercept, the “Married” and “Divorced” variables completely determine the value of “Single.” This is an example of perfect multicollinearity.
If we attempt to fit a multiple linear regression model in R using this dataset, we won’t be able to produce a coefficient estimate for every predictor variable:
#define data
df <- data.frame(income=c(45, 48, 54, 57, 65, 69, 78, 83, 98, 104, 107),
                 age=c(23, 25, 24, 29, 38, 36, 40, 59, 56, 64, 53),
                 single=c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0),
                 married=c(0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1),
                 divorced=c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0))

#fit multiple linear regression model
model <- lm(income ~ age + single + married + divorced, data=df)

#view summary of model
summary(model)

Call:
lm(formula = income ~ age + single + married + divorced, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.7075 -5.0338  0.0453  3.3904 12.2454 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  16.7559    17.7811   0.942  0.37739   
age           1.4717     0.3544   4.152  0.00428 **
single       -2.4797     9.4313  -0.263  0.80018   
married           NA         NA      NA       NA   
divorced     -8.3974    12.7714  -0.658  0.53187   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.391 on 7 degrees of freedom
Multiple R-squared:  0.9008,  Adjusted R-squared:  0.8584 
F-statistic: 21.2 on 3 and 7 DF,  p-value: 0.0006865
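To avoid the trap, we could simply leave out the dummy for the baseline category; a sketch reusing the df defined above, with “Single” as the baseline:

#refit with the "single" dummy excluded ("Single" serves as the baseline category)
model2 <- lm(income ~ age + married + divorced, data=df)
summary(model2)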
Additional Resources
A Guide to Multicollinearity & VIF in Regression
How to Calculate VIF in R
How to Calculate VIF in Python
How to Calculate VIF in Excel