In statistics, we’re typically interested in using data for one of two purposes:
(1) Inference: We want to understand the nature of the relationship between the predictor variables and the response variable in an existing dataset.
(2) Prediction: We want to use an existing dataset to build a model that predicts the value of the response variable of a new observation.
For example, suppose we have a dataset that contains the square footage, number of bedrooms, number of bathrooms, and price of a set of houses.
An example of inference:
Suppose we build a multiple linear regression model using square feet, number of bedrooms, and number of bathrooms as predictor variables and price as the response variable.
We could then use the regression coefficients to understand the average change in price associated with a one-unit change in each predictor variable, holding the other predictor variables constant.
For example, we could estimate how much price changes (on average) with each additional bedroom, each additional bathroom, and each additional square foot.
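As a rough illustration of reading coefficients this way, the sketch below fits a multiple linear regression with NumPy's `numpy.linalg.lstsq`. The six houses and the pricing rule are hypothetical (they are not the dataset discussed above), and the prices are deliberately noise-free so the fitted coefficients are easy to check against the rule that generated them.

```python
import numpy as np

# Hypothetical houses (NOT the article's dataset).
# Columns: square feet, bedrooms, bathrooms.
X = np.array([
    [1200, 2, 1],
    [1500, 3, 2],
    [1800, 3, 1],
    [2100, 4, 2],
    [2400, 3, 3],
    [3000, 5, 3],
], dtype=float)

# Made-up, noise-free prices so the fitted coefficients are easy to check:
# price = 50,000 + 100 * sqft + 10,000 * bedrooms + 15,000 * bathrooms
price = 50_000 + X @ np.array([100.0, 10_000.0, 15_000.0])

# Prepend a column of ones for the intercept, then fit by ordinary least squares.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, price, rcond=None)

# Each slope is the average change in price for a one-unit change in that
# predictor, holding the other two predictors constant.
for name, value in zip(["intercept", "square feet", "bedrooms", "bathrooms"], coef):
    print(f"{name:>12}: {value:,.2f}")
```

With real data the prices would include noise, so the coefficients would be estimates with uncertainty rather than an exact recovery of a known rule.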
An example of prediction:
We could build the same multiple linear regression model and use it to predict how much a new home will be worth based on its square footage, number of bedrooms, and number of bathrooms.
For example, we could use the model to predict the price of a new home that has 3 bedrooms, 3 bathrooms, and 2,000 square feet.
We could then compare our prediction with the actual listing price and assess whether or not the home appears to be under- or over-valued.
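The prediction step can be sketched in the same way. The training houses, the pricing rule, and the listing price below are all hypothetical and chosen only to make the comparison concrete; only the new home's features (2,000 square feet, 3 bedrooms, 3 bathrooms) come from the example above.

```python
import numpy as np

# Hypothetical training data (NOT the article's dataset).
# Columns: square feet, bedrooms, bathrooms.
X = np.array([
    [1200, 2, 1],
    [1500, 3, 2],
    [1800, 3, 1],
    [2100, 4, 2],
    [2400, 3, 3],
    [3000, 5, 3],
], dtype=float)

# Made-up, noise-free prices:
# price = 50,000 + 100 * sqft + 10,000 * bedrooms + 15,000 * bathrooms
price = 50_000 + X @ np.array([100.0, 10_000.0, 15_000.0])

# Fit ordinary least squares with an intercept column.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, price, rcond=None)

# New home from the text: 2,000 square feet, 3 bedrooms, 3 bathrooms.
# The leading 1.0 multiplies the intercept.
new_home = np.array([1.0, 2000.0, 3.0, 3.0])
predicted_price = float(new_home @ coef)

# Hypothetical listing price, used only to illustrate the comparison.
listing_price = 310_000.0
gap = listing_price - predicted_price
verdict = "below" if gap < 0 else "at or above"
print(f"Predicted price: {predicted_price:,.0f}")
print(f"The home is listed {verdict} the model's estimate (gap: {gap:,.0f}).")
```

A listing price below the model's estimate suggests the home may be under-valued, but only to the extent that the model captures what actually drives price.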
The following examples illustrate the difference between inference and prediction in different scenarios:
Example 1: Inference vs. Prediction in Sports
Suppose we have a dataset that contains the points, rebounds, assists, and wins of professional basketball teams.
An example of inference:
Suppose we build a multiple linear regression model using points, rebounds, and assists as predictor variables and wins as the response variable.
We could then use the model to understand how much the number of wins changes (on average) with each additional point, rebound, and assist, holding the other two predictor variables constant.
An example of prediction:
We could build the same multiple linear regression model and use it to predict how many wins a team will have based on its number of points, rebounds, and assists.
For example, we could use the model to predict the number of wins that a team with 90 points, 40 rebounds, and 30 assists will have.
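Because inference and prediction use the same fitted model, both can be read from a single fit. The sketch below assumes a small set of hypothetical team statistics and a made-up, noise-free rule for wins (neither comes from the dataset discussed above), then reports the coefficients and a prediction for the team with 90 points, 40 rebounds, and 30 assists.

```python
import numpy as np

# Hypothetical team statistics (NOT the article's dataset).
# Columns: points, rebounds, assists.
stats = np.array([
    [85, 38, 20],
    [95, 42, 25],
    [100, 45, 22],
    [105, 40, 28],
    [110, 47, 30],
    [92, 44, 24],
], dtype=float)

# Made-up, noise-free wins so the results are easy to check:
# wins = -60 + 0.8 * points + 0.5 * rebounds + 0.4 * assists
wins = -60 + stats @ np.array([0.8, 0.5, 0.4])

# Fit ordinary least squares with an intercept column.
design = np.column_stack([np.ones(len(stats)), stats])
coef, *_ = np.linalg.lstsq(design, wins, rcond=None)

# Inference: average change in wins per additional point, rebound, or assist,
# holding the other two predictors constant.
per_point, per_rebound, per_assist = coef[1:]
print(f"per point: {per_point:.2f}, per rebound: {per_rebound:.2f}, "
      f"per assist: {per_assist:.2f}")

# Prediction: the team from the text (90 points, 40 rebounds, 30 assists).
new_team = np.array([1.0, 90.0, 40.0, 30.0])  # leading 1.0 is for the intercept
predicted_wins = float(new_team @ coef)
print(f"predicted wins: {predicted_wins:.1f}")
```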
Example 2: Inference vs. Prediction in Business
Suppose we have a dataset that contains the advertising spend, number of employees, total acquisitions, and annual revenue (in millions) of various businesses.
An example of inference:
Suppose we build a multiple linear regression model using advertising spend, number of employees, and total acquisitions as predictor variables and annual revenue as the response variable.
We could then use the model to understand how much annual revenue changes (on average) with each additional dollar spent on advertising, each additional employee, and each additional acquisition, holding the other predictor variables constant.
An example of prediction:
We could build the same multiple linear regression model and use it to predict the annual revenue of a business based on its advertising spend, number of employees, and total acquisitions.
For example, we could use the model to predict the annual revenue of a business that spends $25 million on advertising, has 40 employees, and has had 2 acquisitions.
Additional Resources
The following tutorials offer additional information about important terms to understand in statistics:
Descriptive vs. Inferential Statistics: What’s the Difference?
Levels of Measurement: Nominal, Ordinal, Interval and Ratio
Qualitative vs. Quantitative Variables: What’s the Difference?