0.4 C
London
Friday, March 14, 2025
HomeSoftware TutorialsPythonHow to Use Pandas Get Dummies – pd.get_dummies

How to Use Pandas Get Dummies – pd.get_dummies

Related stories

Learn About Opening an Automobile Repair Shop in India

Starting a car repair shop is quite a good...

Unlocking the Power: Embracing the Benefits of Tax-Free Investing

  Unlocking the Power: Embracing the Benefits of Tax-Free Investing For...

Income Splitting in Canada for 2023

  Income Splitting in Canada for 2023 The federal government’s expanded...

Can I Deduct Home Office Expenses on my Tax Return 2023?

Can I Deduct Home Office Expenses on my Tax...

Canadian Tax – Personal Tax Deadline 2022

  Canadian Tax – Personal Tax Deadline 2022 Resources and Tools...

Often in statistics, the datasets we’re working with include categorical variables.

These are variables that take on names or labels. Examples include:

  • Marital status (“married”, “single”, “divorced”)
  • Smoking status (“smoker”, “non-smoker”)
  • Eye color (“blue”, “green”, “hazel”)
  • Level of education (e.g. “high school”,  “Bachelor’s degree”, “Master’s degree”)

When fitting machine learning algorithms (like linear regression, logistic regression, random forests, etc.), we often convert categorical variables to dummy variables, which are numeric variables that are used to represent categorical data.

For example, suppose we have a dataset that contains the categorical variable Gender. To use this variable as a predictor in a regression model, we would first need to convert it to a dummy variable.

To create this dummy variable, we can choose one of the values (“Male”) to represent 0 and the other value (“Female”) to represent 1:

How to Create Dummy Variables in Pandas

To create dummy variables for a variable in a pandas DataFrame, we can use the pandas.get_dummies() function, which uses the following basic syntax:

pandas.get_dummies(data, prefix=None, columns=None, drop_first=False)

where:

  • data: The name of the pandas DataFrame
  • prefix: A string to append to the front of the new dummy variable column
  • columns: The name of the column(s) to convert to a dummy variable
  • drop_first: Whether or not to drop the first dummy variable column

The following examples show how to use this function in practice.

Example 1: Create a Single Dummy Variable

Suppose we have the following pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'income': [45, 48, 54, 57, 65, 69, 78],
                   'age': [23, 25, 24, 29, 38, 36, 40],
                   'gender': ['M', 'F', 'M', 'F', 'F', 'F', 'M']})

#view DataFrame
df

        income	age	gender
0	45	23	M
1	48	25	F
2	54	24	M
3	57	29	F
4	65	38	F
5	69	36	F
6	78	40	M

We can use the pd.get_dummies() function to turn gender into a dummy variable:

#convert gender to dummy variable
pd.get_dummies(df, columns=['gender'], drop_first=True)

	income	age	gender_M
0	45	23	1
1	48	25	0
2	54	24	1
3	57	29	0
4	65	38	0
5	69	36	0
6	78	40	1

The gender column is now a dummy variable where:

  • A value of 0 represents “Female”
  • A value of 1 represents “Male”

Example 2: Create Multiple Dummy Variables

Suppose we have the following pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'income': [45, 48, 54, 57, 65, 69, 78],
                   'age': [23, 25, 24, 29, 38, 36, 40],
                   'gender': ['M', 'F', 'M', 'F', 'F', 'F', 'M'],
                   'college': ['Y', 'N', 'N', 'N', 'Y', 'Y', 'Y']})

#view DataFrame
df

	income	age	gender	college
0	45	23	M	Y
1	48	25	F	N
2	54	24	M	N
3	57	29	F	N
4	65	38	F	Y
5	69	36	F	Y
6	78	40	M	Y

We can use the pd.get_dummies() function to convert gender and college both into dummy variables:

#convert gender to dummy variable
pd.get_dummies(df, columns=['gender', 'college'], drop_first=True)


        income	age	gender_M  college_Y
0	45	23	1	  1
1	48	25	0	  0
2	54	24	1	  0
3	57	29	0	  0
4	65	38	0	  1
5	69	36	0	  1
6	78	40	1	  1

The gender column is now a dummy variable where:

  • A value of 0 represents “Female”
  • A value of 1 represents “Male”

And the college column is now a dummy variable where:

  • A value of 0 represents “No” college
  • A value of 1 represents “Yes” college

Additional Resources

How to Use Dummy Variables in Regression Analysis
What is the Dummy Variable Trap?

Subscribe

- Never miss a story with notifications

- Gain full access to our premium content

- Browse free from up to 5 devices at once

Latest stories