
How to Perform One-Hot Encoding in R

One-hot encoding is used to convert categorical variables into a format that can be used by machine learning algorithms.

The basic idea of one-hot encoding is to create new variables that take on values 0 and 1 to represent the original categorical values.

For example, one-hot encoding a categorical variable that contains the team names A, B, and C creates three new variables, one for each team, that contain only 0 and 1 values.

The following step-by-step example shows how to perform one-hot encoding on a dataset of this kind in R.
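
As a rough illustration of this idea, here is a minimal base R sketch that builds the 0/1 columns by hand for the same team values used in the step-by-step example. The names team, team_levels, and encoded are illustrative only, and the snippet is not meant to reflect how any particular package implements the encoding.

```r
# Team values taken from the example data in this tutorial
team <- c("A", "A", "B", "B", "B", "B", "C", "C")

# Unique categories, sorted so the column order is predictable
team_levels <- sort(unique(team))

# For each category, mark matching rows with 1 and all other rows with 0
encoded <- sapply(team_levels, function(lv) as.integer(team == lv))

# Each row contains exactly one 1, in the column for that row's team
encoded
```

The result is a matrix with one row per observation and one column per team name.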

Step 1: Create the Data

First, let’s create the following data frame in R:

#create data frame
df <- data.frame(team=c('A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
                 points=c(25, 12, 15, 14, 19, 23, 25, 29))

#view data frame
df

  team points
1    A     25
2    A     12
3    B     15
4    B     14
5    B     19
6    B     23
7    C     25
8    C     29
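
Before encoding, it can help to check how many distinct categories the column holds, since that determines how many new columns to expect. The following is a small sketch using only base R functions; the data frame is re-created so the snippet runs on its own, and the name n_levels is illustrative.

```r
# Re-create the example data frame so this snippet is self-contained
df <- data.frame(team=c('A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
                 points=c(25, 12, 15, 14, 19, 23, 25, 29))

# List the distinct team names
unique(df$team)

# Count the distinct team names (one new column is expected per name)
n_levels <- length(unique(df$team))
n_levels

# Count the rows that belong to each team
table(df$team)
```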

Step 2: Perform One-Hot Encoding

Next, let’s use the dummyVars() function from the caret package to perform one-hot encoding on the ‘team’ variable in the data frame:

library(caret)

#define one-hot encoding function
dummy <- dummyVars(" ~ .", data=df)

#perform one-hot encoding on data frame
final_df <- data.frame(predict(dummy, newdata=df))

#view final data frame
final_df

  teamA teamB teamC points
1     1     0     0     25
2     1     0     0     12
3     0     1     0     15
4     0     1     0     14
5     0     1     0     19
6     0     1     0     23
7     0     0     1     25
8     0     0     1     29 

Notice that three new columns were added to the data frame since the original ‘team’ column contained three unique values.

Also notice that the original ‘team’ column was dropped from the data frame since it’s no longer needed.
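
As a cross-check that does not depend on caret, base R's model.matrix() can produce indicator columns as well. The sketch below is an alternative technique rather than the method used in this tutorial, and the name check_df is illustrative. Removing the intercept with - 1 keeps an indicator column for every team instead of dropping the first team as a baseline.

```r
# Re-create the example data frame so this snippet is self-contained
df <- data.frame(team=c('A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
                 points=c(25, 12, 15, 14, 19, 23, 25, 29))

# Build indicator columns for team and keep points unchanged;
# "- 1" removes the intercept so every team gets its own column
check_df <- as.data.frame(model.matrix(~ team + points - 1, data=df))

# View the result and compare it with the caret output above
check_df
```

If the two results agree, each row should have exactly one 1 across the team columns and the points column should match the original values.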

The one-hot encoding is complete and we can now feed this dataset into any machine learning algorithm that we’d like.

Note: The complete documentation for the dummyVars() function is available in the caret package reference manual.

Additional Resources

The following tutorials offer additional information about working with categorical variables:

How to Create Categorical Variables in R
How to Plot Categorical Data in R
Categorical vs. Quantitative Variables: What’s the Difference?
