Tuesday, May 20, 2025

How to Perform Data Binning in R (With Examples)

You can use one of the following two methods to perform data binning in R:

Method 1: Use cut() Function

library(dplyr)

#perform binning with custom breaks
df %>% mutate(new_bin = cut(variable_name, breaks=c(0, 10, 20, 30)))

#perform binning with specific number of bins
df %>% mutate(new_bin = cut(variable_name, breaks=3))

Method 2: Use ntile() Function

library(dplyr)

#perform binning with specific number of bins
df %>% mutate(new_bin = ntile(variable_name, n=3))

The examples below show how to use each method in practice with this data frame:

#create data frame
df <- data.frame(points=c(4, 4, 7, 8, 12, 13, 15, 18, 22, 23, 23, 25),
                 assists=c(2, 5, 4, 7, 7, 8, 5, 4, 5, 11, 13, 8),
                 rebounds=c(7, 7, 4, 6, 3, 8, 9, 9, 12, 11, 8, 9))

#view head of data frame
head(df)

  points assists rebounds
1      4       2        7
2      4       5        7
3      7       4        4
4      8       7        6
5     12       7        3
6     13       8        8

Example 1: Perform Data Binning with cut() Function

The following code shows how to perform data binning on the points variable using the cut() function with specific break marks:

library(dplyr)

#perform data binning on points variable
df %>% mutate(points_bin = cut(points, breaks=c(0, 10, 20, 30)))

   points assists rebounds points_bin
1       4       2        7     (0,10]
2       4       5        7     (0,10]
3       7       4        4     (0,10]
4       8       7        6     (0,10]
5      12       7        3    (10,20]
6      13       8        8    (10,20]
7      15       5        9    (10,20]
8      18       4        9    (10,20]
9      22       5       12    (20,30]
10     23      11       11    (20,30]
11     23      13        8    (20,30]
12     25       8        9    (20,30]

Notice that each row of the data frame has been placed in one of three bins based on the value in the points column.
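
Two details of cut() are worth seeing in action: the intervals are closed on the right by default, and the bins can be given readable names with the labels argument. The following is a minimal sketch in base R only. The label names "low", "mid", and "high" and the test values 10 and 35 are illustrative assumptions, not part of the original example.

```r
#points column from the example data frame
points <- c(4, 4, 7, 8, 12, 13, 15, 18, 22, 23, 23, 25)

#same breaks as above, with readable (illustrative) labels
labeled <- cut(points, breaks=c(0, 10, 20, 30), labels=c("low", "mid", "high"))
table(labeled)

#default intervals are right-closed: 10 falls in (0,10]
default_bin <- as.character(cut(10, breaks=c(0, 10, 20, 30)))

#right=FALSE makes them left-closed: 10 falls in [10,20)
left_closed_bin <- as.character(cut(10, breaks=c(0, 10, 20, 30), right=FALSE))

#values outside the breaks are assigned NA
outside_bin <- cut(35, breaks=c(0, 10, 20, 30))
```

If a value sits exactly on a break point, the right argument decides which bin it belongs to, so it is worth checking before choosing round-number breaks.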

We could also pass a single number to the breaks argument to create that many bins of equal width, spanning the minimum value to the maximum value of the points column:

library(dplyr)

#perform data binning on points variable
df %>% mutate(points_bin = cut(points, breaks=3))

   points assists rebounds points_bin
1       4       2        7  (3.98,11]
2       4       5        7  (3.98,11]
3       7       4        4  (3.98,11]
4       8       7        6  (3.98,11]
5      12       7        3    (11,18]
6      13       8        8    (11,18]
7      15       5        9    (11,18]
8      18       4        9    (11,18]
9      22       5       12    (18,25]
10     23      11       11    (18,25]
11     23      13        8    (18,25]
12     25       8        9    (18,25]
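
To see where the (3.98,11] label comes from, the sketch below reproduces the break points by hand. It assumes the behavior documented for cut(): the range is split into equal-width pieces and the outer limits are then moved outward by 0.1% of the range so the minimum and maximum fall inside a bin. The variable names are hypothetical.

```r
#points column from the example data frame
points <- c(4, 4, 7, 8, 12, 13, 15, 18, 22, 23, 23, 25)

#equal-width break points between the minimum and maximum
rng <- range(points)
manual_breaks <- seq(rng[1], rng[2], length.out=4)

#widen the outer limits by 0.1% of the range so the extremes fall inside
pad <- diff(rng) / 1000
manual_breaks[1] <- manual_breaks[1] - pad
manual_breaks[4] <- manual_breaks[4] + pad

#both approaches assign every value to the same bin
auto_bins <- cut(points, breaks=3, labels=FALSE)
manual_bins <- cut(points, breaks=manual_breaks, labels=FALSE)
table(auto_bins)
```

The first manual break is 3.979, which cut() displays as 3.98 because labels are formatted to three significant digits by default.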

Example 2: Perform Data Binning with ntile() Function

The following code shows how to perform data binning on the points variable using the ntile() function with a specific number of resulting bins:

library(dplyr)

#perform data binning on points variable
df %>% mutate(points_bin = ntile(points, n=3))

   points assists rebounds points_bin
1       4       2        7          1
2       4       5        7          1
3       7       4        4          1
4       8       7        6          1
5      12       7        3          2
6      13       8        8          2
7      15       5        9          2
8      18       4        9          2
9      22       5       12          3
10     23      11       11          3
11     23      13        8          3
12     25       8        9          3

Notice that each row has been assigned a bin from 1 to 3 based on the value of the points column.

Use the ntile() function when you’d like each row to show an integer bin number rather than an interval showing the range of the bin.
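
The two methods also differ in how they split the data: cut() with a number of breaks produces bins of equal width, while ntile() produces bins with roughly equal counts. On the example data both give the same grouping, so the sketch below uses a hypothetical skewed vector x (not from the original data) to show the difference. To stay in base R, it approximates equal-count bins with quantile() break points instead of calling ntile() itself, so the result may differ from ntile() when the data contain ties.

```r
#hypothetical skewed data with one large value
x <- c(1, 2, 3, 4, 5, 100)

#equal-width bins: most values land in the first bin
width_bins <- cut(x, breaks=3, labels=FALSE)
tabulate(width_bins, nbins=3)

#equal-count bins from quantile break points (base R stand-in for ntile())
q_breaks <- quantile(x, probs=seq(0, 1, length.out=4))
count_bins <- cut(x, breaks=q_breaks, include.lowest=TRUE, labels=FALSE)
tabulate(count_bins, nbins=3)
```

With equal-width bins the middle bin is empty here, whereas the quantile-based bins each hold two values. Which behavior is preferable depends on whether the bin boundaries or the bin sizes matter more for the analysis.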

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Replace Values in Data Frame Conditionally in R
How to Calculate a Trimmed Mean in R
How to Calculate Conditional Mean in R
