How to Use createDataPartition() Function in R

You can use the createDataPartition() function from the caret package in R to partition a data frame into training and testing sets for model building.

This function uses the following basic syntax:

createDataPartition(y, times = 1, p = 0.5, list = TRUE, …)

where:

  • y: vector of outcomes
  • times: number of partitions to create
  • p: proportion of data (a value between 0 and 1) to use in training set
  • list: whether to return results as a list (TRUE) or as a matrix (FALSE)
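
To make the times and list arguments concrete, here is a minimal sketch that calls createDataPartition() on a made-up outcome vector and inspects the shape of the result. The vector y and the object names parts_list and parts_mat are illustrative assumptions and are not used elsewhere in this tutorial:

```r
library(caret)

#make this sketch reproducible
set.seed(1)

#hypothetical outcome vector with 100 values
y <- rnorm(100)

#list=TRUE (the default) returns a list with one element per partition
parts_list <- createDataPartition(y, times=2, p=0.8, list=TRUE)
class(parts_list)    #"list"
length(parts_list)   #2

#list=FALSE returns a matrix with one column per partition
parts_mat <- createDataPartition(y, times=2, p=0.8, list=FALSE)
ncol(parts_mat)      #2

#each column holds the row indices selected for one training set
head(parts_mat)
```

The matrix form is convenient when the indices are used directly to subset a data frame, which is why list=FALSE is a common choice for a single split.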

The following example shows how to use this function in practice.

Example: Using createDataPartition() in R

Suppose we have some data frame in R with 1,000 rows that contains information about hours studied by students and their corresponding score on a final exam:

#make this example reproducible
set.seed(0)

#create data frame
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))

#view head of data frame
head(df)

     hours    score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
6 2.016819 47.10139

Suppose we would like to fit a simple linear regression model that uses hours studied to predict final exam score.

We want to train the model on 80% of the rows in the data frame and test it on the remaining 20%.

The following code shows how to use the createDataPartition() function from the caret package to split the data frame into training and testing sets:

library(caret)

#partition data frame into training and testing sets
train_indices <- createDataPartition(df$score, times=1, p=.8, list=FALSE)

#create training set
df_train <- df[train_indices, ]

#create testing set
df_test <- df[-train_indices, ]

#view number of rows in each set
nrow(df_train)

[1] 800

nrow(df_test)

[1] 200

We can see that our training dataset contains 800 rows, which represents 80% of the original dataset.

Similarly, we can see that our test dataset contains 200 rows, which represents 20% of the original dataset.
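
The two row counts add up to the full dataset because the testing set is taken as every row that is not in the training indices. The following base R sketch illustrates that idea, assuming the testing set is built by negative indexing. The data frame toy and the hand-picked vector toy_idx are hypothetical stand-ins for df and the output of createDataPartition():

```r
#hypothetical 10-row data frame, for illustration only
toy <- data.frame(id=1:10, x=(1:10)^2)

#hand-picked training indices standing in for createDataPartition() output
toy_idx <- c(1, 2, 3, 5, 6, 7, 9, 10)

toy_train <- toy[toy_idx, ]    #rows listed in toy_idx
toy_test  <- toy[-toy_idx, ]   #every row not listed in toy_idx

#the two sets share no rows and together cover the full data frame
length(intersect(toy_train$id, toy_test$id))   #0
nrow(toy_train) + nrow(toy_test) == nrow(toy)  #TRUE
```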

We can also view the first few rows of each set:

#view head of training set
head(df_train)

     hours    score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
7 8.983897 42.34600

#view head of testing set
head(df_test)

      hours    score
6  2.016819 47.10139
12 2.059746 96.67170
18 7.176185 92.61150
23 2.121425 89.17611
24 6.516738 50.47970
25 1.255551 90.58483

We can then proceed to train the regression model using the training set and assess its performance using the testing set.
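
As one possible way to carry out that last step, the sketch below fits the model with lm() on the training set and computes the root mean squared error (RMSE) on the testing set. RMSE is used here only as an example metric, and the names model, preds and rmse are assumptions introduced for illustration. The data and the split are recreated so that the block runs on its own:

```r
library(caret)

#recreate the data and the 80/20 split so this sketch runs on its own
set.seed(0)
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))
train_indices <- createDataPartition(df$score, times=1, p=.8, list=FALSE)
df_train <- df[train_indices, ]
df_test  <- df[-train_indices, ]

#fit simple linear regression on the training set only
model <- lm(score ~ hours, data=df_train)

#predict exam scores for the held-out rows
preds <- predict(model, newdata=df_test)

#root mean squared error on the testing set
rmse <- sqrt(mean((df_test$score - preds)^2))
rmse
```

Because hours and score are generated independently in this example, the fitted slope should not be expected to be meaningful. The point is only the workflow of fitting on one set and evaluating on the other.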

Additional Resources

The following tutorials explain how to use other common functions in R:

How to Perform K-Fold Cross Validation in R
How to Perform Multiple Linear Regression in R
How to Perform Logistic Regression in R
