# createDataPartition

##### Data Splitting functions

A series of test/training partitions are created using `createDataPartition` while `createResample` creates one or more bootstrap samples. `createFolds` splits the data into `k` groups while `createTimeSlices` creates cross-validation sample information to be used with time series data.

- Keywords
- utilities

##### Usage

```
createDataPartition(y,
                    times = 1,
                    p = 0.5,
                    list = TRUE,
                    groups = min(5, length(y)))
createResample(y, times = 10, list = TRUE)
createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
createMultiFolds(y, k = 10, times = 5)
createTimeSlices(y, initialWindow, horizon = 1,
                 fixedWindow = TRUE, skip = 0)
```

##### Arguments

- y
- a vector of outcomes. For `createTimeSlices`, these should be in chronological order.
- times
- the number of partitions to create
- p
- the proportion of data that goes to training (a value between 0 and 1)
- list
- logical - should the results be in a list (`TRUE`) or a matrix with the number of rows equal to `floor(p * length(y))` and `times` columns.
- groups
- for numeric `y`, the number of breaks in the quantiles (see below)
- k
- an integer for the number of folds.
- returnTrain
- a logical. When true, the values returned are the sample positions corresponding to the data used during training. This argument only works in conjunction with `list = TRUE`.
- initialWindow
- The initial number of consecutive values in each training set sample
- horizon
- The number of consecutive values in each test set sample
- fixedWindow
- A logical: if `FALSE`, the training set always starts at the first sample.
- skip
- An integer specifying how many (if any) resamples to skip to thin the total amount.
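
To make the effect of `returnTrain` concrete, the sketch below calls `createFolds` twice with the same seed, once with the default and once with `returnTrain = TRUE`, and checks that the two results are complementary. The `iris` data, the seed, and `k = 5` are assumptions chosen only for illustration.

```r
library(caret)

y <- iris$Species

set.seed(123)                                          # arbitrary seed
held_out <- createFolds(y, k = 5)                      # default: rows held out in each fold

set.seed(123)                                          # same seed, so the same fold assignment
training <- createFolds(y, k = 5, returnTrain = TRUE)  # rows used for training in each fold

# For each fold, the two index sets should not overlap and should
# together cover every row position
complementary <- sapply(seq_along(held_out), function(i) {
  length(intersect(held_out[[i]], training[[i]])) == 0 &&
    setequal(union(held_out[[i]], training[[i]]), seq_along(y))
})
print(complementary)
```

Resampling code that fits a model on each training set typically wants `returnTrain = TRUE`, while the default is convenient when only the held-out rows are needed.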

##### Details

For bootstrap samples, simple random sampling is used.

For other data splitting, the random sampling is done within the levels of `y` when `y` is a factor in an attempt to balance the class distributions within the splits.
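
As a rough illustration of this stratification, the following sketch compares class proportions in the full data with those in a training partition. The built-in `iris` data, the seed, and `p = 0.8` are assumptions chosen for illustration only.

```r
library(caret)

set.seed(42)                      # arbitrary seed, for reproducibility only
y <- iris$Species                 # a factor outcome with three classes

# list = FALSE returns a one-column matrix of training row positions
in_train <- createDataPartition(y, p = 0.8, list = FALSE)

# Class proportions overall and within the training rows
overall_props <- prop.table(table(y))
train_props   <- prop.table(table(y[in_train]))

print(round(rbind(overall = overall_props, train = train_props), 3))
```

If the sampling is stratified as described, the two rows of the printed table should be close to each other.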

For numeric `y`, the sample is split into groups based on percentiles and sampling is done within these subgroups. For `createDataPartition`, the number of percentiles is set via the `groups` argument. For `createFolds` and `createMultiFolds`, the number of groups is set dynamically based on the sample size and `k`. For smaller sample sizes, these two functions may not do stratified splitting and, at most, will split the data into quartiles.
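
The percentile-based idea can be sketched in base R as follows. This is a minimal illustration of the technique, not caret's actual implementation: the variable names, the simulated data, and the way break points are chosen are all assumptions, and the exact break points caret uses may differ.

```r
set.seed(1)                        # arbitrary seed
y <- rgamma(200, shape = 3, rate = 0.5)
groups <- 5                        # number of percentile-based bins (assumed)
p <- 0.5                           # proportion sent to training (assumed)

# Bin y at evenly spaced percentiles; unique() guards against tied break points
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Draw about p of the row positions from every bin
rows_by_bin <- split(seq_along(y), bins)
in_train <- sort(unlist(lapply(rows_by_bin, function(rows) {
  rows[sample.int(length(rows), size = ceiling(p * length(rows)))]
}), use.names = FALSE))

print(table(bins[in_train]))       # training rows drawn from each bin
```

Sampling within bins keeps the distribution of `y` similar in the training and test rows, which simple random sampling does not guarantee for small samples.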

Also, for `createDataPartition`, with very small class sizes (<= 3) the classes may not show up in both the training and test data.

For multiple k-fold cross-validation (`createMultiFolds`), completely independent folds are created. The names of the list objects will denote the fold membership using the pattern "Foldi.Repj", meaning the ith section (of k) of the jth cross-validation set (of `times`). Note that `createMultiFolds` calls `createFolds` with `list = TRUE` and `returnTrain = TRUE`.
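
A short sketch of the naming pattern follows. The `iris` data, the seed, and the choice of `k = 3` and `times = 2` are assumptions made only to keep the output small.

```r
library(caret)

set.seed(7)                        # arbitrary seed
y <- iris$Species

# 3-fold cross-validation repeated twice gives 3 * 2 = 6 index sets
multi <- createMultiFolds(y, k = 3, times = 2)

print(names(multi))                # names follow the Fold/Rep pattern
print(lengths(multi))              # each element holds training row positions
```

Because the elements are training sets rather than held-out sets, each one contains most of the rows, not roughly 1/k of them.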

Hyndman and Athanasopoulos (2013) discuss rolling forecasting origin techniques that move the training and test sets in time. `createTimeSlices` can create the indices for this type of splitting.
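
The index arithmetic behind a rolling forecasting origin can be sketched in base R. The helper `rolling_origin_slices` below is hypothetical and is not part of caret; its argument names only mirror those of `createTimeSlices`, and it assumes `initial_window + horizon` does not exceed `n`.

```r
# Hypothetical helper, not part of caret: build rolling-origin index sets
rolling_origin_slices <- function(n, initial_window, horizon = 1,
                                  fixed_window = TRUE, skip = 0) {
  # Last training position of each slice; skip thins the sequence of slices
  stops <- seq(initial_window, n - horizon, by = skip + 1)
  # Fixed windows slide forward; growing windows always start at position 1
  starts <- if (fixed_window) stops - initial_window + 1 else rep(1, length(stops))
  list(
    train = Map(seq, starts, stops),
    test  = Map(seq, stops + 1, stops + horizon)
  )
}

slices <- rolling_origin_slices(n = 9, initial_window = 5, horizon = 1)
slices$train[[1]]   # 1 2 3 4 5
slices$test[[1]]    # 6
slices$train[[4]]   # 4 5 6 7 8
slices$test[[4]]    # 9
```

With `fixed_window = FALSE` the training sets grow instead of sliding, so the last training set in this example would be positions 1 through 8.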

##### Value

- A list or matrix of row position integers corresponding to the training data

##### References

Hyndman and Athanasopoulos (2013), Forecasting: principles and practice.

##### Examples

```
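library(caret)

# Factor outcome: two stratified training partitions (times = 2)
data(oil)
createDataPartition(oilType, 2)

# Numeric outcome: compare densities of the training and test values
x <- rgamma(50, 3, .5)
inA <- createDataPartition(x, list = FALSE)
plot(density(x[inA]))
rug(x[inA])
points(density(x[-inA]), type = "l", col = 4)
rug(x[-inA], col = 4)

# Bootstrap samples (times = 2)
createResample(oilType, 2)

# k-fold splits: as a list, as a vector of fold labels (list = FALSE),
# and for a small numeric sample
createFolds(oilType, 10)
createFolds(oilType, 5, FALSE)
createFolds(rnorm(21))

# Time slices: arguments are y, initialWindow, horizon
createTimeSlices(1:9, 5, 1, fixedWindow = FALSE)
createTimeSlices(1:9, 5, 1, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = FALSE)

# Thinning the slices with skip
createTimeSlices(1:15, 5, 3)
createTimeSlices(1:15, 5, 3, skip = 2)
createTimeSlices(1:15, 5, 3, skip = 3)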
```

*Documentation reproduced from package caret, version 6.0-52, License: GPL (>= 2)*