# split

##### Divide into Groups and Reassemble

`split` divides the data in the vector `x` into the groups defined by `f`. The replacement forms replace values corresponding to such a division. `unsplit` reverses the effect of `split`.

- Keywords
- category

##### Usage

```
split(x, f, drop = FALSE, …)

# S3 method for default
split(x, f, drop = FALSE, sep = ".", lex.order = FALSE, …)

split(x, f, drop = FALSE, …) <- value

unsplit(value, f, drop = FALSE)
```

##### Arguments

- x
vector or data frame containing values to be divided into groups.

- f
a ‘factor’ in the sense that `as.factor(f)` defines the grouping, or a list of such factors in which case their interaction is used for the grouping.

- drop
logical indicating if levels that do not occur should be dropped (if `f` is a `factor` or a list).

- value
a list of vectors or data frames compatible with a splitting of `x`. Recycling applies if the lengths do not match.

- sep
character string, passed to `interaction` in the case where `f` is a `list`.

- lex.order
logical, passed to `interaction` when `f` is a list.

- …
further potential arguments passed to methods.
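
The effect of `drop`, `sep` and `lex.order` when `f` is a list can be seen on a small example. This is only a sketch: the vectors `x`, `f1` and `f2` are made-up toy data, not part of the original documentation.

```r
# Toy data (hypothetical): six values classified by two factors.
x  <- 1:6
f1 <- factor(c("a", "a", "b", "b", "a", "b"))
f2 <- factor(c("u", "v", "u", "u", "u", "u"))  # the pair ("b", "v") never occurs

# A list of factors is combined with interaction(); every level
# combination gets a component, even the empty one.
by_both <- split(x, list(f1, f2))

# drop = TRUE removes combinations that do not occur in the data.
dropped <- split(x, list(f1, f2), drop = TRUE)

# sep changes the separator used to build the component names.
under <- split(x, list(f1, f2), sep = "_")

# lex.order = TRUE orders the combinations lexicographically,
# so the first factor varies slowest instead of fastest.
lex <- split(x, list(f1, f2), lex.order = TRUE)

names(by_both)
names(dropped)
names(under)
names(lex)
```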

##### Details

`split` and `split<-` are generic functions with default and `data.frame` methods. The data frame method can also be used to split a matrix into a list of matrices, and the replacement form likewise, provided they are invoked explicitly.
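
As a rough illustration of invoking the data frame method explicitly on a matrix, the sketch below splits a small matrix by rows. The matrix `ma` and the grouping vector `g` are invented for the example.

```r
# Toy matrix (hypothetical) with six rows and a grouping vector for the rows.
ma <- cbind(x = 1:6, y = (1:6)^2)
g  <- c("lo", "lo", "hi", "hi", "lo", "hi")

# The generic dispatches to the default method for a matrix, which treats
# it as a plain vector (g is recycled over all 12 elements).
flat <- split(ma, g)

# Calling the data frame method by name splits by rows instead and
# keeps each piece as a matrix.
parts <- split.data.frame(ma, g)

is.matrix(flat$hi)   # FALSE: a plain vector
is.matrix(parts$hi)  # TRUE: rows 3, 4 and 6 of ma
```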

`unsplit` works with lists of vectors or data frames (assumed to have compatible structure, as if created by `split`). It puts elements or rows back in the positions given by `f`. In the data frame case, row names are obtained by unsplitting the row name vectors from the elements of `value`.
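
A small round trip may make the data frame case concrete. The data frame `df` below is hypothetical; the sketch only checks that rows and row names return to their original positions.

```r
# Toy data frame (hypothetical) with explicit row names.
df <- data.frame(
  id  = 1:6,
  grp = c("a", "b", "a", "c", "b", "a"),
  row.names = paste0("r", 1:6),
  stringsAsFactors = FALSE
)

# Split by group: one data frame per level of grp.
pieces <- split(df, df$grp)

# unsplit() needs the same grouping to put the rows back in place.
back <- unsplit(pieces, df$grp)

rownames(back)       # row names are restored in the original order
all.equal(back, df)
```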

`f` is recycled as necessary, and if the length of `x` is not a multiple of the length of `f`, a warning is printed.
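
The recycling rule can be sketched as follows. The use of `tryCatch()` is just one way to observe the warning for the purpose of this illustration.

```r
# f = 1:2 is shorter than x, so it is recycled as 1, 2, 1, 2, ...
halves <- split(1:10, 1:2)
halves[["1"]]  # elements at odd positions
halves[["2"]]  # elements at even positions

# 5 is not a multiple of 2: f is still recycled, but a warning is raised.
# tryCatch() is used here only to record whether that happened.
warned <- tryCatch(
  {
    split(1:5, 1:2)
    FALSE
  },
  warning = function(w) TRUE
)
warned
```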

Any missing values in `f` are dropped together with the corresponding values of `x`.
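
The sketch below shows the effect of missing values in `f` on made-up data. Wrapping `f` in `addNA()` to keep those observations is a suggestion of this example, not something the documentation above prescribes.

```r
# Toy data (hypothetical): two of the five group labels are missing.
x <- c(1, 2, 3, 4, 5)
f <- c("a", NA, "b", "a", NA)

# Observations whose group is NA do not appear in any component.
groups <- split(x, f)
sum(lengths(groups))  # fewer elements than length(x)

# One possible workaround: make NA an explicit level with addNA().
kept <- split(x, addNA(f))
sum(lengths(kept))
```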

The default method calls `interaction` when `f` is a `list`. If the levels of the factors contain `.`, the factors may not be split as expected, unless `sep` is set to a string not present in the factor `levels`.
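
To illustrate why the separator matters, the sketch below uses invented factor levels that themselves contain a dot. With the default separator the component names cannot be taken apart reliably; with a separator that does not occur in the levels they can.

```r
# Toy factors (hypothetical): the levels of f1 contain ".".
x  <- 1:3
f1 <- factor(c("v1.0", "v1.0", "v2.0"))
f2 <- factor(c("x", "y", "x"))

# Default sep = ".": names such as "v1.0.x" are ambiguous to parse.
ambiguous <- split(x, list(f1, f2))
names(ambiguous)

# A separator absent from the levels keeps the two parts distinguishable.
safe <- split(x, list(f1, f2), sep = "|")
names(safe)

# The original level pairs can then be recovered from the names.
parts <- strsplit(names(safe), "|", fixed = TRUE)
```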

##### Value

The value returned from `split` is a list of vectors containing the values for the groups. The components of the list are named by the levels of `f` (after converting to a factor, or if already a factor and `drop = TRUE`, dropping unused levels).

The replacement forms return their right hand side. `unsplit` returns a vector or data frame for which `split(x, f)` equals `value`.

##### References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
*The New S Language*.
Wadsworth & Brooks/Cole.

##### See Also

`cut` to categorize numeric values.

`strsplit` to split strings.

##### Examples

`library(base)`

```
# NOT RUN {
require(stats); require(graphics)
n <- 10; nn <- 100
g <- factor(round(n * runif(n * nn)))
x <- rnorm(n * nn) + sqrt(as.numeric(g))
xg <- split(x, g)
boxplot(xg, col = "lavender", notch = TRUE, varwidth = TRUE)
sapply(xg, length)
sapply(xg, mean)
### Calculate 'z-scores' by group (standardize to mean zero, variance one)
z <- unsplit(lapply(split(x, g), scale), g)
# or
zz <- x
split(zz, g) <- lapply(split(x, g), scale)
# and check that the within-group std dev is indeed one
tapply(z, g, sd)
tapply(zz, g, sd)
### data frame variation
## Notice that assignment form is not used since a variable is being added
g <- airquality$Month
l <- split(airquality, g)
l <- lapply(l, transform, Oz.Z = scale(Ozone))
aq2 <- unsplit(l, g)
head(aq2)
with(aq2, tapply(Oz.Z, Month, sd, na.rm = TRUE))
### Split a matrix into a list by columns
ma <- cbind(x = 1:10, y = (-4:5)^2)
split(ma, col(ma))
split(1:10, 1:2)
# }
```

*Documentation reproduced from package base, version 3.5.0, License: Part of R 3.5.0*

### Community examples

**richie@datacamp.com** at Jan 17, 2017 base v3.3.2

A common usage of `split()` is to split a data frame by a factor.

```{r}
data(PlantGrowth)
split(PlantGrowth, PlantGrowth$group)
```

You can also split vectors.

```{r}
with(PlantGrowth, split(weight, group))
```

To split by numeric ranges, combine `split()` with `cut()`.

```{r}
split(PlantGrowth, cut(PlantGrowth$weight, 3:7))
```

`drop = TRUE` drops the zero-length elements that unused levels would otherwise produce in the resulting list.

```{r}
(f <- factor(
  sample(c("a", "e", "i", "o", "u"), 20, replace = TRUE),
  levels = letters
))
split(1:20, f)
split(1:20, f, drop = TRUE)
```

You can also split by passing a numeric index.

```{r}
split(1:10, 1:2)
```

A more useful example is to split a data frame into a list of one-row data frames suitable for use with [`lapply()`](https://www.rdocumentation.org/packages/base/topics/lapply).

```{r}
head(plant_list <- split(PlantGrowth, seq_len(nrow(PlantGrowth))))
sapply(plant_list, function(x) x$weight ^ 2)
```

To split by multiple factors, pass them in a list.

```{r}
with(CO2, split(CO2, list(Plant, Type, Treatment), drop = TRUE))
```

A common data analysis pattern is to split a dataset into groups, apply an action to each group, then combine those results together. See [The Split-Apply-Combine Strategy for Data Analysis](https://www.jstatsoft.org/article/view/v040i01) by Hadley Wickham (2011).

```{r}
# split
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
# apply
mean_plant_weights_by_group <- lapply(plant_weights_by_group, mean)
# combine
unlist(mean_plant_weights_by_group)
```

This can be shortened: [`sapply()`](https://www.rdocumentation.org/packages/base/topics/lapply) does the last two steps together. Even better, [`tapply()`](https://www.rdocumentation.org/packages/base/topics/tapply) does all three.

```{r}
with(PlantGrowth, tapply(weight, group, mean))
```

Another common split-apply-combine problem is to calculate 'z-scores' by group (scaled to mean zero, variance one). This time, the answer is the same length as the input.

```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
scaled_plant_weights_by_group <- lapply(plant_weights_by_group, scale)
unlist(scaled_plant_weights_by_group)
```

The quick version uses [`ave()`](https://www.rdocumentation.org/packages/stats/topics/ave) instead of [`tapply()`](https://www.rdocumentation.org/packages/base/topics/tapply).

```{r}
with(PlantGrowth, ave(weight, group, FUN = scale))
```

Occasionally, you may want to reverse the splitting process.

```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
unsplit(plant_weights_by_group, PlantGrowth$group)
```

For vectors it is usually easier to use [`unlist()`](https://www.rdocumentation.org/packages/base/topics/unlist), since you don't need to remember what the splitter was. Note that this restores the original order only when the data were already ordered by group, as they are here.

```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
unlist(plant_weights_by_group)
```

Similarly, for data frames, [`do.call()`](https://www.rdocumentation.org/packages/base/topics/do.call) + [`rbind()`](https://www.rdocumentation.org/packages/base/topics/rbind) (or [`dplyr::bind_rows()`](https://www.rdocumentation.org/packages/dplyr/topics/bind_all)) is usually easier.

```{r}
plants_by_group <- split(PlantGrowth, PlantGrowth$group)
do.call(rbind, plants_by_group)
```

There are many alternatives for solving split-apply-combine problems. `dplyr`'s [`group_by()`](https://www.rdocumentation.org/packages/dplyr/topics/group_by) is its equivalent of `split()`.

```{r}
library(dplyr)
# tapply() equivalent
PlantGrowth %>%
  group_by(group) %>%
  summarize(mean_weight = mean(weight))
# ave() equivalent
PlantGrowth %>%
  group_by(group) %>%
  mutate(scaled_weight = scale(weight))
```

Similarly, `data.table` has a `by` argument.

```{r}
library(data.table)
# tapply() equivalent
as.data.table(PlantGrowth)[, .(mean_weight = mean(weight)), by = group]
# ave() equivalent
pg <- as.data.table(PlantGrowth)[, scaled_weight := scale(weight), by = group]
print(pg)
```