Divide into Groups and Reassemble

split divides the data in the vector x into the groups defined by f. The replacement forms replace values corresponding to such a division. unsplit reverses the effect of split.

Usage

split(x, f, drop = FALSE, ...)
split(x, f, drop = FALSE, ...) <- value
unsplit(value, f, drop = FALSE)

Arguments

x: vector or data frame containing values to be divided into groups.
f: a ‘factor’ in the sense that as.factor(f) defines the grouping, or a list of such factors in which case their interaction is used for the grouping.
drop: logical indicating if levels that do not occur should be dropped (if f is a factor or a list).
value: a list of vectors or data frames compatible with a splitting of x. Recycling applies if the lengths do not match.
...: further potential arguments passed to methods.

Details

split and split<- are generic functions with default and data.frame methods. The data frame method can also be used to split a matrix into a list of matrices, and the replacement form likewise, provided they are invoked explicitly.
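As a minimal sketch of what "invoked explicitly" can look like, the example below calls the data frame methods by name on a matrix. The objects m, grp, parts and m10 are names made up for this illustration.

```{r}
# A 6 x 2 matrix and a grouping vector with one entry per row
m <- cbind(a = 1:6, b = 11:16)
grp <- c("u", "v", "u", "v", "u", "v")

# split(m, grp) would dispatch to the default method and treat m as a plain
# vector; calling the data frame method by name splits by rows instead.
parts <- split.data.frame(m, grp)
parts$u   # rows 1, 3 and 5 of m, still a matrix

# The replacement method can be called by name in the same way.
# Here every row group is multiplied by 10 and written back.
m10 <- `split<-.data.frame`(m, grp, value = lapply(parts, function(p) p * 10))
m10
```

Calling the methods directly bypasses S3 dispatch, which is why the matrix does not need a class of its own.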

unsplit works with lists of vectors or data frames (assumed to have compatible structure, as if created by split). It puts elements or rows back in the positions given by f. In the data frame case, row names are obtained by unsplitting the row name vectors from the elements of value.
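To illustrate how unsplit uses f to restore positions, the sketch below splits a vector and a data frame whose groups are interleaved and then reassembles them. The objects x, f, pieces, df and back are illustrative names, not part of the documented interface.

```{r}
# Interleaved groups, so grouped order differs from the original order
x <- c(10, 20, 30, 40, 50)
f <- factor(c("b", "a", "b", "a", "b"))

pieces <- split(x, f)
unsplit(pieces, f)                  # 10 20 30 40 50: original positions
unlist(pieces, use.names = FALSE)   # 20 40 10 30 50: grouped order

# Data frame case: rows and row names both return to their original places
df <- data.frame(
  v = c(4, 3, 2, 1),
  g = c("b", "a", "b", "a"),
  row.names = c("r1", "r2", "r3", "r4")
)
back <- unsplit(split(df, df$g), df$g)
rownames(back)                      # "r1" "r2" "r3" "r4"
back$v                              # 4 3 2 1
```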

f is recycled as necessary, and if the length of x is not a multiple of the length of f, a warning is printed.
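A short sketch of the recycling rule follows. The labels "odd" and "even" and the variable names are chosen only for this example, and the exact warning text is not reproduced here because it may differ between R versions.

```{r}
# f is shorter than x: c("odd", "even") is recycled along 1:6
even_odd <- split(1:6, c("odd", "even"))
even_odd$odd    # 1 3 5
even_odd$even   # 2 4 6

# length(x) = 5 is not a multiple of length(f) = 2, so a warning is raised.
# tryCatch() is used here only to capture the warning message as a string.
msg <- tryCatch(
  split(1:5, c("odd", "even")),
  warning = function(w) conditionMessage(w)
)
msg
```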

Any missing values in f are dropped together with the corresponding values of x.
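The following sketch shows the dropping behaviour, together with one possible way to keep those values by turning NA into an explicit level with addNA(). The use of addNA() is a suggestion for this example, not something the page above prescribes, and x, f, dropped and kept are illustrative names.

```{r}
x <- 1:5
f <- c("a", NA, "b", "a", NA)

# Positions 2 and 5 have a missing group, so they do not appear in the result
dropped <- split(x, f)
dropped$a   # 1 4
dropped$b   # 3

# Making NA an explicit factor level keeps those values in their own group
kept <- split(x, addNA(factor(f)))
lengths(kept)
```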

The default method calls interaction when f is a list. If the levels of the factors contain ".", the data may not be split as expected, so the method has an argument sep which is used to join the levels.
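The sketch below shows why a "." inside factor levels can be a problem and how sep avoids it. The factors f1 and f2 are contrived for the illustration, and "_" is just one separator that happens not to occur in their levels.

```{r}
x <- 1:4
f1 <- c("a.b", "a", "a.b", "a")
f2 <- c("c", "b.c", "c", "b.c")

# With the default separator ".", the combinations ("a.b", "c") and
# ("a", "b.c") both produce the label "a.b.c", so they cannot be told apart.
names(split(x, list(f1, f2), drop = TRUE))

# A separator that does not occur in the levels keeps the groups distinct
by_pair <- split(x, list(f1, f2), drop = TRUE, sep = "_")
by_pair[["a.b_c"]]   # 1 3
by_pair[["a_b.c"]]   # 2 4
```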


Value

The value returned from split is a list of vectors containing the values for the groups. The components of the list are named by the levels of f (after converting to a factor, or if already a factor and drop = TRUE, dropping unused levels).

The replacement forms return their right hand side. unsplit returns a vector or data frame for which split(x, f) equals value.
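As a small illustration of these return values, the sketch below checks the naming of the list, uses the replacement form to centre each group in place, and verifies the round trip through unsplit. All object names are made up for the example.

```{r}
f <- factor(c("lo", "hi", "lo"), levels = c("lo", "mid", "hi"))
x <- c(1, 2, 3)

# The list is named by the levels of f; the unused level "mid" is kept
# as an empty element because drop = FALSE by default
value <- split(x, f)
names(value)   # "lo" "mid" "hi"

# Replacement form: subtract each group's mean, writing back into y
y <- x
split(y, f) <- lapply(split(x, f), function(v) v - mean(v))
y              # -1 0 1

# Round trip: splitting the result of unsplit() gives back the same list
round_trip <- split(unsplit(value, f), f)
identical(round_trip, value)
```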


References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

See Also

cut to categorize numeric values.

strsplit to split strings.

Examples

    require(stats); require(graphics)
    n <- 10; nn <- 100
    g <- factor(round(n * runif(n * nn)))
    x <- rnorm(n * nn) + sqrt(as.numeric(g))
    xg <- split(x, g)
    boxplot(xg, col = "lavender", notch = TRUE, varwidth = TRUE)
    sapply(xg, length)
    sapply(xg, mean)

    ### Calculate 'z-scores' by group (standardize to mean zero, variance one)
    z <- unsplit(lapply(split(x, g), scale), g)

    # or
    zz <- x
    split(zz, g) <- lapply(split(x, g), scale)

    # and check that the within-group std dev is indeed one
    tapply(z, g, sd)
    tapply(zz, g, sd)

    ### data frame variation
    ## Notice that assignment form is not used since a variable is being added
    g <- airquality$Month
    l <- split(airquality, g)
    l <- lapply(l, transform, Oz.Z = scale(Ozone))
    aq2 <- unsplit(l, g)
    head(aq2)
    with(aq2, tapply(Oz.Z, Month, sd, na.rm = TRUE))

    ### Split a matrix into a list by columns
    ma <- cbind(x = 1:10, y = (-4:5)^2)
    split(ma, col(ma))

    split(1:10, 1:2)
A common usage of `split()` is to split a data frame by a factor.

```{r}
data(PlantGrowth)
split(PlantGrowth, PlantGrowth$group)
```

You can also split vectors.

```{r}
with(PlantGrowth, split(weight, group))
```

To split by numeric ranges, combine `split()` with `cut()`.

```{r}
split(PlantGrowth, cut(PlantGrowth$weight, 3:7))
```

`drop = TRUE` drops the zero-length list elements that correspond to unused factor levels.

```{r}
(f <- factor(
  sample(c("a", "e", "i", "o", "u"), 20, replace = TRUE),
  levels = letters
))
split(1:20, f)
split(1:20, f, drop = TRUE)
```

You can also split by passing a numeric index.

```{r}
split(1:10, 1:2)
```

A more useful example is to split a data frame into a list of one-row data frames suitable for use with [`lapply()`](https://www.rdocumentation.org/packages/base/topics/lapply).

```{r}
head(plant_list <- split(PlantGrowth, seq_len(nrow(PlantGrowth))))
sapply(plant_list, function(x) x$weight ^ 2)
```

To split by multiple factors, pass them in a list.

```{r}
with(CO2, split(CO2, list(Plant, Type, Treatment), drop = TRUE))
```

A common data analysis pattern is to split a dataset into groups, apply an action to each group, then combine those results together. See [The Split-Apply-Combine Strategy for Data Analysis](https://www.jstatsoft.org/article/view/v040i01) by Hadley Wickham (2011).

```{r}
# split
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
# apply
mean_plant_weights_by_group <- lapply(plant_weights_by_group, mean)
# combine
unlist(mean_plant_weights_by_group)
```

We can shorten this: [`sapply()`](https://www.rdocumentation.org/packages/base/topics/lapply) does the last two steps together. Even better, [`tapply()`](https://www.rdocumentation.org/packages/base/topics/tapply) does all three.

```{r}
with(PlantGrowth, tapply(weight, group, mean))
```

Another common split-apply-combine problem is to calculate 'z-scores' by group (scaled to mean zero, variance one). This time, the answer is the same length as the input.
```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
scaled_plant_weights_by_group <- lapply(plant_weights_by_group, scale)
unlist(scaled_plant_weights_by_group)
```

The quick version uses [`ave()`](https://www.rdocumentation.org/packages/stats/topics/ave) instead of [`tapply()`](https://www.rdocumentation.org/packages/base/topics/tapply).

```{r}
with(PlantGrowth, ave(weight, group, FUN = scale))
```

Occasionally, you may want to reverse the splitting process.

```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
unsplit(plant_weights_by_group, PlantGrowth$group)
```

For vectors it is usually easier to use [`unlist()`](https://www.rdocumentation.org/packages/base/topics/unlist), since you don't need to remember what the splitter was. Note that `unlist()` returns the values in group order, which matches the original order only when the data were already sorted by group, as `PlantGrowth` is.

```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
unlist(plant_weights_by_group)
```

Similarly, for data frames, [`do.call()`](https://www.rdocumentation.org/packages/base/topics/do.call) + [`rbind()`](https://www.rdocumentation.org/packages/base/topics/rbind) (or [`dplyr::bind_rows()`](https://www.rdocumentation.org/packages/dplyr/topics/bind_all)) is usually easier, with the same caveat about row order.

```{r}
plants_by_group <- split(PlantGrowth, PlantGrowth$group)
do.call(rbind, plants_by_group)
```

There are many alternatives for solving split-apply-combine problems. `dplyr`'s [`group_by()`](https://www.rdocumentation.org/packages/dplyr/topics/group_by) is its equivalent of `split()`.

```{r}
library(dplyr)
# tapply() equivalent
PlantGrowth %>%
  group_by(group) %>%
  summarize(mean_weight = mean(weight))
# ave() equivalent
# as.numeric() drops the matrix structure that scale() returns
PlantGrowth %>%
  group_by(group) %>%
  mutate(scaled_weight = as.numeric(scale(weight)))
```

Similarly, `data.table` has a `by` argument.

```{r}
library(data.table)
# tapply() equivalent
as.data.table(PlantGrowth)[, .(mean_weight = mean(weight)), by = group]
# ave() equivalent
# as.numeric() drops the matrix structure that scale() returns
pg <- as.data.table(PlantGrowth)[, scaled_weight := as.numeric(scale(weight)), by = group]
print(pg)
```