split
Divide into Groups and Reassemble
split divides the data in the vector x into the groups defined by f. The replacement forms replace values corresponding to such a division. unsplit reverses the effect of split.
- Keywords
- category
Usage
split(x, f, drop = FALSE, ...)
split(x, f, drop = FALSE, ...) <- value
unsplit(value, f, drop = FALSE)
Arguments
- x
- vector or data frame containing values to be divided into groups.
- f
- a factor in the sense that as.factor(f) defines the grouping, or a list of such factors in which case their interaction is used for the grouping.
- drop
- logical indicating if levels that do not occur should be dropped (if f is a factor or a list).
- value
- a list of vectors or data frames compatible with a splitting of x. Recycling applies if the lengths do not match.
- ...
- further potential arguments passed to methods.
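As a minimal sketch of how f and drop interact, the following uses made-up toy data (x, f1 and f2 are hypothetical names, not part of the original page). It shows a single factor with an unused level, the effect of drop = TRUE, and a list of two grouping variables.

```r
# Toy data: the values and both grouping vectors are invented for illustration.
x  <- 1:6
f1 <- factor(c("a", "a", "b", "b", "a", "b"), levels = c("a", "b", "c"))
f2 <- c("u", "v", "u", "v", "u", "v")

# A single factor: the unused level "c" yields an empty component.
by_f1 <- split(x, f1)
lengths(by_f1)

# drop = TRUE removes the components for levels that do not occur.
by_f1_dropped <- split(x, f1, drop = TRUE)
names(by_f1_dropped)

# A list of grouping variables: groups are defined by their interaction.
by_both <- split(x, list(f1, f2), drop = TRUE)
names(by_both)
```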
Details
split and split<- are generic functions with default and data.frame methods. The data frame method can also be used to split a matrix into a list of matrices, and the replacement form likewise, provided they are invoked explicitly.
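The following is a small sketch of what invoking the data frame methods explicitly on a matrix can look like. The matrix m and the grouping vector row_group are hypothetical and chosen only for illustration.

```r
# Hypothetical toy matrix and row grouping, made up for this illustration.
m <- matrix(1:12, nrow = 4, dimnames = list(NULL, c("a", "b", "c")))
row_group <- c("g1", "g2", "g1", "g2")

# split(m, row_group) would dispatch to the default method and treat m as a
# plain vector, so the data frame method is called by name instead.
parts <- split.data.frame(m, row_group)
parts$g1   # a 2 x 3 matrix holding rows 1 and 3 of m

# The replacement method, also called by name, writes rows back into place.
scaled <- lapply(parts, function(p) p * 10L)
m2 <- `split<-.data.frame`(m, row_group, value = scaled)
all(m2 == m * 10L)
```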
unsplit works with lists of vectors or data frames (assumed to have compatible structure, as if created by split). It puts elements or rows back in the positions given by f. In the data frame case, row names are obtained by unsplitting the row name vectors from the elements of value.
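To illustrate the data frame case, here is a short sketch with an invented data frame (df and its columns are hypothetical). It splits by a grouping column whose values are interleaved, then checks that unsplit restores both the row order and the row names.

```r
# Hypothetical data frame with custom row names and interleaved groups.
df <- data.frame(
  grp = c("b", "a", "b", "a"),
  val = c(10, 20, 30, 40),
  row.names = c("w", "x", "y", "z"),
  stringsAsFactors = FALSE
)

parts <- split(df, df$grp)          # list with components "a" and "b"
restored <- unsplit(parts, df$grp)  # rows go back to their original positions

identical(rownames(restored), rownames(df))
identical(restored$val, df$val)
```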
f is recycled as necessary and if the length of x is not a multiple of the length of f a warning is printed. Any missing values in f are dropped together with the corresponding values of x.
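The recycling and missing-value rules can be sketched as follows. The vectors are made up for illustration, and the warning is captured with tryCatch only so that its text can be inspected.

```r
x <- 1:6

# f has length 2 and is recycled: odd positions form group "1", even ones "2".
recycled <- split(x, c(1, 2))
recycled

# When length(x) is not a multiple of length(f), split() signals a warning.
msg <- tryCatch(split(1:5, c(1, 2)), warning = function(w) conditionMessage(w))
msg

# Elements of x whose group is NA do not appear in the result.
f_na <- c("a", NA, "b", "a", NA, "b")
with_na <- split(x, f_na)
with_na
```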
When f is a list, the default method calls interaction. If the levels of the factors contain . they may not be split as expected, so the method has argument sep which is used to join the levels.
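A minimal sketch of the problem with . in factor levels, using invented grouping vectors f1 and f2: two different combinations of levels are pasted into the same label under the default separator, whereas a separator that does not occur in the levels (here "|", an arbitrary choice) keeps them apart.

```r
x  <- 1:4
f1 <- c("a.b", "a", "a.b", "a")
f2 <- c("c", "b.c", "c", "b.c")

# With the default sep = ".", the pairs ("a.b", "c") and ("a", "b.c") are both
# pasted into the label "a.b.c", so they cannot be told apart by name.
with_dot <- split(x, list(f1, f2))
names(with_dot)

# A separator absent from the levels gives each combination its own label.
with_bar <- split(x, list(f1, f2), sep = "|", drop = TRUE)
names(with_bar)
```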
Value
The value returned from split is a list of vectors containing the values for the groups. The components of the list are named by the levels of f (after converting to a factor, or if already a factor and drop = TRUE, dropping unused levels).
The replacement forms return their right hand side. unsplit returns a vector or data frame for which split(x, f) equals value.
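The relationships described above can be checked with a short sketch on made-up data (x and f are hypothetical): the components are named by the levels of f, unsplit inverts split, and the replacement form updates values group by group.

```r
x <- c(5.1, 4.9, 6.3, 5.8, 7.0)
f <- factor(c("a", "b", "a", "c", "b"))

parts <- split(x, f)
names(parts)                      # the levels of f

# unsplit() puts every value back in its original position.
identical(unsplit(parts, f), x)

# Replacement form: centre each group on its own mean, in place.
y <- x
split(y, f) <- lapply(split(x, f), function(v) v - mean(v))
tapply(y, f, mean)                # group means are now (numerically) zero
```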
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
See Also
cut to categorize numeric values.
strsplit to split strings.
Examples
library(base)
require(stats); require(graphics)
n <- 10; nn <- 100
g <- factor(round(n * runif(n * nn)))
x <- rnorm(n * nn) + sqrt(as.numeric(g))
xg <- split(x, g)
boxplot(xg, col = "lavender", notch = TRUE, varwidth = TRUE)
sapply(xg, length)
sapply(xg, mean)
### Calculate 'z-scores' by group (standardize to mean zero, variance one)
z <- unsplit(lapply(split(x, g), scale), g)
# or
zz <- x
split(zz, g) <- lapply(split(x, g), scale)
# and check that the within-group std dev is indeed one
tapply(z, g, sd)
tapply(zz, g, sd)
### data frame variation
## Notice that assignment form is not used since a variable is being added
g <- airquality$Month
l <- split(airquality, g)
l <- lapply(l, transform, Oz.Z = scale(Ozone))
aq2 <- unsplit(l, g)
head(aq2)
with(aq2, tapply(Oz.Z, Month, sd, na.rm = TRUE))
### Split a matrix into a list by columns
ma <- cbind(x = 1:10, y = (-4:5)^2)
split(ma, col(ma))
## f is recycled: odd positions form group 1, even positions group 2
split(1:10, 1:2)
Community examples
A common use of `split()` is to split a data frame by a factor.

```{r}
data(PlantGrowth)
split(PlantGrowth, PlantGrowth$group)
```

You can also split vectors.

```{r}
with(PlantGrowth, split(weight, group))
```

To split by numeric ranges, combine `split()` with `cut()`.

```{r}
split(PlantGrowth, cut(PlantGrowth$weight, 3:7))
```

`drop = TRUE` removes the zero-length elements that unused factor levels would otherwise produce in the resulting list.

```{r}
(f <- factor(
  sample(c("a", "e", "i", "o", "u"), 20, replace = TRUE),
  levels = letters
))
split(1:20, f)
split(1:20, f, drop = TRUE)
```

You can also split by passing a numeric index, which is recycled as needed.

```{r}
split(1:10, 1:2)
```

A more useful example is to split a data frame into a list of one-row data frames suitable for use with [`lapply()`](https://www.rdocumentation.org/packages/base/topics/lapply).

```{r}
head(plant_list <- split(PlantGrowth, seq_len(nrow(PlantGrowth))))
sapply(plant_list, function(x) x$weight ^ 2)
```

To split by multiple factors, pass them in a list.

```{r}
with(CO2, split(CO2, list(Plant, Type, Treatment), drop = TRUE))
```

A common data analysis pattern is to split a dataset into groups, apply an action to each group, then combine those results together. See [The Split-Apply-Combine Strategy for Data Analysis](https://www.jstatsoft.org/article/view/v040i01) by Hadley Wickham (2011).

```{r}
# split
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
# apply
mean_plant_weights_by_group <- lapply(plant_weights_by_group, mean)
# combine
unlist(mean_plant_weights_by_group)
```

This can be shortened: [`sapply()`](https://www.rdocumentation.org/packages/base/topics/lapply) does the last two steps together. Even better, [`tapply()`](https://www.rdocumentation.org/packages/base/topics/tapply) does all three.

```{r}
with(PlantGrowth, tapply(weight, group, mean))
```

Another common split-apply-combine problem is to calculate 'z-scores' by group (scaled to mean zero, variance one). This time, the answer is the same length as the input.

```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
scaled_plant_weights_by_group <- lapply(plant_weights_by_group, scale)
unlist(scaled_plant_weights_by_group)
```

The quick version uses [`ave()`](https://www.rdocumentation.org/packages/stats/topics/ave) instead of [`tapply()`](https://www.rdocumentation.org/packages/base/topics/tapply).

```{r}
with(PlantGrowth, ave(weight, group, FUN = scale))
```

Occasionally, you may want to reverse the splitting process.

```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
unsplit(plant_weights_by_group, PlantGrowth$group)
```

For vectors it is usually easier to use [`unlist()`](https://www.rdocumentation.org/packages/base/topics/unlist), since you don't need to remember what the splitter was. Note that `unlist()` returns the values in group order, which matches the original order only when the data are already sorted by group, as `PlantGrowth` is.

```{r}
plant_weights_by_group <- with(PlantGrowth, split(weight, group))
unlist(plant_weights_by_group)
```

Similarly, for data frames, [`do.call()`](https://www.rdocumentation.org/packages/base/topics/do.call) + [`rbind()`](https://www.rdocumentation.org/packages/base/topics/rbind) (or `dplyr::bind_rows()`) is usually easier.

```{r}
plants_by_group <- split(PlantGrowth, PlantGrowth$group)
do.call(rbind, plants_by_group)
```

There are many alternatives for solving split-apply-combine problems. `dplyr`'s [`group_by()`](https://www.rdocumentation.org/packages/dplyr/topics/group_by) is its equivalent of `split()`.

```{r}
library(dplyr)
# tapply() equivalent
PlantGrowth %>%
  group_by(group) %>%
  summarize(mean_weight = mean(weight))
# ave() equivalent
PlantGrowth %>%
  group_by(group) %>%
  mutate(scaled_weight = scale(weight))
```

Similarly, `data.table` has a `by` argument.

```{r}
library(data.table)
# tapply() equivalent
as.data.table(PlantGrowth)[, .(mean_weight = mean(weight)), by = group]
# ave() equivalent
pg <- as.data.table(PlantGrowth)[, scaled_weight := scale(weight), by = group]
print(pg)
```
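
The examples above say that `sapply()` combines the apply and combine steps but only show the `tapply()` form. As a minimal sketch using the same built-in `PlantGrowth` data (the variable names are chosen here for illustration), the two approaches give the same group means:

```r
# split, then apply and combine in one step with sapply()
weights_by_group <- with(PlantGrowth, split(weight, group))
means_sapply <- sapply(weights_by_group, mean)

# all three steps at once with tapply()
means_tapply <- with(PlantGrowth, tapply(weight, group, mean))

# Same values and same group names; tapply() returns a 1-d array,
# so the values are compared after stripping attributes.
all.equal(unname(means_sapply), as.vector(means_tapply))
identical(names(means_sapply), names(means_tapply))
```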