manip: Data manipulation functions.

Description

These five functions form the backbone of dplyr. They are all S3 generic functions with methods for each individual data type. All functions work exactly the same way: the first argument is the tbl, and the subsequent arguments are interpreted in the context of that tbl.

Usage

filter(.data, ...)
summarise(.data, ...)
summarize(.data, ...)
mutate(.data, ...)
arrange(.data, ...)
select(.data, ...)

Arguments

.data

a tbl

...

variables interpreted in the context of that data frame.

Manipulation functions

The five key data manipulation functions are:

filter: return only a subset of the rows. If multiple conditions are supplied they are combined with &.
select: return only a subset of the columns. If multiple columns are supplied they are all used.
arrange: reorder the rows. Multiple inputs are ordered from left to right.
mutate: add new columns. Multiple inputs create multiple columns.
summarise: reduce each group to a single row. Multiple inputs create multiple output summaries.

These are all made significantly more useful when applied by group, as with group_by.
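As a small illustration of the filter description above, the following sketch compares passing two conditions as separate arguments with joining them explicitly using &. It assumes dplyr is installed and attached; the object names a and b are only for illustration.

```r
library(dplyr)

# Two conditions supplied as separate arguments...
a <- filter(mtcars, cyl == 8, hp > 200)

# ...and the same two conditions joined explicitly with &.
b <- filter(mtcars, cyl == 8 & hp > 200)

# Both calls should keep the same number of rows.
nrow(a) == nrow(b)
```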

Tbls

dplyr comes with the following built-in tbls. Read the help for the manip methods of that class to get more details:

data.frame: manip_df
data.table: manip_dt
SQLite: src_sqlite
PostgreSQL: src_postgres
MySQL: src_mysql

Output

Generally, manipulation functions will return an output object of the same type as their input. The exceptions are:

summarise will return an ungrouped source
remote sources (like databases) will typically return a local source from at least summarise and mutate

Row names

dplyr methods do not preserve row names. If you have been using row names to store important information, please make them explicit variables.
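One way to make row names explicit is sketched below using base R only. The column name model and the object name cars are hypothetical choices for this example, not something dplyr requires.

```r
# Copy the built-in mtcars data so the original is left untouched.
cars <- mtcars

# Store the row names in an ordinary column (the name `model` is arbitrary).
cars$model <- rownames(cars)

# Reset the row names to the default integer sequence.
rownames(cars) <- NULL

head(cars[, c("model", "mpg", "cyl")], 3)
```

After this step the car names travel with the data as a regular variable, so they remain available after any of the manipulation functions is applied.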

Arrange

Note that for local data frames, the ordering is done in C++ code, which does not have access to the locale-specific ordering usually done in R. This means that strings are ordered as if in the C locale.
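To show what C-locale ordering looks like, the sketch below uses base R's sort() with method = "radix", which orders character vectors as in the C locale. It does not call arrange itself, and the sample strings are made up for illustration.

```r
# Made-up strings mixing upper and lower case.
x <- c("banana", "Apple", "apple", "Banana")

# Radix sort on character vectors uses C-locale ordering:
# all upper-case letters sort before all lower-case letters.
c_order <- sort(x, method = "radix")
c_order
# "Apple" "Banana" "apple" "banana"
```

In many other locales, sort(x) would instead place "apple" next to "Apple", which is the difference the note above warns about.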

Selection

As well as using existing functions like : and c, there are a number of special functions that only work inside select:

starts_with(x, ignore.case = FALSE): selects all variables whose name starts with x
ends_with(x, ignore.case = FALSE): selects all variables whose name ends with x
contains(x, ignore.case = FALSE): selects all variables whose name contains x
matches(x, ignore.case = FALSE): selects all variables whose name matches the regular expression x
num_range("x", 1:5, width = 2): selects all variables (numerically) from x01 to x05.

To drop variables, use -. You can rename variables with named arguments.

Examples

filter(mtcars, cyl == 8)
select(mtcars, mpg, cyl, hp:vs)
arrange(mtcars, cyl, disp)
mutate(mtcars, displ_l = disp / 61.0237)
summarise(mtcars, mean(disp))
summarise(group_by(mtcars, cyl), mean(disp))

# More detailed select examples ------------------------------
iris <- tbl_df(iris) # so it prints a little nicer
select(iris, starts_with("Petal"))
select(iris, ends_with("Width"))
select(iris, contains("etal"))
select(iris, matches(".t."))
select(iris, Petal.Length, Petal.Width)

df <- as.data.frame(matrix(runif(100), nrow = 10))
df <- tbl_df(df[c(3, 4, 7, 1, 9, 8, 5, 2, 6, 10)])
select(df, V4:V6)
select(df, num_range("V", 4:6))

# Drop variables
select(iris, -starts_with("Petal"))
select(iris, -ends_with("Width"))
select(iris, -contains("etal"))
select(iris, -matches(".t."))
select(iris, -Petal.Length, -Petal.Width)

# Rename variables
select(iris, petal_length = Petal.Length)
select(iris, petal = starts_with("Petal"))
