splitstackshape (version 1.4.0)

stratified: Take a Stratified Sample From a Dataset

Description

The stratified function samples from a data.frame or a data.table in which one or more columns can be used as a "stratification" or "grouping" variable. The result is a new data.table with the specified number of samples from each group.
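
The idea of drawing a fixed number of rows from each group can be sketched in base R. This is only an illustration of the concept, not the package's implementation; sample_per_group and the toy data are hypothetical names introduced here.

```r
# Base R sketch of stratified sampling: draw up to n rows from each group.
# `sample_per_group` is a hypothetical helper, not part of splitstackshape.
sample_per_group <- function(df, group_col, n) {
  # Split the row indices by the values of the grouping column
  idx_by_group <- split(seq_len(nrow(df)), df[[group_col]])
  # From each group, pick up to n row indices without replacement
  picked <- lapply(idx_by_group, function(idx) {
    idx[sample.int(length(idx), min(n, length(idx)))]
  })
  df[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

set.seed(1)
toy <- data.frame(id = 1:12, grp = rep(c("a", "b", "c"), times = c(6, 4, 2)))
out <- sample_per_group(toy, "grp", 2)
table(out$grp)  # two rows from each of a, b, and c
```

Indexing with sample.int rather than calling sample on the indices directly avoids the base R pitfall where sample(x) on a single number samples from 1:x.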

Usage

stratified(indt, group, size, select = NULL, replace = FALSE,
  keep.rownames = FALSE, bothSets = FALSE, ...)

Arguments

indt
The input data.frame or data.table.
group
The column or columns that should be used to create the groups. Can be a character vector of column names (recommended) or a numeric vector of column positions. Generally, if you are using more than one variable to create your "strata", you should list th
size
The desired sample size.
  • If size is a value between 0 and 1 expressed as a decimal, size is set to be proportional to the number of observations per group.
  • If size is a single positive integer, i
select
A named list containing levels from the "group" variables in which you are interested. The list names must be present as variable names for the input dataset.
replace
Logical. Should sampling be with replacement? Defaults to FALSE.
keep.rownames
Logical. If the input is a data.frame or a matrix, as.data.table would normally drop the rownames. If TRUE, the rownames are retained in a column named rn. Defaults to FALSE.
bothSets
Logical. Should both the sampled and non-sampled sets be returned as a list? Defaults to FALSE.
...
Optional arguments to sample.
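
To make the proportional form of size concrete, the following base R sketch shows how a fraction might translate into per-group sample counts. It assumes the count is rounded to the nearest whole number, which the text above does not state; the grp vector is made-up data.

```r
# How a proportional size could map to per-group sample counts.
# Assumption: counts are rounded to the nearest whole number.
grp <- rep(c("AA", "BB", "CC"), times = c(24, 16, 11))
group_n <- table(grp)              # observations per group
per_group <- round(group_n * 0.1)  # 10% of each group, rounded
per_group
```

Under that assumption, a small fraction applied to a small group can round down to zero, so it is worth checking group sizes before choosing size.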

Value

  • If bothSets = TRUE, a list of two data.tables; otherwise, a data.table.
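
A short sketch of working with the list form of the return value. It assumes the splitstackshape package is installed and that the sampled set comes first in the list, as the description of bothSets suggests; the toy data and variable names are illustrative only.

```r
# Sketch: split a dataset into sampled and non-sampled parts.
# Assumes splitstackshape is installed; skipped otherwise.
if (requireNamespace("splitstackshape", quietly = TRUE)) {
  set.seed(1)
  toy <- data.frame(id = 1:20, grp = rep(c("a", "b"), each = 10))
  both <- splitstackshape::stratified(toy, "grp", size = 3, bothSets = TRUE)
  sampled <- both[[1]]      # assumed: rows drawn from each group
  not_sampled <- both[[2]]  # assumed: the remaining rows
  print(nrow(sampled))
  print(nrow(not_sampled))
}
```

A split like this is a common way to build training and hold-out sets that keep every group represented.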

See Also

strata from the "strata" package; sample_n and sample_frac from "dplyr".

Examples

# Generate a sample data.frame to play with
set.seed(1)
dat1 <- data.frame(ID = 1:100,
              A = sample(c("AA", "BB", "CC", "DD", "EE"),
                         100, replace = TRUE),
              B = rnorm(100), C = abs(round(rnorm(100), digits=1)),
              D = sample(c("CA", "NY", "TX"), 100, replace = TRUE),
              E = sample(c("M", "F"), 100, replace = TRUE))

# Let's take a 10% sample from all -A- groups in dat1
stratified(dat1, "A", .1)

# Let's take a 10% sample from only "AA" and "BB" groups from -A- in dat1
stratified(dat1, "A", .1, select = list(A = c("AA", "BB")))

# Let's take 5 samples from all -D- groups in dat1,
#   specified by column number
stratified(dat1, group = 5, size = 5)

# Use a two-column strata: -E- and -D-
#   -E- varies more slowly, so it is better to put that first
stratified(dat1, c("E", "D"), size = .15)

# Use a two-column strata (-E- and -D-) but only interested in
#   cases where -E- == "M"
stratified(dat1, c("E", "D"), .15, select = list(E = "M"))

## As above, but where -E- == "M" and -D- == "CA" or "TX"
stratified(dat1, c("E", "D"), .15,
     select = list(E = "M", D = c("CA", "TX")))

# Use a three-column strata: -E-, -D-, and -A-
s.out <- stratified(dat1, c("E", "D", "A"), size = 2)