slice.corpus: Subset documents using their positions

Description

slice() lets you index documents by their (integer) locations. It allows you to select, remove, and duplicate documents. It is accompanied by a number of helpers for common use cases:

slice_head() and slice_tail() select the first or last documents.
slice_sample() randomly selects documents.
slice_min() and slice_max() select the documents with the lowest or highest values of a document variable.

Usage

# S3 method for corpus
slice(.data, ..., .preserve = FALSE)
# S3 method for corpus
slice_head(.data, ..., n, prop)
# S3 method for corpus
slice_tail(.data, ..., n, prop)
# S3 method for corpus
slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)
# S3 method for corpus
slice_min(.data, ..., n, prop, with_ties = TRUE)
# S3 method for corpus
slice_max(.data, ..., n, prop, with_ties = TRUE)

Value

An object of the same type as .data. The output has the following properties:

Each document may appear 0, 1, or many times in the output. (If duplicated, then document names will be modified to remain unique.)
Document variables are not modified.
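
The properties above can be checked on a toy corpus. This is a minimal sketch that assumes the quanteda and quanteda.tidy packages are installed; the object names (corp, dropped, dup) and the year variable are illustrative only.

```r
library(quanteda)
library(quanteda.tidy)

# Toy corpus with one document variable (all names here are illustrative).
corp <- corpus(
  c(a = "First text.", b = "Second text.", c = "Third text."),
  docvars = data.frame(year = c(2001, 2002, 2003))
)

dropped <- slice(corp, -2)          # document "b" appears 0 times
dup     <- slice(corp, c(1, 1, 3))  # document "a" appears twice

ndoc(dropped)
ndoc(dup)
docnames(dup)   # names are adjusted so that they stay unique
docvars(dup)    # year values are carried over unchanged
```

Inspecting docnames(dup) shows how the duplicated document has been renamed; the exact naming scheme is not specified on this page, so it is best not to rely on it.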

Arguments

.data

a quanteda corpus object

...

additional arguments passed to methods

.preserve

Relevant when the .data input is grouped. If .preserve = FALSE (the default), the grouping structure is recalculated based on the resulting data, otherwise the grouping is kept as is.

n, prop

Provide either n, the number of documents, or prop, the proportion of documents to select. If neither is supplied, n = 1 will be used.

If n is greater than the number of documents in the group (or prop > 1), the result will be silently truncated to the group size. If prop multiplied by the group size is not an integer, it is rounded down.
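
As a rough illustration of the rule above, the number of documents returned for a single group can be sketched in base R. The helper n_selected() is hypothetical and is not part of quanteda.tidy or dplyr; it only mirrors the default, truncation, and rounding-down behaviour described here.

```r
# Hypothetical helper: how many documents a slice_*() call would keep
# for one group of `n_docs` documents, following the rule described above.
n_selected <- function(n_docs, n = NULL, prop = NULL) {
  if (!is.null(n) && !is.null(prop)) {
    stop("Supply either `n` or `prop`, not both.")
  }
  if (is.null(n) && is.null(prop)) {
    n <- 1L                       # default when neither is supplied
  }
  if (is.null(n)) {
    n <- floor(prop * n_docs)     # non-integer results are rounded down
  }
  as.integer(min(n, n_docs))      # silently truncate to the group size
}

n_selected(59)               # neither supplied: 1
n_selected(59, n = 100)      # truncated to 59
n_selected(59, prop = 0.05)  # floor(2.95) = 2
```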

weight_by

<data-masking> Sampling weights. This must evaluate to a vector of non-negative numbers the same length as the input. Weights are automatically standardised to sum to 1.
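
To make the standardisation concrete, here is a small base R analogy using sample(). It is a sketch of the idea only, not necessarily how slice_sample() is implemented, and the weights and document names are made up.

```r
# Toy weights for five documents; any non-negative numbers are allowed.
w <- c(d1 = 10, d2 = 0, d3 = 5, d4 = 5, d5 = 20)

# Standardise so the weights sum to 1.
p <- w / sum(w)

set.seed(42)
# Draw two document positions without replacement, using the weights.
picked <- sample(seq_along(w), size = 2, replace = FALSE, prob = p)
names(w)[picked]
```

A document with weight 0 (d2 here) can never be drawn, while d5 is the most likely to be selected.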

replace

Should sampling be performed with (TRUE) or without (FALSE, the default) replacement.

with_ties

Should ties be kept together? The default, TRUE, may return more documents than you request. Use FALSE to ignore ties and return exactly the first n documents.
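
The difference between the two settings can be sketched in base R with a made-up document variable. This illustrates the selection logic for the smallest value only, under the assumption that ties are resolved by minimum rank; it is not the package's own code.

```r
# Toy document variable: token counts with a tie for the smallest value.
ntoks <- c(d1 = 30, d2 = 12, d3 = 12, d4 = 45)
n <- 1

# with_ties = TRUE (sketch): keep every document whose minimum rank is <= n.
keep_ties <- names(ntoks)[rank(ntoks, ties.method = "min") <= n]

# with_ties = FALSE (sketch): keep exactly the first n documents after sorting.
keep_no_ties <- names(ntoks)[order(ntoks)[seq_len(n)]]

keep_ties      # "d2" "d3"
keep_no_ties   # "d2"
```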

Examples


library(quanteda)
library(dplyr)          # provides n() and %>%
library(quanteda.tidy)

slice(data_corpus_inaugural, 2:5)
slice(data_corpus_inaugural, 55:n())
slice_head(data_corpus_inaugural, n = 2)
slice_tail(data_corpus_inaugural, n = 3)
slice_tail(data_corpus_inaugural, prop = .05)

set.seed(42)
slice_sample(data_corpus_inaugural, n = 3)
slice_sample(data_corpus_inaugural, prop = .10, replace = TRUE)

data_corpus_inaugural <- data_corpus_inaugural %>%
    mutate(ntoks = ntoken(data_corpus_inaugural))
# shortest three texts
slice_min(data_corpus_inaugural, ntoks, n = 3)
# longest three texts
slice_max(data_corpus_inaugural, ntoks, n = 3)
