tokens_subset: Extract a subset of a tokens object

Description

Returns the documents of a tokens object that meet certain conditions, including direct logical operations on docvars (document-level variables). tokens_subset works in the same way as subset.data.frame: it uses non-standard evaluation to evaluate the conditions against the docvars of the tokens object.
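The comparison with subset.data.frame can be illustrated in base R alone. The sketch below is only an analogy: the data frame `docs` and its columns `doc_id` and `grp` are hypothetical stand-ins for a tokens object and its docvars, and no quanteda function is called.

```r
# Hypothetical stand-in for document-level variables (docvars).
docs <- data.frame(
  doc_id = c("d1", "d2", "d3", "d4"),
  grp    = c(1, 2, NA, 3),
  stringsAsFactors = FALSE
)

# The condition uses the bare column name `grp` (non-standard evaluation);
# it is evaluated inside the data frame rather than in the calling scope.
kept <- subset(docs, grp > 1)

# d1 fails the condition; d3 evaluates to NA and is dropped as if FALSE.
kept$doc_id
# [1] "d2" "d4"
```

Writing `grp > 1` instead of `docs$grp > 1` is the convenience that non-standard evaluation provides, and the Description states that tokens_subset evaluates conditions on docvars in the same way.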

Usage

tokens_subset(x, subset, select, ...)

Arguments

x

tokens object to be subsetted

subset

logical expression indicating the documents to keep; missing values are treated as FALSE

select

expression indicating the docvars to select from the tokens object; or a tokens object, in which case the returned tokens object will contain the same documents, in the same order, as the original tokens object, even if these are empty.

...

not used
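Because the Description likens this function to subset.data.frame, the roles of subset and select can be sketched with the base R function. This is an illustration under that stated analogy, not a demonstration of tokens_subset itself; the data frame `meta` and its columns are hypothetical.

```r
# Hypothetical document-level metadata with two variables besides the id.
meta <- data.frame(
  doc_id = c("d1", "d2", "d3"),
  grp    = c(1, 2, 3),
  year   = c(2001, 2002, 2003),
  stringsAsFactors = FALSE
)

# `subset` chooses the rows (documents); `select` chooses the columns
# (variables). Both are written with bare names.
res <- subset(meta, grp >= 2, select = c(doc_id, year))

names(res)
# [1] "doc_id" "year"
nrow(res)
# [1] 2
```

In this analogy, rows correspond to documents and columns to docvars, so subset narrows the documents while select narrows the variables that are kept.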

Value

A tokens object containing the subset of documents (and docvars) selected according to the arguments

Examples


library(quanteda)

corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
                 d3 = "b b c e", d4 = "e e f a b"),
               docvars = data.frame(grp = c(1, 1, 2, 3)))
toks1 <- tokens(corp)

# selecting on a docvars condition
tokens_subset(toks1, grp > 1)

# selecting on a supplied logical vector
tokens_subset(toks1, c(TRUE, FALSE, TRUE, FALSE))

# selecting on a tokens object
toks2 <- tokens(c(d1 = "a b b c", d2 = "b b c d"))
toks3 <- tokens(c(d1 = "x y z", d2 = "a b c c d", d3 = "x x x"))
tokens_subset(toks2, subset = toks3)
tokens_subset(toks2, subset = toks3[c(3, 1, 2)])
