
Returns document subsets of a tokens that meet certain conditions, including
direct logical operations on docvars (document-level variables).
tokens_subset
functions identically to
subset.data.frame
, using non-standard evaluation to evaluate
conditions based on the docvars in the tokens.
tokens_subset(x, subset, select, ...)
tokens object to be subsetted
logical expression indicating the documents to keep: missing values are taken as false
expression, indicating the docvars to select from the tokens; or a tokens object, in which case the returned tokens will contain the same documents in the same order as the original tokens, even if these are empty.
not used
tokens object, with a subset of documents (and docvars) selected according to arguments
# NOT RUN {
corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
d3 = "b b c e", d4 = "e e f a b"),
docvars = data.frame(grp = c(1, 1, 2, 3)))
toks1 <- tokens(corp)
# selecting on a docvars condition
tokens_subset(toks1, grp > 1)
# selecting on a supplied vector
tokens_subset(toks1, c(TRUE, FALSE, TRUE, FALSE))
# selecting on a tokens
toks2 <- tokens(c(d1 = "a b b c", d2 = "b b c d"))
toks3 <- tokens(c(d1 = "x y z", d2 = "a b c c d", d3 = "x x x"))
tokens_subset(toks2, subset = toks3)
tokens_subset(toks2, subset = toks3[c(3,1,2)])
# }
Run the code above in your browser using DataLab