Returns a subset of a tCorpus. The selection can be made separately (and simultaneously) for the token data (using the subset argument) and the meta data (using the subset_meta argument). Both arguments are evaluated in the same way as in the subset.data.table function.
There are two flavours. You can either use subset(tc, ...) or tc$subset(...). The difference is that the second approach changes the tCorpus by reference.
In other words, tc$subset() will delete the rows from the tCorpus, instead of creating a new tCorpus.
Modifying the tCorpus by reference is more efficient (which becomes important if the tCorpus is large), but the more classic subset(tc, ...) approach is often easier to follow, because the original tCorpus is left unchanged.
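The difference between the two flavours can be sketched as follows, assuming the corpustools package is installed. The toy texts, the document ids and the token_id cutoff are made up for illustration.

```r
library(corpustools)

# Toy corpus (hypothetical texts and doc ids, for illustration only)
tc <- create_tcorpus(c("the cat sat on the mat", "the dog ate the bone"),
                     doc_id = c("d1", "d2"))
n_before <- tc$n                      # number of tokens before subsetting

# Classic approach: returns a new tCorpus and leaves tc untouched
tc_copy <- subset(tc, token_id <= 3)
n_after_copy <- tc$n                  # still equal to n_before

# By-reference approach: the rows are deleted from tc itself
tc$subset(token_id <= 3)
tc$n                                  # now equal to tc_copy$n
```

If the by-reference behaviour is not wanted but the R6 syntax is, the copy argument listed under Usage can be set to TRUE.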
Subset can also be used to select rows based on token/feature frequencies. This is a common step in corpus analysis, where it often makes sense to ignore very rare and/or very frequent tokens.
To do so, there are several special functions that can be used within a subset call.
The freq_filter() and docfreq_filter() functions can be used to filter terms based on term frequency and document frequency, respectively (see examples).
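A minimal sketch of both filters inside a subset call, assuming the corpustools package is installed. The toy texts and the min thresholds are arbitrary choices for illustration, and copy = TRUE is used so that the original tCorpus is kept.

```r
library(corpustools)

# Toy corpus: "a" occurs 5 times in 2 documents,
# "b" occurs 2 times in 1 document, "c" occurs 3 times in 1 document
tc <- create_tcorpus(c("a a a b b", "a a c c c"))

# Term frequency filter with min = 3: the rare token "b" is dropped
tc_freq <- tc$subset(subset = freq_filter(token, min = 3), copy = TRUE)
tc_freq$tokens

# Document frequency filter with min = 2: tokens found in a single
# document ("b" and "c") are dropped
tc_docfreq <- tc$subset(subset = docfreq_filter(token, min = 2), copy = TRUE)
tc_docfreq$tokens
```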
The subset_meta() method is an alternative to using subset(subset_meta = ...), which is added for consistency with the other _meta methods.
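The two ways of subsetting on meta data can be compared in a short sketch, assuming the corpustools package is installed. The data.frame, its column names (id, text, year) and the year cutoff are hypothetical.

```r
library(corpustools)

# Hypothetical documents with one meta column (year)
d <- data.frame(id = c("d1", "d2", "d3"),
                text = c("first text", "second text", "third text"),
                year = c(2008, 2012, 2016),
                stringsAsFactors = FALSE)
tc <- create_tcorpus(d, text_columns = "text", doc_column = "id")

# Both calls select the documents from 2010 onwards;
# copy = TRUE returns a new tCorpus instead of changing tc
tc_a <- tc$subset(subset_meta = year >= 2010, copy = TRUE)
tc_b <- tc$subset_meta(year >= 2010, copy = TRUE)

tc_a$meta
tc_b$meta
```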
Note that you can also use the tCorpus$feature_subset method if you want to filter out low/high frequency tokens, but do not want to delete the corresponding rows from the tCorpus.
Usage:
## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).
subset(tc, subset = NULL, subset_meta = NULL,
window = NULL)
tc$subset(subset = NULL, subset_meta = NULL,
window = NULL, copy = F)
tc$subset_meta(subset = NULL, copy = F)