- tokens
A data.frame in which rows represent tokens, and columns indicate (at least) the document in which the token occurred (doc_col) and the position of the token in that document or globally (token_id_col).
- doc_col
The name of the column that contains the document ids/names
- token_id_col
The name of the column that contains the positions of tokens. If NULL, the data.frame is assumed to be ordered by token position and to contain no gaps (e.g., from filtered-out tokens).
- token_col
Optionally, the name of the column that contains the token text. This column will then be renamed to "token" in the tcorpus, which is the default name
used by many functions (e.g., querying, printing text).
- sentence_col
Optionally, the name of the column that indicates the sentences in which tokens occurred. This can be necessary if tokens are not local at the document level (see the token_is_local argument),
and sentence information can be used in several tcorpus functions.
- parent_col
Optionally, the name of the column that contains the id of the parent (if a dependency parser was used). If token_is_local = FALSE, then the token_ids will be transformed,
so parent ids need to be changed as well. Default is 'parent', but if this column is not present the parent is ignored.
- meta
Optionally, a data.frame with document meta data. Needs to contain a column with the document ids (with the same name as in tokens).
- meta_cols
Alternatively, if there are document meta columns in the tokens data.table, meta_cols can be used to recognize them. Note that these values have to be unique within documents.
- feature_cols
Optionally, specify which columns to include in the tcorpus. If NULL, all columns are included (except the specified columns for documents, sentences and positions).
- sent_is_local
Sentences in the tCorpus are assumed to be locally unique within documents. If sent_is_local is FALSE, then sentences are transformed to be locally unique. However, it is then assumed that the first sentence in a document is sentence 1, which might not be the case if tokens (input) is a subset.
- token_is_local
Same as sent_is_local, but for token_id. Note: if the data has a parent column, make sure to specify parent_col, so that the parent ids are also transformed.
- ...
Not used.
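
To illustrate what "locally unique" means for token_is_local and why parent ids have to follow along, here is a minimal base-R sketch. It does not call the package and is not its actual implementation; the example data and the column names local_id and local_parent are hypothetical and only serve the illustration. It assumes global token ids are unique across the whole data.frame.

```r
# Hypothetical token table with GLOBAL token ids (unique across documents)
# and a parent column that refers to those global ids (NA = root).
tokens <- data.frame(
  doc_id   = c("a", "a", "a", "b", "b"),
  token_id = c(1, 2, 3, 4, 5),
  parent   = c(2, NA, 2, 5, NA),
  token    = c("Cats", "sleep", "often", "Dogs", "bark"),
  stringsAsFactors = FALSE
)

# Local position: the rank of each token within its own document,
# so every document starts counting at 1.
tokens$local_id <- ave(
  tokens$token_id, tokens$doc_id,
  FUN = function(x) match(x, sort(x))
)

# Parent ids must be mapped through the same transformation:
# look up the row of the global parent id and take its local id.
tokens$local_parent <- tokens$local_id[match(tokens$parent, tokens$token_id)]

print(tokens)
```

In this sketch the second document's tokens (global ids 4 and 5) become local positions 1 and 2, and the parent of "Dogs" changes from 5 to 2 accordingly. If the parent column were left untouched, it would point at positions that no longer exist within the document.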