Description

This function builds an object of the class types.

Usage
types(
x,
re_drop_line = NULL,
line_glue = NULL,
re_cut_area = NULL,
re_token_splitter = re("[^_\\p{L}\\p{N}\\p{M}'-]+"),
re_token_extractor = re("[_\\p{L}\\p{N}\\p{M}'-]+"),
re_drop_token = NULL,
re_token_transf_in = NULL,
token_transf_out = NULL,
token_to_lower = TRUE,
perl = TRUE,
blocksize = 300,
verbose = FALSE,
show_dots = FALSE,
dot_blocksize = 10,
file_encoding = "UTF-8",
ngram_size = NULL,
ngram_sep = "_",
ngram_n_open = 0,
ngram_open = "[]",
as_text = FALSE
)

Value

An object of the class types, which is based on a character vector.
It has additional attributes and methods such as:
base R methods print(), as_data_frame(), sort() and
summary() (which returns the number of items and the number of unique items),
the n_types() getter and the explore() method, and
subsetting methods such as keep_types(), keep_pos(), etc., including []
subsetting (see brackets).
An object of class types can be merged with another by means of types_merge(),
written to file with write_types() and read from file with read_types().
Arguments

x: Either the actual text of the corpus
(if as_text is TRUE) or a character vector with the filenames
of the corpus files (if as_text is FALSE).
If as_text is TRUE and the length of the vector x
is greater than one, then each item in x is treated as a separate
line (or a separate series of lines) in the corpus text. Within each
item of x, the character "\n" is also treated as
a line separator.
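For instance, the following two calls describe the same two-line corpus and yield the same types (a minimal sketch; the sketches below assume library(mclm) is loaded):

library(mclm)
# each item of x is a line; "\n" inside an item also separates lines
types(c("first line", "second line"), as_text = TRUE)
types("first line\nsecond line", as_text = TRUE)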
re_drop_line: NULL or character vector. If NULL, it is ignored.
Otherwise, a character vector (assumed to be of length 1)
containing a regular expression. Lines in x
that contain a match for re_drop_line are
treated as not belonging to the corpus and are excluded from the results.
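As an illustration (the pattern "<header>" is a made-up example):

txt <- "<header> my toy corpus\nthe actual first line\nthe actual second line"
# lines containing a match for "<header>" are excluded from the corpus
types(txt, re_drop_line = "<header>", as_text = TRUE)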
line_glue: NULL or character vector. If NULL, it is ignored.
Otherwise, all lines in a corpus file (or in x, if
as_text is TRUE) are glued together into one
character vector of length 1, with the string line_glue
pasted in between consecutive lines.
The value of line_glue can also be equal to the empty string "".
The 'line glue' operation is conducted immediately after the 'drop line' operation.
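For example, gluing all items of x into a single line (a minimal sketch):

txt <- c("the first half of a sentence,", "and the second half.")
# both items are pasted together, separated by a single space
types(txt, line_glue = " ", as_text = TRUE)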
re_cut_area: NULL or character vector. If NULL, it is ignored.
Otherwise, all matches for re_cut_area in a corpus file (or in x,
if as_text is TRUE) are 'cut out' of the text prior
to the identification of the tokens in the text (and are therefore
not taken into account when identifying the tokens).
The 'cut area' operation is conducted immediately after the 'line glue' operation.
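A minimal sketch that cuts out bracketed editorial notes before tokenization (the pattern is only an example):

txt <- "the suspect [sic] was apprehended"
# matches for the regex are removed, so "sic" never becomes a token
types(txt, re_cut_area = "\\[[^]]*\\]", as_text = TRUE)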
re_token_splitter: Regular expression or NULL. If not NULL,
a regular expression that identifies the locations where lines in the corpus
files are split into tokens. (See Details.)
The 'token identification' operation is conducted immediately after the 'cut area' operation.
re_token_extractor: Regular expression that identifies the locations of the
actual tokens. This argument is only used if re_token_splitter is NULL.
(See Details.)
The 'token identification' operation is conducted immediately after the 'cut area' operation.
re_drop_token: Regular expression or NULL. If NULL, it is ignored.
Otherwise, it identifies tokens that are to
be excluded from the results. Any token that contains a match for
re_drop_token is removed from the results.
The 'drop token' operation is conducted immediately after the 'token identification' operation.
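For instance, dropping all tokens that contain a digit (an illustrative pattern):

txt <- "room 101 has 2 beds"
# tokens "101" and "2" contain a match for "\\p{N}" and are dropped
types(txt, re_drop_token = "\\p{N}", as_text = TRUE)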
re_token_transf_in: Regular expression that identifies areas in the
tokens that are to be transformed. This argument works together with the argument
token_transf_out.
If neither re_token_transf_in nor token_transf_out is
NULL (or NA), then all matches, in the tokens, for the
regular expression re_token_transf_in are replaced with
the replacement string token_transf_out.
The 'token transformation' operation is conducted immediately after the 'drop token' operation.
token_transf_out: Replacement string. This argument works together with
re_token_transf_in and is ignored if re_token_transf_in
is NULL or NA.
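For example, stripping a possessive 's from tokens (an illustrative transformation):

txt <- "the student's book"
# "student's" is transformed into "student" before lowercasing
types(txt,
      re_token_transf_in = "'s$",
      token_transf_out = "",
      as_text = TRUE)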
token_to_lower: Logical. Whether tokens must be converted to lowercase
before returning the result.
The 'token to lower' operation is conducted immediately after the 'token transformation' operation.
perl: Logical. Whether the PCRE regular expression flavor is used
in the arguments that contain regular expressions.
blocksize: Number that indicates how many corpus files are read into memory
'at each individual step' during the steps in the procedure. Normally the
default value of 300 should not be changed, but when one works with
exceptionally small corpus files, it may be worthwhile to use a higher
number, and when one works with exceptionally large corpus files, it may be
worthwhile to use a lower number.
verbose: Logical. If TRUE, messages are printed to the console to
indicate progress.
show_dots, dot_blocksize: If show_dots is TRUE, dots are printed to the
console to indicate progress.
file_encoding: File encoding that is assumed in the corpus files.
ngram_size: Argument in support of ngrams/skipgrams (see also max_skip).
If one wants to identify individual tokens, the value of ngram_size
should be NULL or 1. If one wants to retrieve
token ngrams/skipgrams, ngram_size should be an integer indicating
the size of the ngrams/skipgrams, e.g. 2 for bigrams, or 3 for
trigrams, etc.
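For instance, retrieving bigrams rather than individual tokens (a minimal sketch):

txt <- "it is widely accepted that"
# yields the types "it_is", "is_widely", "widely_accepted" and "accepted_that"
types(txt, ngram_size = 2, as_text = TRUE)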
ngram_sep: Character vector of length 1 containing the string that is used to separate/link tokens in the representation of ngrams/skipgrams in the output of this function.
ngram_n_open: If ngram_size is 2 or higher, and moreover
ngram_n_open is a number higher than 0, then
ngrams with 'open slots' in them are retrieved. These
ngrams with 'open slots' are generalizations of fully lexically specific
ngrams (with the generalization being that one or more of the items
in the ngram are replaced by a notation that stands for 'any arbitrary token').
For instance, if ngram_size is 4 and ngram_n_open is
1, and if moreover the input contains a
4-gram "it_is_widely_accepted", then the output will contain
all modifications of "it_is_widely_accepted" in which one (since
ngram_n_open is 1) of the items in this n-gram is
replaced by an open slot. The first and the last item inside
an ngram are never turned into an open slot; only the items in between
are candidates for being turned into open slots. Therefore, in the
example, the output will contain "it_[]_widely_accepted" and
"it_is_[]_accepted".
As a second example, if ngram_size is 5 and
ngram_n_open is 2, and if moreover the input contains a
5-gram "it_is_widely_accepted_that", then the output will contain
"it_[]_[]_accepted_that", "it_[]_widely_[]_that", and
"it_is_[]_[]_that".
ngram_open: Character string used to represent open slots in ngrams in the output of this function.
as_text: Logical.
Whether x is to be interpreted as a character vector containing the
actual contents of the corpus (if as_text is TRUE)
or as a character vector containing the names of the corpus files
(if as_text is FALSE).
If as_text is TRUE, then the arguments
blocksize, verbose, show_dots, dot_blocksize,
and file_encoding are ignored.
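A minimal sketch of the file-based usage, writing a small temporary corpus file first:

corpus_file <- file.path(tempdir(), "toy_corpus.txt")
writeLines("A tiny corpus stored in a file.", corpus_file)
# x is now a filename, so as_text is left at its default FALSE
types(corpus_file)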
Details

The actual token identification is either based on the re_token_splitter
argument, a regular expression that identifies the areas between the tokens,
or on re_token_extractor, a regular expression that identifies the areas
that are the tokens.
The first mechanism is the default mechanism: the argument re_token_extractor
is only used if re_token_splitter is NULL.
Currently the implementation of
re_token_extractor is a lot less time-efficient than that of re_token_splitter.
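Under the default patterns both mechanisms single out the same tokens, as this sketch illustrates:

txt <- "one, two; three"
types(txt, as_text = TRUE)   # default: split on re_token_splitter matches
types(txt,
      re_token_splitter = NULL,  # fall back on re_token_extractor
      as_text = TRUE)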
See Also

as_types()
Examples

library(mclm)

toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."
(tps <- types(toy_corpus, as_text = TRUE))
print(tps)
as.data.frame(tps)
as_tibble(tps)
sort(tps)
sort(tps, decreasing = TRUE)