This function must be run before annotating text with the coreNLP backend. It sets the properties for the coreNLP engine and loads the model files using the rJava interface. See Details for more information about the speed codes.
init_coreNLP(language, speed = 2, lib_location = NULL, mem = "12g",
verbose = FALSE)
a character vector describing the desired language; should be one of: "ar", "de", "en", "es", "fr", or "zh".
integer code. Sets which annotators should be loaded, based on how long they take to load and run. Speed 0 is the fastest and speed 3 is the slowest. See Details for a full description of the levels.
a string giving the location of the CoreNLP java files. This should point to a directory which contains, for example, the file "stanford-corenlp-*.jar", where "*" is the version number. If missing, the function will try to find the library in the environment variable CORENLP_HOME, and otherwise will fail. (Java model only)
a string giving the amount of memory to be assigned to the rJava engine. For example, "6g" assigns 6 gigabytes of memory. At least 2 gigabytes are recommended for running the CoreNLP package. On a 32-bit machine, where this is not possible, setting "1800m" may also work. This option only has an effect the first time init_backend is called for the coreNLP backend, and it will have no effect if the Java engine has already been started by another process.
boolean. Should messages from the pipeline be written to the console or suppressed?
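As an illustration of the mem argument above, the following sketch checks that a memory string has the form the rJava engine expects before initialization. valid_mem is a hypothetical helper written for this example; it is not part of the package.

```r
# Hypothetical helper (not part of the package): check that a memory
# string has the "<number>g" or "<number>m" form expected by `mem`.
valid_mem <- function(x) grepl("^[0-9]+[mg]$", x)

valid_mem("12g")    # TRUE: 12 gigabytes
valid_mem("1800m")  # TRUE: 1800 megabytes, suitable for 32-bit machines
valid_mem("12 GB")  # FALSE: spaces and "GB" are not accepted
```

A validated string could then be passed through in a call such as init_coreNLP("en", speed = 2, mem = "6g").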
Currently available speed codes are integers from 0 to 3. Setting the speed above 2 has no additional effect on the German and Spanish models, and setting it above 1 has no effect on the French model. The available speed codes are:
"0" runs just the tokenizer, sentence splitter, and part of speech tagger. Extremely fast.
"1" includes the dependency parsers and, for English, the sentiment tagger. Often 20-30x slower than speed 0.
"2" adds the named entity annotator to the parser and sentiment tagger (when available). For English models, it also includes the mentions and natlog annotators. Usually no more than twice as slow as speed 1.
"3" add the coreference resolution annotator to the speed 2 annotators. Depending on the corpus, this takes about 2-4x longer than the speed 2 annotators
We suggest starting at speed 2, downgrading to 0 if your corpus is particularly large, or upgrading to 3 if you can tolerate the slowdown. If your text is not formal written text (e.g., tweets or text messages), the speed 0 annotators should still work well, but anything beyond that may be difficult. Semi-formal text such as e-mails or transcribed speech is generally okay to run at all of the levels.
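The guidance above can be sketched as a small decision rule. suggest_speed and its document-count thresholds are illustrative assumptions for this example only; they are not part of the package.

```r
# Hypothetical helper: choose a starting speed code from corpus size and
# formality, following the advice above (thresholds are made up).
suggest_speed <- function(n_docs, formal = TRUE) {
  if (!formal) return(0L)        # informal text: tokenizer/POS level only
  if (n_docs > 50000) return(0L) # particularly large corpus: fastest code
  if (n_docs > 5000) return(2L)  # the suggested starting point
  3L                             # small corpora can afford coreference
}

suggest_speed(100)                  # 3
suggest_speed(10000)                # 2
suggest_speed(200, formal = FALSE)  # 0
```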
# NOT RUN {
init_coreNLP("en")
# }