koRpus (version 0.05-6)

read.tagged: Import already tagged texts

Description

This function can be used on text files containing already tagged text material, e.g. the results of TreeTagger[1].
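As a rough illustration of the kind of input meant here: TreeTagger typically writes one token per line, with token, part-of-speech tag, and lemma separated by tabs. The sketch below uses only base R (not koRpus) to create and inspect such a file. The sample sentence, the temporary file, and the column names are made up for illustration and are not part of the package.

```r
# Hypothetical sample of TreeTagger-style output: token <TAB> tag <TAB> lemma
sample.lines <- c(
  "This\tDT\tthis",
  "is\tVBZ\tbe",
  "a\tDT\ta",
  "test\tNN\ttest",
  ".\tSENT\t."
)

# Write the sample to a temporary file, standing in for a real tagged text
tagged.file <- tempfile(fileext = ".txt")
writeLines(sample.lines, tagged.file)

# Read the three tab-separated columns; quote = "" keeps quotation marks
# that occur as tokens from being treated as string delimiters
tagged <- read.delim(tagged.file, header = FALSE, quote = "",
  col.names = c("token", "tag", "lemma"), stringsAsFactors = FALSE)
print(tagged)
```

A file of this shape is what you would pass as the file argument; read.tagged then takes care of building the tagged object.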

Usage

read.tagged(file, lang = "kRp.env", encoding = NULL,
  tagger = "TreeTagger", apply.sentc.end = TRUE, sentc.end = c(".", "!",
  "?", ";", ":"), stopwords = NULL, stemmer = NULL, rm.sgml = TRUE)

Arguments

file
Either a connection or a character vector giving a valid path to a file that contains the previously analyzed text.
lang
A character string naming the language of the analyzed corpus. See kRp.POS.tags for all supported languages. If set to "kRp.env" this is got from
encoding
A character string defining the character encoding of the input file, like "Latin1" or "UTF-8". If NULL, the encoding will either be taken from a preset (if defined in TT.options), or fall b
tagger
The software which was used to tokenize and tag the text. Currently, TreeTagger is the only supported tagger.
apply.sentc.end
Logical, whether the tokens defined in sentc.end should be searched and set to a sentence ending tag. You could call this a compatibility mode to make sure you get the results you would get if you called
sentc.end
A character vector with tokens indicating a sentence ending. This adds to given results, it doesn't replace them.
stopwords
A character vector to be used for stopword detection. Comparison is done in lower case. You can also simply set stopwords=tm::stopwords("en") to use the English stopwords provided by the tm package.
stemmer
A function or method to perform stemming. For instance, you can set stemmer=Snowball::SnowballStemmer if you have the Snowball package installed (or SnowballC::wordStem). As of now, you cannot provide fur
rm.sgml
Logical, whether SGML tags should be ignored and removed from output
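To make the descriptions of sentc.end and stopwords more concrete, here is a minimal base R sketch of the two ideas: tokens listed in sentc.end are additionally marked as sentence endings without touching endings that are already tagged, and stopwords are matched after lowercasing the tokens. This is not the package's internal code. The token and tag vectors are invented, and using "SENT" as the sentence ending tag is an assumption based on the English TreeTagger tag set.

```r
# Invented example data: tokens and the tags a tagger might have assigned
tokens <- c("Stop", "the", "press", ";", "it", "is", "late", ".")
tags   <- c("VB", "DT", "NN", ":", "PP", "VBZ", "JJ", "SENT")

# Default value of sentc.end from the usage section
sentc.end <- c(".", "!", "?", ";", ":")

# Mark every token listed in sentc.end as a sentence ending; "SENT" is an
# assumed tag name. Existing sentence ending tags are kept, others are added.
is.sentc.end  <- tokens %in% sentc.end
tags.adjusted <- ifelse(is.sentc.end, "SENT", tags)

# Stopword detection: compare the lowercased tokens with the stopword list
stopwords   <- c("the", "it", "is")
is.stopword <- tolower(tokens) %in% stopwords

print(data.frame(token = tokens, tag = tags.adjusted, stop = is.stopword))
```

In this sketch the semicolon gains a sentence ending tag while the final full stop keeps the one it already had, which mirrors the "adds to given results" behaviour described for sentc.end.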

Value

  • An object of class kRp.tagged-class. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.

Details

Note that the value of lang must match a valid language supported by kRp.POS.tags. It will also get stored in the resulting object and might be used by other functions at a later point.

References

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44--49.

[1] http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

See Also

treetag, freq.analysis, get.kRp.env, kRp.tagged-class

Examples

tagged.results <- read.tagged("~/my.data/tagged_speech.txt", lang="en")
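A slightly fuller call might combine the optional arguments described above. The sketch below reuses the example path and the tm and SnowballC calls mentioned in the argument descriptions. The guard around the call is an addition for illustration, so that the snippet does nothing if the packages or the file are missing. Argument handling may differ in other koRpus releases.

```r
# Path taken from the example above; replace with your own tagged file
tagged.file <- path.expand("~/my.data/tagged_speech.txt")

# Only attempt the import if all packages and the file are available
needed   <- c("koRpus", "tm", "SnowballC")
have.all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))

if (have.all && file.exists(tagged.file)) {
  tagged.results <- koRpus::read.tagged(
    tagged.file,
    lang = "en",
    stopwords = tm::stopwords("en"),   # English stopwords from the tm package
    stemmer = SnowballC::wordStem      # stemming function from SnowballC
  )
}
```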
