koRpus (version 0.13-8)

types: Get types and tokens of a given text

Description

These methods return a character vector containing all types or all tokens of a given text, where the text can be a character vector, a previously tokenized/tagged koRpus object, or an object of class kRp.TTR.

Usage

types(txt, ...)

tokens(txt, ...)

# S4 method for kRp.TTR
types(txt, stats = FALSE)

# S4 method for kRp.TTR
tokens(txt)

# S4 method for kRp.text
types(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  stats = FALSE
)

# S4 method for kRp.text
tokens(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c()
)

# S4 method for character
types(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  stats = FALSE,
  lang = NULL
)

# S4 method for character
tokens(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  lang = NULL
)

Arguments

txt

An object of either class kRp.text or kRp.TTR, or a character vector.

...

Only used for the method generic.

stats

Logical, whether statistics on the length in characters and frequency of types in the text should also be returned.

case.sens

Logical, whether types should be counted case-sensitively. This option is available for tagged text and character input only.

lemmatize

Logical, whether analysis should be carried out on the lemmatized tokens rather than all running word forms. This option is available for tagged text and character input only.

corp.rm.class

A character vector with word classes which should be dropped. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, tags=c("punct","sentc"), list.classes=TRUE) to be used. This option is available for tagged text and character input only.

corp.rm.tag

A character vector with POS tags which should be dropped. This option is available for tagged text and character input only.

lang

Set the language of a text; see the force.lang option of lex.div. This option is available for character input only.

Value

A character vector. For types with stats=TRUE, a data.frame containing all types, their length in characters, and their frequency. The types result is always sorted by frequency, with more frequent types coming first.
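
To make the distinction between tokens and types concrete, the following base-R sketch approximates what a frequency-sorted type table looks like. It is an illustration only and not koRpus's implementation: the regular-expression tokenizer, the sample sentence, and the column names type, chars and freq are assumptions made for this example.

```r
# Illustrative only: a base-R approximation, not the koRpus implementation.
sample_txt <- "The cat saw the other cat and the dog."

# Naive tokenization: split on runs of non-letters and drop empty strings.
all_tokens <- unlist(strsplit(sample_txt, "[^[:alpha:]]+"))
all_tokens <- all_tokens[nzchar(all_tokens)]

# Fold case before counting, comparable in spirit to case.sens = FALSE.
lowered <- tolower(all_tokens)
freq <- table(lowered)

# One row per type, with its length in characters and its frequency.
# The column names are hypothetical and chosen for readability.
type_stats <- data.frame(
  type = names(freq),
  chars = nchar(names(freq)),
  freq = as.integer(freq),
  stringsAsFactors = FALSE
)

# Most frequent types first, as described for the types result.
type_stats <- type_stats[order(type_stats$freq, decreasing = TRUE), ]
rownames(type_stats) <- NULL

print(all_tokens)
print(type_stats)
```

Here every running word counts as a token, while each distinct word form (after case folding) counts once as a type.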

See Also

kRp.POS.tags, kRp.text, kRp.TTR, lex.div

Examples

# NOT RUN {
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )

  types(tokenized.obj)
  tokens(tokenized.obj)
} else {}
# }
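
The arguments listed under Usage can be combined with the same kind of tokenized object. The sketch below assumes that koRpus and koRpus.lang.en are installed; the short sample string and the variable names are made up for illustration, and no particular output is implied.

```r
# A hedged sketch; it only runs when the English language package is available.
if (require("koRpus.lang.en", quietly = TRUE)) {
  # Hypothetical sample text, tokenized from an object instead of a file.
  short_obj <- tokenize(
    txt = "The cat saw the other cat. The dog saw nothing.",
    format = "obj",
    lang = "en"
  )

  # Request the data.frame with length and frequency for each type.
  type_table <- types(short_obj, stats = TRUE)

  # Keep upper and lower case forms apart when listing tokens.
  cased_tokens <- tokens(short_obj, case.sens = TRUE)

  print(type_table)
  print(cased_tokens)
} else {}
```

With the default corp.rm.class = "nonpunct", punctuation should be excluded from both results.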
