lex.div: Analyze lexical diversity

Description

This function analyzes the lexical diversity/complexity of a text corpus.

Usage

lex.div(txt, segment = 100, factor.size = 0.72, min.tokens = 9,
  rand.sample = 42, window = 100, case.sens = FALSE, lemmatize = FALSE,
  detailed = FALSE, measure = c("TTR", "MSTTR", "MATTR", "C", "R", "CTTR",
  "U", "S", "K", "Maas", "HD-D", "MTLD", "MTLD-MA"), char = c("TTR", "MATTR",
  "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD", "MTLD-MA"),
  char.steps = 5, log.base = 10, force.lang = NULL, keep.tokens = FALSE,
  corp.rm.class = "nonpunct", corp.rm.tag = c(), quiet = FALSE)

Arguments

txt

An object of either class kRp.tagged-class, kRp.txt.freq-class,

segment

An integer value for MSTTR, defining how many tokens should form one segment.

factor.size

A real number between 0 and 1, defining the MTLD factor size.

min.tokens

An integer value defining the minimum number of tokens a full factor must have to be considered for the MTLD-MA result.

rand.sample

An integer value, how many tokens should be assumed to be drawn for calculating HD-D.

window

An integer value for MATTR, defining how many tokens the moving window should include.

case.sens

Logical, whether types should be counted case-sensitively.

lemmatize

Logical, whether analysis should be carried out on the lemmatized tokens rather than all running word forms.

detailed

Logical, whether full details of the analysis should be calculated. This currently affects MTLD and MTLD-MA, defining if all factors should be kept in the object. This slows down calculations considerably.

measure

A character vector defining the measures which should be calculated. Valid elements are "TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD" and "MTLD-MA".

char

A character vector defining the measures for which data for plotting characteristic curves should be calculated. Valid elements are "TTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD" and "MTLD-MA".

char.steps

An integer value defining the stepwidth for characteristic curves, in tokens.

log.base

A numeric value defining the base of the logarithm. See log for details.

force.lang

A character string defining the language to be assumed for the text, by force. See details.

keep.tokens

Logical. If TRUE all raw tokens and types will be preserved in the resulting object, in a slot called tt. For the types, also their frequency in the analyzed text will be listed.

corp.rm.class

A character vector with word classes which should be dropped. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, c("punct","sentc"), list.classes=TRUE) to be used.

corp.rm.tag

A character vector with POS tags which should be dropped.

quiet

Logical. If FALSE, short status messages will be shown. TRUE will also suppress all potential warnings regarding the validation status of measures.

Value

An object of class kRp.TTR-class.

Details

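lex.div calculates a variety of proposed indices for lexical diversity. In the following formulae, $N$ refers to the total number of tokens, and $V$ to the number of types; logarithms use the base set by log.base:

"TTR": The ordinary type-token ratio: $TTR = \frac{V}{N}$

"MSTTR": The mean segmental TTR: the mean of the TTRs of consecutive segments of segment tokens each.

"MATTR": The moving-average TTR (Covington & McFall, 2010): the mean of the TTRs of a window of window tokens that is moved through the text.

"C": Herdan's C: $C = \frac{\log V}{\log N}$. Wrapper function: C.ld

"R": Guiraud's root TTR: $R = \frac{V}{\sqrt{N}}$. Wrapper function: R.ld

"CTTR": Carroll's corrected TTR: $CTTR = \frac{V}{\sqrt{2N}}$. Wrapper function: CTTR

"U": Dugast's Uber index: $U = \frac{(\log N)^2}{\log N - \log V}$. Wrapper function: U.ld

"S": Summer's index: $S = \frac{\log \log V}{\log \log N}$. Wrapper function: S.ld

"K": Yule's K (Tweedie & Baayen, 1998): $K = 10^4 \times \frac{\left(\sum_{X} f_X X^2\right) - N}{N^2}$, where $f_X$ is the number of types occurring exactly $X$ times.

"Maas": Maas' index (Maas, 1972): $a^2 = \frac{\log N - \log V}{(\log N)^2}$

"HD-D": The hypergeometric distribution diversity index (McCarthy & Jarvis, 2007), based on the probability of drawing each type at least once in a random sample of rand.sample tokens.

"MTLD": The Measure of Textual Lexical Diversity (McCarthy & Jarvis, 2010): the mean length of sequential token strings (factors) that maintain a TTR above factor.size.

"MTLD-MA": The moving-average variant of MTLD; only full factors with at least min.tokens tokens are considered.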
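As a rough illustration of the closed-form indices among these measures, the following base R sketch computes them directly from a character vector of tokens. It uses the standard textbook definitions, not code taken from koRpus; the helper name simple_lex_div and the sample tokens are hypothetical, and the result for Maas is reported as $a^2$. Results from lex.div itself may differ, for example because of tagging and the removal of punctuation.

```r
# Illustrative helper, NOT part of koRpus: closed-form indices from raw tokens.
simple_lex_div <- function(tokens, log.base = 10, case.sens = FALSE) {
  if (!case.sens) tokens <- tolower(tokens)   # mirror the case.sens idea
  N <- length(tokens)                         # total number of tokens
  freq <- table(tokens)                       # frequency of each type
  V <- length(freq)                           # number of types
  lgN <- log(N, base = log.base)
  lgV <- log(V, base = log.base)
  # Frequency spectrum for Yule's K: fX types occur exactly X times
  spectrum <- table(as.integer(freq))
  X <- as.integer(names(spectrum))
  fX <- as.integer(spectrum)
  c(
    TTR     = V / N,
    C       = lgV / lgN,
    R       = V / sqrt(N),
    CTTR    = V / sqrt(2 * N),
    U       = lgN^2 / (lgN - lgV),
    S       = log(lgV, base = log.base) / log(lgN, base = log.base),
    K       = 1e4 * (sum(fX * X^2) - N) / N^2,
    Maas.a2 = (lgN - lgV) / lgN^2
  )
}

# Hypothetical sample: 12 tokens, 8 types once case is folded
tokens <- c("the", "cat", "sat", "on", "the", "mat",
            "and", "the", "dog", "sat", "too", "The")
res <- simple_lex_div(tokens)
print(round(res, 3))
```

On very short inputs such as this one, indices that take the logarithm of a logarithm (S) are not meaningful; the sample only serves to show the arithmetic.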

By default, if the text still has to be tagged, the language definition is queried by calling get.kRp.env(lang=TRUE) internally. If txt has already been tagged, the language definition of that tagged object is read and used instead. Set force.lang=get.kRp.env(lang=TRUE), or any other valid value, only if you want to forcibly override this default behaviour. See kRp.POS.tags for all supported languages.
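To make the roles of the segment and window arguments more concrete, here is a minimal base R sketch of MSTTR and MATTR. The helper names ttr, msttr_sketch and mattr_sketch are hypothetical and not part of koRpus, and the handling of leftover tokens (dropped when they do not fill a whole segment) is an assumption rather than a statement about the package internals.

```r
# Illustrative helpers, NOT part of koRpus.
ttr <- function(tokens) length(unique(tokens)) / length(tokens)

# MSTTR: mean TTR over consecutive, non-overlapping segments.
# Assumption: tokens that do not fill a complete segment are dropped.
msttr_sketch <- function(tokens, segment = 100) {
  n.seg <- length(tokens) %/% segment
  if (n.seg < 1) return(NA_real_)
  starts <- (seq_len(n.seg) - 1) * segment + 1
  mean(vapply(starts,
              function(s) ttr(tokens[s:(s + segment - 1)]),
              numeric(1)))
}

# MATTR: mean TTR over every window of `window` consecutive tokens,
# moving forward one token at a time.
mattr_sketch <- function(tokens, window = 100) {
  n.win <- length(tokens) - window + 1
  if (n.win < 1) return(NA_real_)
  mean(vapply(seq_len(n.win),
              function(s) ttr(tokens[s:(s + window - 1)]),
              numeric(1)))
}

# Hypothetical toy input with a small segment/window size
tokens <- c("a", "b", "a", "c", "b", "d", "a")
msttr.res <- msttr_sketch(tokens, segment = 3)
mattr.res <- mattr_sketch(tokens, window = 3)
```

With the default value of 100 for both arguments, texts shorter than 100 tokens yield no complete segment or window, which is why both sketches return NA in that case.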

References

Covington, M.A. & McFall, J.D. (2010). Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94--100.

Maas, H.-D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73--96.

McCarthy, P.M. & Jarvis, S. (2007). vocd: A theoretical and empirical evaluation. Language Testing, 24(4), 459--488.

McCarthy, P.M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381--392.

Tweedie, F.J. & Baayen, R.H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 32(5), 323--352.

Examples

# tagged.text is assumed to be an already tagged text object (see the txt argument)
lex.div(tagged.text)
