lex.div(txt, ...)
"lex.div"(txt, segment = 100, factor.size = 0.72, min.tokens = 9, rand.sample = 42, window = 100, case.sens = FALSE, lemmatize = FALSE, detailed = FALSE, measure = c("TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD", "MTLD-MA"), char = c("TTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD", "MTLD-MA"), char.steps = 5, log.base = 10, force.lang = NULL, keep.tokens = FALSE, corp.rm.class = "nonpunct", corp.rm.tag = c(), quiet = FALSE)
txt
: An object of class kRp.tagged-class, kRp.txt.freq-class, kRp.analysis-class or kRp.txt.trans-class, containing the tagged text to be analyzed.
log.base
: The base of the logarithms used in the calculations. See log for details.
keep.tokens
: If TRUE, all raw tokens and types will be preserved in the resulting object, in a slot called tt. For the types, their frequency in the analyzed text will also be listed.
corp.rm.class
: The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, c("punct","sentc"), list.classes=TRUE) to be used.
quiet
: If FALSE, short status messages will be shown. TRUE will also suppress all potential warnings regarding the validation status of measures.

Returns an object of class kRp.TTR-class.
lex.div calculates a variety of proposed indices for lexical diversity. In the following formulae, $N$ refers to the total number of tokens, and $V$ to the number of types:
"TTR"
: Wrapper function: TTR
"MSTTR"
: Wrapper function: MSTTR
"MATTR"
: Wrapper function: MATTR
"C"
: Wrapper function: C.ld
"R"
: Wrapper function: R.ld
"CTTR"
: Wrapper function: CTTR
"U"
: Wrapper function: U.ld
"S"
: Wrapper function: S.ld
"K"
: Wrapper function: K.ld
"Maas"
: Earlier versions (koRpus < 0.04-12) reported $a^2$, and not $a$. The measure was derived from a formula by Müller (1969, as cited in Maas, 1972). $\lg{}_{e}{V_0}$ is equivalent to $\lg{V_0}$, only with $e$ as the base for the logarithms. Also calculated are $a$, $\lg{V_0}$ (both not the same as before) and $V'$ as measures of relative vocabulary growth while the text progresses. To calculate these measures, the first half of the text and the full text will be examined (see Maas, 1972, p. 67 ff. for details). Wrapper function: maas
"MTLD"
: Wrapper function: MTLD
"MTLD-MA"
: Factors below the min.tokens threshold are dropped. Wrapper function: MTLD
"HD-D"
: Wrapper function: HDD
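To make the roles of $N$ and $V$ concrete, here is a minimal base-R sketch of some of the simpler indices, using their commonly cited textbook definitions. It is not the koRpus implementation: the helper names `simple.lex.div`, `mattr.sketch` and `msttr.sketch` are hypothetical, the input is assumed to be a plain character vector of tokens rather than a tagged object, and details such as the handling of incomplete segments are assumptions.

```r
# Illustrative sketch only; NOT the koRpus code. All function names are hypothetical.
# `tokens` is assumed to be a plain character vector of word tokens.
simple.lex.div <- function(tokens, log.base = 10, case.sens = FALSE) {
  if (!case.sens) tokens <- tolower(tokens)
  N <- length(tokens)              # total number of tokens
  freq <- table(tokens)            # frequency of each type
  V <- length(freq)                # number of types
  lgN <- log(N, base = log.base)
  lgV <- log(V, base = log.base)
  c(
    TTR  = V / N,                          # type-token ratio
    C    = lgV / lgN,                      # Herdan's C
    R    = V / sqrt(N),                    # Guiraud's root TTR
    CTTR = V / sqrt(2 * N),                # Carroll's corrected TTR
    U    = lgN^2 / (lgN - lgV),            # Uber index (Inf if V == N)
    K    = 1e4 * (sum(freq^2) - N) / N^2,  # Yule's K
    a2   = (lgN - lgV) / lgN^2             # Maas' a^2
  )
}

# Moving-average TTR: mean TTR over every window of `window` consecutive tokens.
mattr.sketch <- function(tokens, window = 100) {
  N <- length(tokens)
  stopifnot(N >= window)
  starts <- seq_len(N - window + 1)
  mean(vapply(starts, function(i) {
    length(unique(tokens[i:(i + window - 1)])) / window
  }, numeric(1)))
}

# Mean segmental TTR: mean TTR over consecutive, non-overlapping segments.
# Assumption: trailing tokens that do not fill a whole segment are ignored.
msttr.sketch <- function(tokens, segment = 100) {
  n.seg <- length(tokens) %/% segment
  stopifnot(n.seg >= 1)
  mean(vapply(seq_len(n.seg), function(s) {
    idx <- ((s - 1) * segment + 1):(s * segment)
    length(unique(tokens[idx])) / segment
  }, numeric(1)))
}

tokens <- strsplit("the cat sat on the mat and the dog sat on it", " ")[[1]]
indices <- simple.lex.div(tokens)
print(round(indices, 3))
print(mattr.sketch(tokens, window = 4))
print(msttr.sketch(tokens, segment = 4))
```

The toy sentence has 12 tokens and 8 types, so its TTR is 8/12. Real texts need far longer windows and segments; the small values here only keep the example readable.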
By default, if the text still has to be tagged, the language definition is queried by calling get.kRp.env(lang=TRUE) internally. If txt has already been tagged, the language definition of that tagged object is read and used instead. Set force.lang=get.kRp.env(lang=TRUE), or any other valid value, only if you want to override this default behaviour. See kRp.POS.tags for all supported languages.
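The difference between the default language lookup and an explicit override might look as follows. This is a sketch under assumptions: `tagged.text` is a hypothetical, already tagged object (the placeholder name used in the Examples), "en" is assumed to be a valid language for your installation, and the guard simply skips the calls when koRpus or the object is not available.

```r
# Sketch only: `tagged.text` is a hypothetical, already tagged object and
# "en" is an assumed language value; adjust both to your own setup.
ld.result <- NULL
if (requireNamespace("koRpus", quietly = TRUE) && exists("tagged.text")) {
  # default: the language stored in the tagged object is used
  ld.result <- koRpus::lex.div(tagged.text)
  # explicit override of the language definition
  ld.result <- koRpus::lex.div(tagged.text, force.lang = "en")
}
```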
Maas, H.-D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73--96.
McCarthy, P.M. & Jarvis, S. (2007). vocd: A theoretical and empirical evaluation. Language Testing, 24(4), 459--488.
McCarthy, P.M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381--392.
Tweedie, F.J. & Baayen, R.H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 32(5), 323--352.
kRp.POS.tags, kRp.tagged-class, kRp.TTR-class
## Not run:
lex.div(tagged.text)
## End(Not run)