quanteda (version 2.1.2)

ntoken: Count the number of tokens or types

Description

Get the count of tokens (total features) or types (unique tokens).

Usage

ntoken(x, ...)

ntype(x, ...)

Arguments

x

a quanteda object: a character, corpus, tokens, or dfm object

...

additional arguments passed to tokens()

Value

named integer vector of the counts of the total tokens or types

Details

The precise definition of "tokens" for objects not yet tokenized (e.g. character or corpus objects) can be controlled through optional arguments passed to tokens() through ....

For dfm objects, ntype will only return the count of features that occur more than zero times in the dfm.

Examples

Run this code
# NOT RUN {
# simple example
txt <- c(text1 = "This is a sentence, this.", text2 = "A word. Repeated repeated.")
ntoken(txt)
ntype(txt)
ntoken(char_tolower(txt))  # same
ntype(char_tolower(txt))   # fewer types
ntoken(char_tolower(txt), remove_punct = TRUE)
ntype(char_tolower(txt), remove_punct = TRUE)

# with some real texts
ntoken(corpus_subset(data_corpus_inaugural, Year < 1806), remove_punct = TRUE)
ntype(corpus_subset(data_corpus_inaugural, Year < 1806), remove_punct = TRUE)
ntoken(dfm(corpus_subset(data_corpus_inaugural, Year < 1800)))
ntype(dfm(corpus_subset(data_corpus_inaugural, Year < 1800)))
# }

Run the code above in your browser using DataCamp Workspace