quanteda (version 2.1.2)

textstat_summary: Summarize documents

Description

Count the total number of characters, tokens, and sentences in each document, along with counts of special token types such as numbers, punctuation marks, symbols, tags, and emojis.

Usage

textstat_summary(x, cache = TRUE, ...)

Arguments

x

corpus to be summarized

cache

if TRUE, use the internal cache so that the summary is reused rather than recomputed from the second call onwards

...

additional arguments passed through to dfm()
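
A minimal sketch (not from the package documentation) illustrating the cache argument: the first call computes and stores the summary, and a repeated call on the same object with cache = TRUE can reuse the stored result instead of recomputing it.

library(quanteda)
corp <- data_corpus_inaugural
system.time(textstat_summary(corp, cache = TRUE))  # first call: computes and caches
system.time(textstat_summary(corp, cache = TRUE))  # repeated call: typically faster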

Details

Count the total number of characters, tokens and sentences as well as special tokens such as numbers, punctuation marks, symbols, tags and emojis.

  • chars = number of characters; equal to nchar()

  • sents = number of sentences; equal to ntoken(tokens(x, what = "sentence"))

  • tokens = number of tokens; equal to ntoken()

  • types = number of unique tokens; equal to ntype()

  • puncts = number of punctuation marks (^\p{P}+$)

  • numbers = number of numeric tokens (^\p{Sc}{0,1}\p{N}+([.,]*\p{N})*\p{Sc}{0,1}$)

  • symbols = number of symbols (^\p{S}$)

  • tags = number of tags; sum of pattern_username and pattern_hashtag in quanteda_options()

  • emojis = number of emojis (^\p{Emoji_Presentation}+$)
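
The special-token counts correspond to the Unicode regular expressions listed above. The following is a hedged sketch (not part of quanteda) that applies two of those patterns directly to the tokens with stringi; the results should broadly match the puncts and numbers columns returned for the same text.

library(quanteda)
library(stringi)

toks <- tokens("It costs $4.50, which is cheap!")
tok_vec <- unlist(as.list(toks), use.names = FALSE)

# tokens consisting only of punctuation marks -> the "puncts" column
sum(stri_detect_regex(tok_vec, "^\\p{P}+$"))
# numeric tokens, optionally with a currency symbol -> the "numbers" column
sum(stri_detect_regex(tok_vec, "^\\p{Sc}{0,1}\\p{N}+([.,]*\\p{N})*\\p{Sc}{0,1}$"))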

Examples

corp <- data_corpus_inaugural
textstat_summary(corp, cache = TRUE)    # summarize a corpus
toks <- tokens(corp)
textstat_summary(toks, cache = TRUE)    # summarize a tokens object
dfmat <- dfm(toks)
textstat_summary(dfmat, cache = TRUE)   # summarize a dfm