
RKorAPClient (version 1.1.0)

corpusStats,KorAPConnection-method: Get corpus size and statistics

Description

Retrieve information about corpus size (documents, tokens, sentences, paragraphs) for the entire corpus or a virtual corpus subset.

Usage

# S4 method for KorAPConnection
corpusStats(kco, vc = "", verbose = kco@verbose, as.df = FALSE)

Value

Object containing corpus statistics with the following information:

vc

Virtual corpus definition used (empty string for entire corpus)

documents

Total number of documents in the (virtual) corpus

tokens

Total number of word tokens in the (virtual) corpus

sentences

Total number of sentences in the (virtual) corpus

paragraphs

Total number of paragraphs in the (virtual) corpus

webUIRequestUrl

URL to view this corpus subset in the KorAP web interface

When as.df=TRUE, returns a data frame with these columns. When as.df=FALSE (default), returns a KorAPCorpusStats object with these values as slots.
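
As a sketch of working with the data frame form, the helper below derives average sizes from the columns listed above. The function name corpus_ratios and the numbers in the sample data frame are made up for illustration. They are not real corpus figures, and the helper is not part of RKorAPClient.

```r
# Hypothetical helper (not part of RKorAPClient): derive simple ratios
# from a data frame that has the columns described above.
corpus_ratios <- function(stats_df) {
  needed <- c("vc", "documents", "tokens", "sentences")
  missing <- setdiff(needed, names(stats_df))
  if (length(missing) > 0) {
    stop("missing columns: ", paste(missing, collapse = ", "))
  }
  data.frame(
    vc = stats_df$vc,
    tokens_per_document = stats_df$tokens / stats_df$documents,
    tokens_per_sentence = stats_df$tokens / stats_df$sentences,
    stringsAsFactors = FALSE
  )
}

# Placeholder values only (assumption for illustration); a real data frame
# would come from corpusStats(kco, vc, as.df = TRUE), which needs network access.
example_df <- data.frame(
  vc = "pubDate in 2020",
  documents = 100,
  tokens = 50000,
  sentences = 2500,
  paragraphs = 800,
  stringsAsFactors = FALSE
)
ratios <- corpus_ratios(example_df)
```

With the default as.df=FALSE, the same values would instead be read from the slots of the returned object, e.g. stats@tokens / stats@documents.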

Arguments

kco

KorAPConnection object (obtained e.g. from KorAPConnection())

vc

string describing the virtual corpus. An empty string (default) means the whole corpus, to the extent that its licenses make it accessible.

verbose

logical. If TRUE, additional diagnostics are printed.

as.df

logical. If TRUE, the result is returned as a data frame instead of an S4 object.
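
Because vc is an ordinary string, virtual corpus definitions can be assembled with base R string functions before they are passed to corpusStats(). The sketch below builds one definition per year, using the "pubDate in <year>" form and the "&" conjunction that appear in the examples on this page. vc_for_years is a hypothetical helper name, not a function from RKorAPClient.

```r
# Hypothetical helper (not part of RKorAPClient): build one virtual corpus
# definition per year, optionally combined with a further constraint.
vc_for_years <- function(years, extra = NULL) {
  vcs <- paste("pubDate in", years)
  if (!is.null(extra) && nzchar(extra)) {
    # Join the extra constraint with "&", as in the examples on this page
    vcs <- paste(vcs, "&", extra)
  }
  vcs
}

year_vcs <- vc_for_years(2015:2017)
news_vcs <- vc_for_years(2017, extra = "textType=/Zeitung.*/")
```

Each element of the resulting character vector could then be supplied as the vc argument, one call per definition.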

Usage

# Get statistics for entire corpus
kcon <- KorAPConnection()
stats <- corpusStats(kcon)

# Get statistics for a specific time period
stats <- corpusStats(kcon, "pubDate in 2020")

# Access the number of tokens
stats@tokens

Examples

if (FALSE) {

kco <- KorAPConnection()

# Get statistics for entire corpus (returns S4 object)
stats <- corpusStats(kco)
stats@tokens  # Access number of tokens

# Get statistics for newspaper texts from 2017 (as data frame)
df <- corpusStats(kco, "pubDate in 2017 & textType=/Zeitung.*/", as.df = TRUE)
df$documents  # Access number of documents

# Compare corpus sizes across years
years <- 2015:2020
sizes <- sapply(years, function(y) {
  corpusStats(kco, paste("pubDate in", y))@tokens
})
}
