quanteda (version 0.9.2-0)

corpus: constructor for corpus objects

Description

Creates a corpus from a document source. The currently available document sources are:
  • a character vector (as in R class char) of texts;
  • a corpusSource-class object, constructed using textfile;
  • a tm VCorpus class corpus object, meaning that anything you can use to create a tm corpus, including all of the tm plugins plus the built-in functions of tm for importing pdf, Word, and XML documents, can be used to create a quanteda corpus.
Corpus-level meta-data can be specified at creation, containing (for example) citation information and notes, as can document-level variables and document-level meta-data.

Usage

corpus(x, ...)

## S3 method for class 'character'
corpus(x, enc = NULL, encTo = "UTF-8", docnames = NULL, docvars = NULL, source = NULL, notes = NULL, citation = NULL, ...)

## S3 method for class 'corpusSource'
corpus(x, ...)

## S3 method for class 'VCorpus'
corpus(x, ...)

is.corpus(x)

## S3 method for class 'corpus'
c1 + c2

## S3 method for class 'corpus'
x[i, j = NULL, ..., drop = TRUE]

Arguments

x
a source of texts to form the documents in the corpus: a character vector, a corpusSource-class object created using textfile, or a tm VCorpus object.
...
additional arguments
enc
a string specifying the input encoding for texts in the corpus. Must be a valid entry in stri_enc_list(), since the code in corpus.character will convert the texts from this encoding to encTo.
encTo
target encoding, default is UTF-8. Unless you have strong reasons to use an alternative encoding, we strongly recommend you leave this at its default. Must be a valid entry in stri_enc_list()
docnames
Names to be assigned to the texts, defaults to the names of the character vector (if any), otherwise assigns "text1", "text2", etc.
docvars
A data frame of attributes that is associated with each text.
source
A string specifying the source of the texts, used for referencing.
notes
A string containing notes about who created the text, warnings, To Dos, etc.
citation
Information on how to cite the corpus.
c1
corpus one to be added
c2
corpus two to be added
i
index for documents or rows of document variables
j
index for column of document variables
drop
if TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.
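
The default for docnames described above can be sketched in base R. This is only an illustration of the documented rule, not quanteda's own code; the helper name default_docnames is hypothetical.

```r
# Hypothetical helper (not part of quanteda): mimics the documented default
# for the docnames argument using base R only.
default_docnames <- function(x, docnames = NULL) {
  stopifnot(is.character(x))
  # 1. Names supplied explicitly by the caller take precedence.
  if (!is.null(docnames)) {
    stopifnot(length(docnames) == length(x))
    return(as.character(docnames))
  }
  # 2. Otherwise fall back to the names of the character vector, if any.
  if (!is.null(names(x))) {
    return(names(x))
  }
  # 3. Otherwise generate "text1", "text2", etc.
  paste0("text", seq_along(x))
}

unnamed <- c("First document.", "Second document.")
named   <- c(intro = "First document.", body = "Second document.")

default_docnames(unnamed)                          # "text1" "text2"
default_docnames(named)                            # "intro" "body"
default_docnames(unnamed, docnames = c("a", "b"))  # "a" "b"
```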

Value

  • A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus. A corpus consists of a list of elements described below, although these should only be accessed through accessor and replacement functions, not directly (since the internals may be subject to change). The structure of a corpus classed list object is:
  • $documents: A data frame containing the document-level information, consisting of texts, user-named docvars variables describing attributes of the documents, and metadoc document-level metadata whose names begin with an underscore character, such as _language.
  • $metadata: A named list of corpus-level meta-data, including source and created (both generated automatically unless assigned), notes, and citation.
  • $settings: Settings for the corpus which record options that govern the subsequent processing of the corpus when it is converted into a document-feature matrix (dfm). See settings.
  • $tokens: An indexed list of tokens and types tabulated by document, including information on positions. Not yet fully implemented.
  • is.corpus returns TRUE if the object is a corpus.

Details

The texts and document variables of corpus objects can also be accessed using index notation. Indexing a corpus object as a vector will return its text, equivalent to texts(x). Indexing a corpus using two indexes (integers or column names) will return the document variables, equivalent to docvars(x).

The + operator for a corpus object will combine two corpus objects, resolving any non-matching docvars or metadoc fields by setting them to NA for the corpus lacking that field. Corpus-level meta-data is concatenated, except for source and notes, which are stamped with information pertaining to the creation of the new joined corpus.

There are some issues that need to be addressed in future revisions of quanteda concerning the use of factors to store document variables and meta-data. Currently most or all of these are not recorded as factors: the data.frame calls that create and store the document-level information use stringsAsFactors=FALSE, because the texts should always be stored as character vectors and never as factors.
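
The way the + operator fills non-matching docvars with NA can be illustrated with plain data frames. The sketch below uses base R only and is not quanteda's implementation; the helper combine_docvars and the example columns party, year, and language are assumptions made for illustration.

```r
# Hypothetical helper (not quanteda's implementation): row-binds two data
# frames of document variables, padding fields missing on either side with NA.
combine_docvars <- function(d1, d2) {
  all_cols <- union(names(d1), names(d2))
  pad <- function(d) {
    # Add every column this data frame lacks, filled with NA.
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- rep(NA, nrow(d))
    }
    # Put the columns in a common order so rbind() lines them up.
    d[, all_cols, drop = FALSE]
  }
  rbind(pad(d1), pad(d2))
}

# stringsAsFactors = FALSE keeps text-like variables as character vectors.
d1 <- data.frame(party = c("Lab", "Con"), year = c(2010, 2010),
                 stringsAsFactors = FALSE)
d2 <- data.frame(party = "LD", language = "english",
                 stringsAsFactors = FALSE)

combined <- combine_docvars(d1, d2)
combined
```

Here year is NA for the row that came from d2, and language is NA for the rows that came from d1.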

See Also

docvars, metadoc, metacorpus, settings, texts

Examples

# create a corpus from texts
corpus(inaugTexts)

# create a corpus from texts, assigning document variables and a target encoding
ukimmigCorpus <- corpus(ukimmigTexts, 
                        docvars = data.frame(party=names(ukimmigTexts)), 
                        encTo = "UTF-16") 

corpus(texts(ie2010Corpus))

# the fifth column of this csv file is the text field
mytexts <- textfile("http://www.kenbenoit.net/files/text_example.csv", textField = 5)
mycorp <- corpus(mytexts)
mycorp2 <- corpus(textfile("http://www.kenbenoit.net/files/text_example.csv", textField = "Title"))
identical(texts(mycorp), texts(mycorp2))
identical(docvars(mycorp), docvars(mycorp2))
# import a tm VCorpus
if ("tm" %in% rownames(installed.packages())) {
    data(crude, package = "tm")    # load in a tm example VCorpus
    mytmCorpus <- corpus(crude)
    summary(mytmCorpus, showmeta=TRUE)
    
    data(acq, package = "tm")
    summary(corpus(acq), 5, showmeta=TRUE)
    
    tmCorp <- tm::VCorpus(tm::VectorSource(inaugTexts[49:57]))
    quantCorp <- corpus(tmCorp)
    summary(quantCorp)
}
