corpus: constructor for corpus objects

Description

Creates a corpus from a document source. The currently available document sources are:

a character vector (as in R class character) of texts;
a corpusSource-class object, constructed using textfile;
a tm VCorpus class corpus object, meaning that anything you can use to create a tm corpus, including all of the tm plugins plus the built-in functions of tm for importing PDF, Word, and XML documents, can be used to create a quanteda corpus.

Corpus-level meta-data can be specified at creation, containing (for example) citation information and notes, as can document-level variables and document-level meta-data.

Usage

corpus(x, ...)
## S3 method for class 'character':
corpus(x, enc = NULL, docnames = NULL, docvars = NULL,
  source = NULL, notes = NULL, citation = NULL, ...)
## S3 method for class 'corpusSource':
corpus(x, enc = NULL, notes = NULL,
  citation = NULL, ...)
## S3 method for class 'VCorpus':
corpus(x, enc = NULL, notes = NULL, citation = NULL,
  ...)
is.corpus(x)
## S3 method for class 'corpus':
+(c1, c2)

Arguments

x

A source of texts to form the documents in the corpus: a character vector, a corpusSource-class object created using textfile, or a tm VCorpus object.

...

additional arguments

enc

A string specifying the input encoding for texts in the corpus. Must be a valid entry in iconvlist(), since the code in corpus.character will convert the texts from this encoding to UTF-8.

docnames

Names to be assigned to the texts, defaults to the names of the character vector (if any), otherwise assigns "text1", "text2", etc.

docvars

A data frame of attributes that is associated with each text.

source

A string specifying the source of the texts, used for referencing.

notes

A string containing notes about who created the text, warnings, To Dos, etc.

citation

Information on how to cite the corpus.

c1

The first corpus to be added.

c2

The second corpus to be added.
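
As a rough illustration of the enc and docnames arguments described above, the following base R sketch converts texts to UTF-8 and applies the documented naming defaults. The helper prepare_texts() is hypothetical and is not part of quanteda; it only mirrors the documented behaviour under the assumption that the conversion is done with iconv().

```r
# Hypothetical helper (not part of quanteda): mimics the documented
# behaviour of the enc and docnames arguments using base R only.
prepare_texts <- function(x, enc = NULL, docnames = NULL) {
  stopifnot(is.character(x))
  original_names <- names(x)
  if (!is.null(enc)) {
    # convert from the declared input encoding to UTF-8
    x <- iconv(x, from = enc, to = "UTF-8")
  }
  if (is.null(docnames)) {
    if (!is.null(original_names)) {
      # default: reuse the names of the character vector, if any
      docnames <- original_names
    } else {
      # otherwise fall back to "text1", "text2", ...
      docnames <- paste0("text", seq_along(x))
    }
  }
  names(x) <- docnames
  x
}

# "café" encoded as Latin-1 bytes (0xe9 is e-acute in Latin-1)
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
out <- prepare_texts(c(latin1_text, "plain"), enc = "latin1")
names(out)
```

Passing a named character vector instead would keep its names as the document names, which matches the default described for docnames.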

Value

A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus. A corpus consists of a list of elements described below, although these should only be accessed through accessor and replacement functions, not directly (since the internals may be subject to change). The structure of a corpus classed list object is:

$documents

A data frame containing the document-level information, consisting of texts, user-named docvars variables describing attributes of the documents, and metadoc document-level metadata whose names begin with an underscore character, such as _language.

$metadata

A named list of corpus-level meta-data, including source and created (both generated automatically unless assigned), notes, and citation.

$settings

Settings for the corpus which record options that govern the subsequent processing of the corpus when it is converted into a document-feature matrix (dfm). See settings.

$tokens

An indexed list of tokens and types tabulated by document, including information on positions. Not yet fully implemented.

is.corpus returns TRUE if the object is a corpus.
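
To make the layout above concrete, here is a toy stand-in built with base R only. It is a sketch of the described structure, not quanteda's actual internals: the class name "toyCorpus" and the accessors toy_texts() and toy_docvars() are hypothetical, and they exist only to show why going through accessors insulates calling code from changes to the list layout.

```r
# Toy stand-in for the list structure described above (hypothetical,
# not quanteda's real implementation).
toy <- structure(
  list(
    documents = data.frame(
      texts = c("First document.", "Second document."),
      party = c("A", "B"),            # a user-named docvar
      `_language` = c("en", "en"),    # metadoc field, underscore prefix
      row.names = c("text1", "text2"),
      check.names = FALSE,            # keep the leading underscore
      stringsAsFactors = FALSE
    ),
    metadata = list(source = "toy example", created = date(),
                    notes = NULL, citation = NULL),
    settings = list(),
    tokens = NULL
  ),
  class = c("toyCorpus", "list")
)

# Hypothetical accessors: callers use these instead of toy$documents
toy_texts <- function(corp) {
  setNames(corp$documents$texts, rownames(corp$documents))
}
toy_docvars <- function(corp) {
  d <- corp$documents
  keep <- !(names(d) == "texts" | startsWith(names(d), "_"))
  d[, keep, drop = FALSE]
}

toy_texts(toy)
toy_docvars(toy)
```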

Details

The + operator combines two corpus objects, resolving any non-matching docvars or metadoc fields by filling them with NA values for the corpus lacking that field. Corpus-level meta-data is concatenated, except for source and notes, which are stamped with information pertaining to the creation of the new joined corpus.
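
The NA-filling rule for non-matching fields can be sketched on plain data frames. The helper combine_docs() below is hypothetical and only illustrates the idea; it is not the code behind the + method.

```r
# Hypothetical helper: row-binds two document-level data frames, adding
# NA-filled columns for fields that only one of them has.
combine_docs <- function(d1, d2) {
  all_cols <- union(names(d1), names(d2))
  for (col in setdiff(all_cols, names(d1))) d1[[col]] <- NA
  for (col in setdiff(all_cols, names(d2))) d2[[col]] <- NA
  rbind(d1[, all_cols, drop = FALSE], d2[, all_cols, drop = FALSE])
}

d1 <- data.frame(texts = c("a", "b"), party = c("X", "Y"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(texts = "c", year = 2010, stringsAsFactors = FALSE)

combined <- combine_docs(d1, d2)
combined
```

In the result, party is NA for the row that came from d2 and year is NA for the rows that came from d1.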

Future revisions of quanteda need to address the use of factors to store document variables and meta-data. Currently most or all of these are not stored as factors, because the data.frame calls that create and store the document-level information use stringsAsFactors=FALSE. This ensures that the texts are always stored as character vectors and never as factors.
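
The effect of stringsAsFactors on a data frame that holds texts can be seen with base R alone. The data below are made up for illustration. Note that since R 4.0.0 the default for stringsAsFactors in data.frame is already FALSE, so the explicit argument matters mainly on older versions of R.

```r
# With stringsAsFactors = FALSE, texts and other strings stay character
docs <- data.frame(texts = c("one text", "another text"),
                   party = c("A", "B"),
                   stringsAsFactors = FALSE)
class(docs$texts)

# With stringsAsFactors = TRUE, the same columns become factors, which
# store integer codes plus a set of levels rather than the strings
docs_f <- data.frame(texts = c("one text", "another text"),
                     party = c("A", "B"),
                     stringsAsFactors = TRUE)
class(docs_f$texts)

# String functions work directly on the character version
nchar(docs$texts)
```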

Examples

#
# create a corpus from texts
corpus(inaugTexts)

# create a corpus from texts and assign meta-data and document variables
ukimmigCorpus <- corpus(ukimmigTexts,
                            docvars=data.frame(party=names(ukimmigTexts)),
                            enc="UTF-8")
# the fifth column of this csv file is the text field
mytexts <- textfile("http://www.kenbenoit.net/files/text_example.csv", textField=5)
str(mytexts)
mycorp <- corpus(mytexts)
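# select the text field by column name instead of by position
mycorp2 <- corpus(textfile("http://www.kenbenoit.net/files/text_example.csv", textField="Title"))
# compare the texts and document variables of the two corpora
identical(texts(mycorp), texts(mycorp2))
identical(docvars(mycorp), docvars(mycorp2))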
#
## import a tm VCorpus
if (require(tm)) {
    data(crude)    # load in a tm example VCorpus
    mytmCorpus <- corpus(crude)
    summary(mytmCorpus, showmeta=TRUE)
}
