quanteda (version 0.7.2-1)

textfile: read a text corpus source from a file

Description

Read a text corpus from a single source file that holds the texts in one column, with document variables and document-level meta-data in additional columns. For spreadsheet-like files, the first row must be a header.

Usage

textfile(file, textField, directory = NULL, docvarsfrom = c("filenames"),
  sep = "_", docvarnames = NULL, ...)

## S4 method for signature 'character,index,missing,missing,missing,missing'
textfile(file, textField, directory = NULL, docvarsfrom = c("filenames"),
  sep = "_", docvarnames = NULL, ...)

## S4 method for signature 'character,missing,missing,missing,missing,missing'
textfile(file, textField, directory = NULL, docvarsfrom = c("filenames"),
  sep = "_", docvarnames = NULL, ...)

## S4 method for signature 'character,missing,missing,character,ANY,ANY'
textfile(file, textField = NULL, directory = NULL, docvarsfrom = c("headers"),
  sep = "_", docvarnames = NULL, ...)

Arguments

file
the complete filename to be read. Currently available file types include .txt, .json, and .csv files, as well as wildcard patterns that match several files (for example *.txt).
textField
a variable (column) name or column number indicating where to find the texts that form the documents for the corpus. This must be specified for file types .csv and .json.
directory
not yet used; may be removed if this functionality moves to a new method called textfiles
docvarsfrom
used to specify that docvars should be taken from the filenames, when the textfile inputs are filenames and the elements of the filenames are document variables, separated by a delimiter (sep). This allows easy assignment of docvars from filenames.
sep
separator used in filenames to delimit docvar elements if docvarsfrom="filenames" is used
docvarnames
character vector of variable names for docvars, if docvarsfrom is specified. If this argument is not used, default docvar names will be used (docvar1, docvar2, ...).
...
additional arguments passed through to other functions
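
The interaction of docvarsfrom = "filenames", sep, and docvarnames described above can be illustrated with a minimal base R sketch. It is not quanteda's implementation; the filenames are hypothetical, chosen to match the sep = "-" and docvarnames = c("Year", "President") call in the Examples, and the variable names are assumptions made for illustration.

```r
# Hypothetical filenames (assumed for illustration, not shipped with the package).
filenames <- c("1789-Washington.txt", "1797-Adams.txt", "1801-Jefferson.txt")
sep <- "-"
docvarnames <- c("Year", "President")

# Drop the directory and the file extension, then split each name on the separator.
stems <- tools::file_path_sans_ext(basename(filenames))
parts <- strsplit(stems, sep, fixed = TRUE)

# Bind the pieces into a data frame with one row per document.
docvars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Fall back to docvar1, docvar2, ... when no names are supplied.
if (is.null(docvarnames)) {
  docvarnames <- paste0("docvar", seq_len(ncol(docvars)))
}
names(docvars) <- docvarnames
rownames(docvars) <- filenames

print(docvars)
```

Each element of a filename becomes one column, so every file must contain the same number of separator-delimited elements for the rows to line up.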

Value

an object of class corpusSource-class that can be passed to corpus to construct a corpus

Details

The constructor does not store a copy of the texts. Instead, it reads in the texts and associated data and saves them to a temporary R object whose location is recorded in the corpusSource-class object. This prevents a complete copy of the object from cluttering the global environment and consuming additional space. It does mean, however, that the temporary file holding the source data is not portable across platforms and may not persist across sessions. The recommended usage is therefore to load the data into a corpus in the same session in which textfile is called.
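
The general pattern of keeping only a path to temporarily stored data can be sketched in base R as follows. This is an illustration of the idea under stated assumptions, not the package's actual code; the object names texts, cache_path, and restored are hypothetical.

```r
# Hypothetical stand-in for texts read from a source file.
texts <- c(doc1 = "First text.", doc2 = "Second text.")

# Save the data to a file in the session's temporary directory
# and keep only the path, not the data itself.
cache_path <- tempfile(fileext = ".RData")
save(texts, file = cache_path)
rm(texts)

# Later in the same session, restore the data into a private
# environment so nothing is added to the global environment.
env <- new.env()
load(cache_path, envir = env)
restored <- get("texts", envir = env)
```

Because R normally removes its per-session temporary directory when the session ends, a path stored this way cannot be relied on in a later session, which is why the data should be loaded into a corpus straight away.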

Examples

# Twitter json
mytf <- textfile("~/Dropbox/QUANTESS/corpora/misc/NinTANDO_Me.json")
summary(corpus(mytf))
# generic json - needs a textField specifier
mytf2 <- textfile("~/Dropbox/QUANTESS/Manuscripts/Collocations/Corpora/sotu/sotu.json",
                  textField = "text")
summary(corpus(mytf2))
# text file
mytf3 <- textfile("~/Dropbox/QUANTESS/corpora/project_gutenberg/pg2701.txt")
summary(corpus(mytf3))
# multiple text files matched by a wildcard
mytf4 <- textfile("~/Dropbox/QUANTESS/corpora/inaugural/*.txt")
summary(corpus(mytf4))
# wildcard, with docvars taken from the filenames
mytf5 <- textfile("~/Dropbox/QUANTESS/corpora/inaugural/*.txt",
                  docvarsfrom = "filenames", sep = "-", docvarnames = c("Year", "President"))
summary(corpus(mytf5))