textfile: read a text corpus source from a file

Description

Read a text corpus from one or more source files. When a single file holds the whole corpus, one column contains the texts and the remaining columns contain document variables and document-level meta-data. For spreadsheet-like files, the first row must be a header.

Usage

textfile(file, textField, encodingFrom = NULL, encodingTo = "UTF-8",
  cache = FALSE, docvarsfrom = c("filenames"), dvsep = "_",
  docvarnames = NULL, ...)

  ## S4 method for signature 'character,index,missing,missing,ANY,missing,missing,missing':
textfile(file,
  textField, encodingFrom = NULL, encodingTo = "UTF-8", cache = FALSE,
  docvarsfrom = c("filenames"), dvsep = "_", docvarnames = NULL, ...)

  ## S4 method for signature 'character,missing,ANY,ANY,ANY,missing,missing,missing':
textfile(file,
  textField, encodingFrom = NULL, encodingTo = "UTF-8", cache = FALSE,
  docvarsfrom = c("filenames"), dvsep = "_", docvarnames = NULL, ...)

  ## S4 method for signature 'character,missing,missing,missing,ANY,character,ANY,ANY':
textfile(file,
  textField, encodingFrom = NULL, encodingTo = "UTF-8", cache = FALSE,
  docvarsfrom = c("filenames"), dvsep = "_", docvarnames = NULL, ...)

Arguments

file

the complete filename(s) to be read. The value can be a single file name, a vector of file names, or a file "mask" using a "glob"-type wildcard. The Examples section reads plain text (.txt), JSON (.json), XML (.xml), and comma-separated (.csv) files.

textField

a variable (column) name or column number indicating where to find the texts that form the documents for the corpus. This must be specified for file types .csv and .json.

encodingFrom

a single character value specifying the input file encoding, or a vector of character values in which each element corresponds to a single file, if a file mask or multiple filenames are supplied as file. These work in the same way as the

encodingTo

an optional value that can specify the encoding you wish the files to be converted to, but we strongly encourage you to use the default of UTF-8.

cache

If TRUE, write the object to a temporary file and store the temporary filename in the corpusSource-class object definition. If FALSE, return the data in the object. Caching the fil

docvarsfrom

used to specify that docvars should be taken from the filenames, when the textfile inputs are filenames and the elements of the filenames are document variables, separated by a delimiter (dvsep). This allows easy assignment o

dvsep

separator used in filenames to delimit docvar elements if docvarsfrom="filenames" is used

docvarnames

character vector of variable names for docvars, if docvarsfrom is specified. If this argument is not used, default docvar names will be used (docvar1, docvar2, ...).

...

additional arguments passed through to other functions
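
To make the file, docvarsfrom, dvsep, and docvarnames arguments more concrete, here is a minimal base R sketch of the idea: a glob mask expands to a vector of file names, and each file name is split on a separator to give document variables. This is an illustration only, not textfile's actual implementation; the file names, the variable names (year, country, party), and the demo directory are all hypothetical.

```r
# Illustration only: base R, not quanteda's internal code.
# Create a fresh temporary directory with hypothetical file names whose
# elements (year, country, party) are separated by "_".
corpus_dir <- tempfile("textfile_demo_")
dir.create(corpus_dir)
fnames <- c("2010_UK_Labour.txt", "2010_UK_Conservative.txt",
            "2015_IE_FineGael.txt")
for (f in fnames) writeLines("Placeholder text.", file.path(corpus_dir, f))

# A "glob"-type mask expands to the matching file names.
files <- Sys.glob(file.path(corpus_dir, "*.txt"))

dvsep <- "_"
docvarnames <- c("year", "country", "party")  # set to NULL for default names

# Drop the directory and extension, then split each name on the separator.
stems <- tools::file_path_sans_ext(basename(files))
parts <- strsplit(stems, dvsep, fixed = TRUE)
docvars_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Fall back to docvar1, docvar2, ... when no names are supplied.
if (is.null(docvarnames)) {
  docvarnames <- paste0("docvar", seq_len(ncol(docvars_df)))
}
names(docvars_df) <- docvarnames
print(docvars_df)
```

Each row of docvars_df corresponds to one file matched by the mask. With textfile itself, the equivalent request would be made through docvarsfrom = "filenames" together with dvsep and docvarnames.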

Value

an object of class corpusSource-class that can be read by corpus to construct a corpus

Details

When cache = TRUE, the constructor does not keep the texts in the returned object. Instead, it reads in the texts and associated data and saves them to a temporary disk file whose location is recorded in the corpusSource-class object. This prevents a complete copy of the data from cluttering the global environment and consuming additional space. However, the temporary file is not portable across platforms and may not persist across sessions, so the recommended usage is to load the data into a corpus in the same session in which textfile is called.
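
The caching pattern described here can be sketched in base R as follows. This is a hedged illustration of the general idea (data written to a session-specific temporary file, with only the path kept in the object), not the package's own code; the list element name cachedfile and the sample texts are assumptions made for the example.

```r
# Illustration only: a stand-in for a cached source object.
texts <- c(doc1 = "First document.", doc2 = "Second document.")

# Write the data to a temporary file; tempfile() points inside the
# per-session temporary directory, which R removes when the session ends.
cache_file <- tempfile(fileext = ".rds")
saveRDS(texts, cache_file)

# The "source" object holds only the file location, not the texts.
src <- list(cachedfile = cache_file)  # hypothetical structure

# Later in the SAME session, the data can be read back from the cache.
restored <- readRDS(src$cachedfile)
print(identical(restored, texts))
```

Because the cached file lives in the session's temporary directory, the stored path is of no use in a new session, which is why the data should be loaded into a corpus in the session that created it.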

Examples

# Twitter json
mytf1 <- textfile("http://www.kenbenoit.net/files/tweets.json")
summary(corpus(mytf1), 5)
# generic json - needs a textField specifier
mytf2 <- textfile("http://www.kenbenoit.net/files/sotu.json",
                  textField = "text")
summary(corpus(mytf2))
# text file
mytf3 <- textfile("https://www.gutenberg.org/cache/epub/2701/pg2701.txt", cache = FALSE)
summary(corpus(mytf3))
# XML data
mytf6 <- textfile("http://www.kenbenoit.net/files/plant_catalog.xml", 
                  textField = "COMMON")
summary(corpus(mytf6))
# csv file
write.csv(data.frame(inaugSpeech = texts(inaugCorpus), docvars(inaugCorpus)), 
          file = "/tmp/inaugTexts.csv", row.names = FALSE)
mytf7 <- textfile("/tmp/inaugTexts.csv", textField = "inaugSpeech")
summary(corpus(mytf7))

# vector of full filenames for a recursive structure
textfile(list.files(path = "~/Desktop/texts", pattern = "\\.txt$", 
                    full.names = TRUE, recursive = TRUE))
