RTextTools (version 1.3.8)

create_corpus: creates a corpus for training, classifying, and analyzing documents.

Description

Given a DocumentTermMatrix from the tm package and corresponding document labels, creates a corpus of class matrix_container-class that can be used for training and classification (i.e. with train_model, train_models, classify_model, and classify_models).

Usage

create_corpus(matrix, labels, trainSize=NULL, testSize=NULL, virgin)

Arguments

matrix
A document-term matrix of class DocumentTermMatrix or TermDocumentMatrix from the tm package, or generated by create_matrix.
labels
A factor or vector of labels corresponding to each document in the matrix.
trainSize
A range (e.g. 1:1000) specifying which documents (rows of the matrix) to use for training the models. Can be left blank when classifying corpora with saved models that don't need to be trained.
testSize
A range (e.g. 1:1000) specifying which documents (rows of the matrix) to use for classification. Can be left blank when training on all data in the matrix.
virgin
A logical (TRUE or FALSE) specifying whether to treat the classification data as virgin data, that is, documents with no known labels.
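
The way trainSize and testSize split a matrix can be sketched in base R. This is only an illustration, assuming the two ranges are used as row indices into the matrix and the labels; the toy objects dtm and labels below are hypothetical stand-ins, not RTextTools objects, and the sketch does not reproduce what create_corpus does internally.

```r
# Toy stand-in for a document-term matrix: 10 documents (rows) x 3 terms (columns).
dtm <- matrix(seq_len(30), nrow = 10,
              dimnames = list(paste0("doc", 1:10), c("term_a", "term_b", "term_c")))

# One label per document, in the same order as the rows of dtm.
labels <- factor(rep(c("x", "y"), times = 5))

# Ranges in the style of the trainSize and testSize arguments.
trainSize <- 1:7
testSize  <- 8:10

# Assumption: each range selects rows, so the two sets do not overlap here.
train_rows   <- dtm[trainSize, , drop = FALSE]
test_rows    <- dtm[testSize, , drop = FALSE]
train_labels <- labels[trainSize]
test_labels  <- labels[testSize]

print(dim(train_rows))
print(rownames(test_rows))
```

Keeping the two ranges disjoint, as in 1:7 and 8:10 here, means the documents used for classification were not seen during training.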

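Value

A corpus of class matrix_container-class that can be passed to train_model, train_models, classify_model, and classify_models.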

Examples

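library(RTextTools)
# Load the NYTimes sample data shipped with the package
data <- read_data(system.file("data/NYTimes.csv.gz", package="RTextTools"), type="csv")
# Draw a random sample of 100 rows from row indices 1 to 3100
data <- data[sample(1:3100, size=100, replace=FALSE),]
# Build a tf-idf weighted document-term matrix from the Title and Subject columns
matrix <- create_matrix(cbind(data$Title, data$Subject), language="english",
                        removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
# Use documents 1 to 75 for training and documents 76 to 100 for classification
corpus <- create_corpus(matrix, data$Topic.Code, trainSize=1:75, testSize=76:100,
                        virgin=FALSE)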
