maxent: trains a maximum entropy model given a training matrix and a vector or factor of labels.

Description

Trains a multinomial logistic regression model of class maxent-class given a matrix or matrix.csr with training data, and a vector or factor with corresponding labels. The l1_regularizer, l2_regularizer, and set_heldout parameters help prevent overfitting.
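
For orientation, multinomial logistic regression is conventionally written as a softmax over per-label weight vectors. The symbols below (x for one row of the feature matrix, y for a label, w_y for that label's weights) are notation introduced here, not names from the package:

```latex
P(y \mid x) = \frac{\exp\left(w_y^\top x\right)}{\sum_{y'} \exp\left(w_{y'}^\top x\right)}
```

In this form there is one weight per (label, feature) pair, which is consistent with the three-column weights table described under Value.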

Usage

maxent(feature_matrix, code_vector, l1_regularizer=0.0, l2_regularizer=0.0, use_sgd=FALSE, set_heldout=0, verbose=FALSE)

Arguments

feature_matrix

A DocumentTermMatrix or TermDocumentMatrix (package tm), Matrix (package Matrix), matrix.csr (SparseM), data.frame, or matrix.

code_vector

A factor or vector of labels corresponding to each document in the feature_matrix.

l1_regularizer

A numeric value that turns on L1 regularization and sets the regularization parameter. A value of 0 disables L1 regularization.

l2_regularizer

A numeric value that turns on L2 regularization and sets the regularization parameter. A value of 0 disables L2 regularization.

use_sgd

A logical indicating whether stochastic gradient descent (SGD) should be used for parameter estimation. Defaults to FALSE.

set_heldout

An integer specifying the number of documents to hold out. The held-out subset is used to test the model during training, which helps prevent overfitting. Defaults to 0.

verbose

A logical specifying whether to provide descriptive output about the training process. Defaults to FALSE, or no output.

Value

model: A character vector containing the trained maximum entropy model.
weights: A data.frame listing all the weights in three columns: Weight, Label, and Feature.
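
A short sketch of inspecting the returned object. It assumes the two values above are stored as S4 slots (accessed with @), which this page does not state explicitly; the toy matrix and labels are made up for illustration.

```r
# Toy data (hypothetical): 6 documents, 2 term-count features.
toy_features <- matrix(c(3, 0,
                         2, 1,
                         4, 0,
                         0, 3,
                         1, 2,
                         0, 4),
                       ncol = 2, byrow = TRUE)
toy_labels <- factor(c("a", "a", "a", "b", "b", "b"))

# Run the model-dependent part only if the maxent package is installed.
if (requireNamespace("maxent", quietly = TRUE)) {
  library(maxent)
  fit <- maxent(toy_features, toy_labels)

  # Assumption: the result is an S4 object with slots named model and weights.
  print(methods::slotNames(fit))
  print(head(fit@weights))  # expected columns: Weight, Label, Feature
}
```

If the slot names differ in your installed version, str(fit) shows the actual layout.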

Details

Yoshimasa Tsuruoka recommends using one of the following three methods if you see overfitting.

1. Set the l1_regularizer parameter to 1.0, leaving l2_regularizer and set_heldout at their defaults.
2. Set the l2_regularizer parameter to 1.0, leaving l1_regularizer and set_heldout at their defaults.
3. Set the set_heldout parameter to hold out a portion of your data, leaving l1_regularizer and l2_regularizer at their defaults.

If you are using a large number of training samples, try setting the use_sgd parameter to TRUE.
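
The options described in this section map directly onto the arguments listed under Usage. The following is a minimal sketch rather than a recommended configuration: the toy matrix, the labels, and the held-out count of 2 are made up for illustration, and a real corpus would be far larger.

```r
# Toy data (hypothetical): 8 documents, 3 term-count features.
features <- matrix(c(2, 0, 1,
                     3, 0, 0,
                     2, 1, 0,
                     4, 0, 1,
                     0, 2, 3,
                     0, 3, 2,
                     1, 2, 4,
                     0, 4, 3),
                   ncol = 3, byrow = TRUE)
labels <- factor(rep(c("sports", "politics"), each = 4))

# Run the model-dependent part only if the maxent package is installed.
if (requireNamespace("maxent", quietly = TRUE)) {
  library(maxent)

  # Method 1: L1 regularization only.
  m_l1 <- maxent(features, labels, l1_regularizer = 1.0)

  # Method 2: L2 regularization only.
  m_l2 <- maxent(features, labels, l2_regularizer = 1.0)

  # Method 3: hold out some documents, no regularization.
  m_heldout <- maxent(features, labels, set_heldout = 2)

  # Large training sets: switch parameter estimation to SGD.
  m_sgd <- maxent(features, labels, use_sgd = TRUE)
}
```

Each call changes one argument and leaves the others at their defaults, matching the "one of the following three methods" wording above.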

References

Y. Tsuruoka. "A simple C++ library for maximum entropy classification." University of Tokyo Department of Computer Science (Tsujii Laboratory), 2011. URL http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/.

Examples

# LOAD LIBRARY
library(maxent)

# READ THE DATA, PREPARE THE CORPUS, and CREATE THE MATRIX
data <- read.csv(system.file("data/NYTimes.csv.gz",package="maxent"))
corpus <- Corpus(VectorSource(data$Title[1:150]))
matrix <- DocumentTermMatrix(corpus)

# TRAIN USING SPARSEM REPRESENTATION
sparse <- as.compressed.matrix(matrix)
model <- maxent(sparse[1:100,],as.factor(data$Topic.Code)[1:100])

# A DIFFERENT EXAMPLE (taken from package e1071)
# CREATE DATA
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)

# ESTIMATE MODEL AND PREDICT INPUT VALUES
m <- maxent(x, y)
new <- predict(m, x)

# VISUALIZE
plot(x, y)
points(x, log(x), col = 2)
points(x, new[,1], col = 4)
