pamCat: Prediction Analysis of Categorical Data

Description

Performs a Prediction Analysis of Categorical Data.

Usage

pamCat(data, cl, theta = NULL, n.theta = 10, newdata = NULL, newcl = NULL)

Arguments

data

a numeric matrix composed of the integers between 1 and \(n_{cat}\), where \(n_{cat}\) is the number of levels that each of the variables must take. Each row of data represents a variable, and each column an observation. No missing values are allowed.

cl

a numeric vector of length ncol(data) comprising the class labels of the observations represented by the columns of data. cl must consist of the integers between 1 and \(n_{cl}\), where \(n_{cl}\) is the number of classes.
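The requirements on data and cl described above can be checked before calling pamCat. The following is a minimal sketch in base R; the helper check_pamcat_input is hypothetical and not part of the package, and it assumes that every class label between 1 and \(n_{cl}\) should occur at least once.

```r
# Hypothetical helper (not part of the package): checks the documented
# requirements on data and cl.
check_pamcat_input <- function(data, cl) {
  # data: numeric matrix without missing values
  stopifnot(is.matrix(data), is.numeric(data), !anyNA(data))
  # entries must be positive whole numbers, i.e. 1, ..., n.cat
  stopifnot(all(data >= 1), all(data == round(data)))
  n.cat <- max(data)

  # cl: one label per column of data, positive whole numbers, no NAs
  stopifnot(is.numeric(cl), length(cl) == ncol(data), !anyNA(cl))
  stopifnot(all(cl >= 1), all(cl == round(cl)))
  n.cl <- max(cl)
  # assumption: each of the labels 1, ..., n.cl is present
  stopifnot(all(seq_len(n.cl) %in% cl))

  invisible(list(n.cat = n.cat, n.cl = n.cl))
}

# Toy input: 4 variables (rows), 10 observations (columns), 3 levels
toy <- matrix(rep(1:3, length.out = 40), nrow = 4)
toy.cl <- rep(1:2, each = 5)
res <- check_pamcat_input(toy, toy.cl)
```

If any requirement is violated, stopifnot raises an error that names the failing condition.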

theta

a numeric vector consisting of the strictly positive values of the shrinkage parameter used in the Prediction Analysis. If NULL, a vector consisting of n.theta values for the shrinkage parameter is determined automatically.

n.theta

an integer specifying the number of values for the shrinkage parameter of the Prediction Analysis. Ignored if theta is specified.

newdata

a numeric matrix composed of the integers between 1 and \(n_{cat}\). Must have the same number of rows as data, and each row of newdata must contain the same variable as the corresponding row of data. newdata is employed to compute the misclassification rates of the Prediction Analysis for the given values of the shrinkage parameter. If NULL, data is used to determine the misclassification rates.

newcl

a numeric vector of length ncol(newdata) that consists of integers between 1 and \(n_{cl}\) and specifies the class labels of the observations in newdata. Must be specified if newdata is specified.
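As an illustration of what a misclassification rate measures, the sketch below computes the proportion of observations whose predicted class differs from the class label in newcl. The vectors are made-up values, and the sketch assumes the usual definition of the rate; it does not reproduce the internal computation of pamCat.

```r
# Hypothetical values, for illustration only.
newcl <- c(1, 1, 2, 2, 2)   # true class labels of the new observations
pred  <- c(1, 2, 2, 2, 1)   # predicted class labels

# Proportion of observations with a wrong prediction (here 2 of 5)
misclass.rate <- mean(pred != newcl)
misclass.rate
```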

Value

An object of class pamCat composed of

mat.chisq

a matrix with \(m\) rows and \(n_{cl}\) columns consisting of the classwise values of Pearson's \(\chi^2\) statistic for each of the \(m\) variables.

mat.obs

a matrix with \(m\) rows and \(n_{cat} * n_{cl}\) columns in which each row shows a contingency table between the corresponding variable and cl.

mat.exp

a matrix of the same size as mat.obs containing the numbers of observations expected under the null hypothesis of no association between the respective variable and cl.
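The relation between observed counts, expected counts, and classwise \(\chi^2\) values can be sketched for a single variable in base R. This sketch assumes that the classwise value is the sum of \((observed - expected)^2 / expected\) over the levels within a class, so that the classwise values add up to the usual Pearson statistic. The order in which pamCat flattens each table into one row of mat.obs is not stated here and is not reproduced.

```r
# One variable (a single row of data) with 3 levels,
# observed on 10 observations from 2 classes. Made-up values.
x  <- c(1, 1, 2, 3, 3, 1, 2, 2, 3, 3)
cl <- rep(1:2, each = 5)

# Observed counts: levels in rows, classes in columns
obs <- table(factor(x, levels = 1:3), factor(cl, levels = 1:2))

# Expected counts under no association:
# row total * column total / number of observations
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Classwise contributions to Pearson's chi-square (assumed decomposition)
chisq.cl <- colSums((obs - expd)^2 / expd)
chisq.cl

# Their sum is the ordinary Pearson chi-square statistic of the table
sum(chisq.cl)
```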

mat.theta

a data frame consisting of the numbers of variables used in the classification of the observations in newdata and the corresponding misclassification rates for a set of values of the shrinkage parameter \(\theta\).
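Selecting the value of \(\theta\) with the smallest misclassification rate from such a data frame can be sketched as follows. The data frame below is a made-up stand-in, and the column names theta, n.var, and error are assumptions; the actual column names of mat.theta may differ.

```r
# Hypothetical stand-in for mat.theta; values and column names are assumptions.
mat.theta <- data.frame(
  theta = c(0.5, 1.0, 1.5, 2.0),     # values of the shrinkage parameter
  n.var = c(1800, 900, 250, 40),     # numbers of variables used
  error = c(0.10, 0.06, 0.06, 0.20)  # misclassification rates
)

# which.min returns the first minimum, so ties go to the smaller theta here
best <- mat.theta[which.min(mat.theta$error), ]
best
```

In case of ties, one might instead prefer the value of \(\theta\) that uses the fewest variables; which rule the package applies is not stated here.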

tab.cl

a table summarizing the values of the response, i.e. the class labels.

n.cat

\(n_{cat}\), the number of levels of each variable.

References

Schwender, H. (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.

Examples

# NOT RUN {
# Generate a data set consisting of 2000 rows (variables) and 50 columns
# (observations). Assume that the first 25 observations belong to class 1,
# and the other 25 observations to class 2.

mat <- matrix(sample(3, 100000, TRUE), 2000)
rownames(mat) <- paste("SNP", 1:2000, sep = "")
cl <- rep(1:2, each = 25)

# Apply PAM for categorical data to this matrix, and compute the
# misclassification rate on the training set, i.e. on mat.

pam.out <- pamCat(mat, cl)
pam.out

# Now generate a new data set consisting of 20 observations, 
# and predict the classes of these observations using the
# value of theta that has led to the smallest misclassification
# rate in pam.out.

mat2 <- matrix(sample(3, 40000, TRUE), 2000)
rownames(mat2) <- paste("SNP", 1:2000, sep = "")
predict(pam.out, mat2)

# Let's assume that the predicted classes are the real classes
# of the observations. Then, mat2 can also be used in pamCat
# to compute the misclassification rate. 

cl2 <- predict(pam.out, mat2)
pamCat(mat, cl, newdata = mat2, newcl = cl2)

# }