R Factor Utilities

Description

Utilities to manipulate R factors, extending the ones in regtools.

Usage

levelCounts(data)
dataToTopLevels(data,lowCountThresholds)
factorToTopLevels(f,lowCountThresh=0)
cartesianFactor(dataName,factorNames,fNameSep = ".")
qeRareLevels(x, yName, yesYVal = NULL)

Arguments

data: A data frame or equivalent.
f: An R factor.
lowCountThresh: Factor levels with counts below this value will not be used for this factor.
lowCountThresholds: An R list of column names and their corresponding values of lowCountThresh.
dataName: A quoted name of a data frame or equivalent.
factorNames: A vector of R factor names.
fNameSep: A character to be used as a delimiter in the names of the levels of the output factor.
x: A data frame.
yName: Quoted name of the response variable.
yesYVal: In the case of binary Y, the factor level to be considered positive.

Author

Norm Matloff

Details

Often an R factor has one or more levels that are rare in the data. This can cause problems, for example in cross-validation: a level in the test set might be "new," not having appeared in the training set. To address this, factorToTopLevels removes rare levels from a factor; dataToTopLevels applies this to an entire data frame.
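The idea of discarding rare levels can be sketched in base R. The helper dropRareLevels below is hypothetical and is not part of the package; it only illustrates the thresholding described for lowCountThresh, and the actual return value of factorToTopLevels may differ.

```r
# Hypothetical helper, NOT the package's implementation: keep only the
# levels whose counts are at least lowCountThresh.
dropRareLevels <- function(f, lowCountThresh = 0) {
  counts <- table(f)
  keep <- names(counts)[counts >= lowCountThresh]
  # Observations in a discarded level become NA; discarded levels vanish.
  factor(f, levels = keep)
}

f <- factor(c("a", "a", "a", "b", "b", "c"))
g <- dropRareLevels(f, lowCountThresh = 2)
levels(g)                  # the level "c" (count 1) is gone
table(g, useNA = "ifany")  # the former "c" observation is now NA
```

In this sketch the rows with rare levels are kept and marked NA, so the caller can decide whether to drop them or treat them separately.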

Also toward this end, the function levelCounts simply applies table() to each column of data, returning the result as an R list. (If there are more than 10 levels, it returns NA.)
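The per-column tabulation described above can be approximated with lapply() and table(). This is a minimal sketch on a made-up data frame (df is an assumption for illustration); it leaves out the NA rule for columns with many levels.

```r
# Toy data frame, invented for illustration only.
df <- data.frame(
  gender = factor(c("male", "female", "male", "male")),
  occ    = factor(c("100", "101", "100", "102"))
)

# Apply table() to each column; the result is a named list of tables.
counts <- lapply(df, table)
counts$gender
counts$occ
```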

The function cartesianFactor generates a "superfactor" from individual ones; e.g. if factors f1 and f2 have n1 and n2 levels, the output is a new factor with n1 * n2 levels.
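For comparison, base R's interaction() also builds a product factor with n1 * n2 levels. The sketch below uses made-up factors and is not a statement about how cartesianFactor is implemented; in particular, the ordering of the levels may differ from that of cartesianFactor.

```r
# Toy factors, invented for illustration only.
f1 <- factor(c("female", "male", "female", "male"))  # n1 = 2 levels
f2 <- factor(c("102", "101", "100", "100"))          # n2 = 3 levels

# interaction() forms every combination of levels, joined by sep.
z <- interaction(f1, f2, sep = ".")
nlevels(z)        # equals nlevels(f1) * nlevels(f2)
as.character(z)   # one combined label per observation
```

Combinations that never occur in the data still appear as levels here, which is what gives the full n1 * n2 count.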

The function qeRareLevels checks each column of a data frame to see whether it is an R factor with rare levels.

Examples

data(svcensus)
levelCounts(svcensus)  # e.g. finds there are 15182 men, 4908 women
f1 <- svcensus$gender  # 2 levels
f2 <- svcensus$occ  # 6 levels
z <- cartesianFactor('svcensus',c('gender','occ'))
head(z)
# [1] female.102 male.101   female.102 male.100   female.100 male.100  
# 12 Levels: female.100 female.101 female.102 female.106 ... male.141
