logicFS: Feature Selection with Logic Regression

Description

Identification of interesting interactions between binary variables using logic regression. Currently available for the classification, the linear regression and the logistic regression approach of logreg and for a multinomial logic regression as implemented in mlogreg.

Usage

## Default S3 method:
logicFS(x, y, B = 100, useN = TRUE, ntrees = 1, nleaves = 8,
  glm.if.1tree = FALSE, replace = TRUE, sub.frac = 0.632,
  anneal.control = logreg.anneal.control(), onlyRemove = FALSE,
  prob.case = 0.5, addMatImp = TRUE, fast = FALSE, rand = NULL, ...)

## S3 method for class 'formula':
logicFS(formula, data, recdom = TRUE, ...)

Arguments

x

a matrix consisting of 0's and 1's. Each column must correspond to a binary variable and each row to an observation. Missing values are not allowed.

y

a numeric vector or a factor specifying the values of a response for all the observations represented in x. Missing values are not allowed. If y is a numeric vector, it contains either the class labels (coded by 0 and 1), in which case the classification or logistic regression approach of logic regression is used, or the values of a continuous response, in which case the linear regression approach is used. If the response is categorical, then y must be a factor naming the class labels of the observations.

B

an integer specifying the number of iterations.

useN

logical specifying if the number of correctly classified out-of-bag observations should be used in the computation of the importance measure. If FALSE, the proportion of correctly classified oob observations is used instead.
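As a rough illustration of the two quantities that useN switches between, the following sketch uses made-up out-of-bag results for a single iteration. The vectors truth and pred are hypothetical and are not produced by logicFS in this form.

```r
# Hypothetical out-of-bag results for one iteration (illustrative values only)
truth <- c(1, 0, 1, 1, 0, 0, 1, 0)  # observed classes of the oob observations
pred  <- c(1, 0, 0, 1, 0, 1, 1, 0)  # predicted classes for the same observations

n.correct    <- sum(pred == truth)   # count, the quantity referred to by useN = TRUE
prop.correct <- mean(pred == truth)  # proportion, the quantity referred to by useN = FALSE

n.correct
prop.correct
```

The two differ only by the number of oob observations, which varies between iterations when sampling is done with replacement.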

ntrees

an integer indicating how many trees should be used. For a binary response: If ntrees is larger than 1, the logistic regression approach of logic regression will be used. If ntrees is 1, then by default the classification approach of logic regression will be used (see glm.if.1tree). For a continuous response: A linear regression model with ntrees trees is fitted in each of the B iterations. For a categorical response: n.lev - 1 logic regression models with ntrees trees are fitted, where n.lev is the number of levels of the response (for details, see mlogreg).

nleaves

a numeric value specifying the maximum number of leaves used in all trees combined. For details, see the help page of the function logreg of the package LogicReg.

glm.if.1tree

if ntrees is 1 and glm.if.1tree is TRUE the logistic regression approach of logic regression is used instead of the classification approach. Ignored if ntrees is not 1, or the response is not binary.
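The rules given for ntrees and glm.if.1tree can be summarised in a small decision function. This is only a sketch of the description above: approach is a hypothetical helper, not part of logicFS, and the package may determine the response type differently.

```r
# Hypothetical helper (not part of logicFS) encoding the documented rules.
approach <- function(y, ntrees = 1, glm.if.1tree = FALSE) {
  # Categorical response: must be supplied as a factor
  if (is.factor(y)) return("multinomial logic regression")
  # Binary response coded by 0 and 1
  if (all(y %in% c(0, 1))) {
    if (ntrees > 1 || glm.if.1tree) return("logistic regression")
    return("classification")
  }
  # Any other numeric response is treated as continuous
  "linear regression"
}

approach(c(0, 1, 1, 0))                       # single tree, binary response
approach(c(0, 1, 1, 0), ntrees = 2)           # several trees, binary response
approach(c(0, 1, 1, 0), glm.if.1tree = TRUE)  # single tree, logistic approach forced
approach(c(0.3, 1.7, 2.2))                    # continuous response
approach(factor(c("a", "b", "c")))            # categorical response
```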

replace

should sampling of the cases be done with replacement? If TRUE, a bootstrap sample of size length(y) is drawn from the length(y) observations in each of the B iterations. If FALSE, ceiling(sub.frac * length(y)) of the observations are drawn without replacement in each iteration.

sub.frac

a proportion specifying the fraction of the observations that are used in each iteration to build a classification rule if replace = FALSE. Ignored if replace = TRUE.
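The two sampling schemes selected by replace can be sketched in base R as follows. The sketch assumes n = 100 observations and only mimics the sampling step; the variable names are illustrative and not taken from the package.

```r
set.seed(1)
n <- 100           # assumed number of observations
sub.frac <- 0.632  # default value of sub.frac

# replace = TRUE: bootstrap sample of size n drawn with replacement
in.boot  <- sample(n, n, replace = TRUE)
oob.boot <- setdiff(seq_len(n), in.boot)  # observations never drawn are out-of-bag

# replace = FALSE: subsample of size ceiling(sub.frac * n) drawn without replacement
n.sub   <- ceiling(sub.frac * n)
in.sub  <- sample(n, n.sub, replace = FALSE)
oob.sub <- setdiff(seq_len(n), in.sub)

length(in.sub)   # size of the subsample
length(oob.sub)  # number of out-of-bag observations for the subsample
```

With subsampling, the number of out-of-bag observations is fixed at n - ceiling(sub.frac * n), whereas with the bootstrap it changes from iteration to iteration.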

anneal.control

a list containing the parameters for simulated annealing. See the help of the function logreg.anneal.control in the LogicReg package.

onlyRemove

should the multiple tree measure be used in the single tree case? If TRUE, the prime implicants are only removed from the trees when determining the importance in the single tree case. If FALSE, the original single tree measure is computed for each prime implicant, i.e. a prime implicant is not only removed from the trees in which it is contained, but also added to the trees that do not contain this interaction. Ignored in all cases other than the classification case.

prob.case

a numeric value between 0 and 1. If the outcome of the logistic regression, i.e. the predicted probability, for an observation is larger than prob.case, this observation will be classified as case (or 1).
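A minimal sketch of the thresholding rule, using made-up predicted probabilities (pred.prob is hypothetical). Note that the comparison is strict, so a probability exactly equal to prob.case is not classified as case.

```r
prob.case <- 0.5
pred.prob <- c(0.12, 0.50, 0.51, 0.93)  # hypothetical predicted probabilities

# An observation is classified as case (1) only if its probability exceeds prob.case
pred.class <- as.numeric(pred.prob > prob.case)
pred.class
```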

addMatImp

should the matrix containing the improvements due to the prime implicants in each of the iterations be added to the output? (For each of the prime implicants, the importance is computed by the average over the B improvements.) Must be set to TRUE, if standardized importances should be computed using vim.norm, or if permutation based importances should be computed using vim.signperm.
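The averaging mentioned in parentheses can be illustrated with a small made-up matrix. The layout (one row per prime implicant, one column per iteration) and the implicant labels are assumptions made for this sketch, not a statement about the exact structure returned by logicFS.

```r
# Hypothetical improvement matrix for B = 4 iterations;
# rows = prime implicants, columns = iterations (assumed layout)
mat.imp <- matrix(c(3, 5, 4, 4,
                    0, 1, 0, 1,
                    2, 2, 2, 2),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("X1 & !X2", "X3", "X4 & X5"),
                                  paste0("iter", 1:4)))

# Importance of each prime implicant as the average over the B improvements
vim <- rowMeans(mat.imp)
vim
```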

fast

should a greedy search (as implemented in logreg) be used instead of simulated annealing?

rand

numeric value. If specified, the random number generator will be set into a reproducible state.

formula

an object of class formula describing the model that should be fitted.

data

a data frame containing the variables in the model. Each row of data must correspond to an observation, and each column to a binary variable (coded by 0 and 1) or a factor (for details, see recdom), except for the column comprising the response. Missing values are not allowed in data. The response must be either binary (coded by 0 and 1), categorical or continuous. If continuous, a linear model is fitted in each of the B iterations of logicFS. If categorical, the column of data specifying the response must be a factor. In this case, multinomial logic regressions are performed as implemented in mlogreg. Otherwise, depending on ntrees (and glm.if.1tree) the classification or the logistic regression approach of logic regression is used.

recdom

a logical value or vector of length ncol(data) specifying whether a SNP should be transformed into two binary dummy variables coding for a recessive and a dominant effect. If recdom is TRUE (and a logical value), then all factors/variables with three levels will be coded by two dummy variables as described in make.snp.dummy. Each level of each of the other factors (also factors specifying a SNP that shows only two genotypes) is coded by one indicator variable. If recdom is FALSE (and a logical value), each level of each factor is coded by an indicator variable. If recdom is a logical vector, all factors corresponding to an entry in recdom that is TRUE are assumed to be SNPs and transformed into two binary variables as described above. All variables corresponding to entries of recdom that are TRUE (no matter whether recdom is a vector or a value) must be coded either by the integers 1 (coding for the homozygous reference genotype), 2 (heterozygous), and 3 (homozygous variant), or alternatively by the number of minor alleles, i.e. 0, 1, and 2. The two coding schemes must not be mixed: it is not allowed that some SNPs are coded by 1, 2, and 3, and others are coded by 0, 1, and 2.
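To make the recessive/dominant coding concrete, the following sketch transforms a SNP coded by 1, 2, and 3 into two binary variables. snp_to_dummies is a hypothetical helper written for illustration; it is not make.snp.dummy, and the column names and order are assumptions. It takes the dominant effect to mean at least one variant allele and the recessive effect to mean two variant alleles.

```r
# Hypothetical helper (not make.snp.dummy) for a SNP coded by 1, 2, 3
snp_to_dummies <- function(snp) {
  stopifnot(all(snp %in% 1:3))
  cbind(dominant  = as.numeric(snp >= 2),  # heterozygous or homozygous variant
        recessive = as.numeric(snp == 3))  # homozygous variant only
}

snp <- c(1, 2, 3, 2, 1)  # made-up genotypes
dummies <- snp_to_dummies(snp)
dummies
```

A SNP coded by the number of minor alleles (0, 1, 2) could be handled the same way after adding 1 to every value.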

...

for the formula method, optional parameters to be passed to the low level function logicFS.default. Otherwise, ignored.

Value

primes: the prime implicants,
vim: the importance of the prime implicants,
prop: the proportion of logic regression models that contain the prime implicants,
type: the type of model (1: classification, 2: linear regression, 3: logistic regression),
param: further parameters (if addInfo = TRUE),
mat.imp: the matrix containing the improvements if addMatImp = TRUE, otherwise, NULL,
measure: the name of the used importance measure,
useN: the value of useN,
threshold: NULL,
mu: NULL.
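The following sketch shows how the components listed above might be inspected after a fit. The data are simulated for illustration only, the argument values are arbitrary choices, and the call is skipped if the logicFS package is not installed.

```r
set.seed(123)
n <- 200

# Simulated binary predictors (illustrative data, 10 variables)
x <- matrix(rbinom(n * 10, 1, 0.5), nrow = n,
            dimnames = list(NULL, paste0("X", 1:10)))

# Binary response driven by an interaction of X1 and X2, and by X3
y <- as.numeric((x[, 1] == 1 & x[, 2] == 0) | x[, 3] == 1)

if (requireNamespace("logicFS", quietly = TRUE)) {
  library(logicFS)
  fit <- logicFS(x, y, B = 10, nleaves = 8, rand = 123)
  print(fit$primes)  # identified prime implicants
  print(fit$vim)     # their importances
  print(fit$prop)    # proportion of models containing each prime implicant
}
```

Setting rand makes the run reproducible, which is useful when comparing importances across different settings of B or nleaves.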

References

Ruczinski, I., Kooperberg, C., LeBlanc, M.L. (2003). Logic Regression. Journal of Computational and Graphical Statistics, 12, 475-511.

Schwender, H., Ickstadt, K. (2007). Identification of SNP Interactions Using Logic Regression. Biostatistics, 9(1), 187-198.
