logreg.myown: Writing your own Logic Regression scoring function

Description

Help file for writing your own scoring function for logreg!

Usage

logreg.myown()

Arguments

Details

You can write your own scoring function for logreg! This may be useful if you have a model other than those which we already programmed in.

Essentially you need to provide two routines in the file My\_own\_scoring.f: (i) A routine My\_own\_fitting which fits your model: it provides a coefficient (beta) for each of the logic trees and provides a score of how good the model is. Low scores are good. (So add a minus sign if your score is a log-likelihood.) (ii) A routine My\_own\_scoring which - given the betas - provides the score of your model. [If you don't use cross-validation, this second routine is not needed, though some dummy routine to satisfy the compiler should still be provided.]

After recompilation, you can fit your model using the option type = 0 in logreg. Below we give an example for a version of the My.own functions for conditional logistic regression which are also provided as inst/condlogic.ff when you downloaded the files.

PROGRAMMING DETAILS

Below is a list of variables that are passed on. Most of them are as you expect - response, predictors (binary ones and continuous ones), number of cases, number of predictors. In addition there are two columns - dcph and weight - that can either be used to pass on an auxiliary variable for each case (discrete for dcph and continuous for weight), or even some overall auxiliary variables - as these numbers are not used anywhere else. If you do not need any of the variables - just ignore them!

prtr: the predictions of the logic trees in the current model: this is an integer matrix of size n1 times ntr - although only the first nop columns contain useful information.

rsp: the response variable: this is a real (single precision) vector of length n1.

dcph: censor times: this is an integer vector of length n1 this could be used as an auxiliary (integer) vector - as it is just passed on. (There is no check that this is a 0/1 variable, when you use your own scoring function.) For example, you could use this to pass on something like cluster membership.

weight: weights for the cases this is a real vector of length n1. this could be used as an auxiliary (real) vector - as it is just passed on. There is no check that these numbers are positive, when you choose your own scoring function.

ordrs: the order (by response size) of the cases This is an integer vector of length n1. For the case with the smallest response this one is 1, for the second smallest 2, and so on. Ties are resolved arbitrary. Always computed, although only used for proportional hazards models. Use it as you wish.

n1: the total number of cases in the data.

ntr: the number of logic trees ALLOWED in the tree.

nop: the number of logic trees in the CURRENT model. The subroutines should work if nop is 0.

wh: the index of the tree that has been edited in the last move - i.e. the column of prtr that has changes since the last call.

nsep: number of variables that get fit a separate parameter The subroutines should work if nsep is 0.

seps: array of the above variables - this is a single precision matrix of size nsep times n1. Note that seps and prtr are stored in different directions.

For My\_own\_fitting you should return: betas: a vector of parameters of the model that you fit. betas(0) should be the parameter for the intercept betas(1:nsep) should be the parameters for the continuous variables in seps betas((nsep+1):(nsep+nop)) should be the parameters for the binary trees in prtr if you have more parameters, use dcph, or weight; these variables will not be printed however.

score: whatever score you assign to your model small should be good (i.e. deviance or -log.likelihood).

reject: an indicator whether or not to reject the proposed move *regardless* of the score (for example when an iteration necessary to determine the score failed to converge (0 = move is OK ; 1 = reject move) set this one to 0 if there is no such condition.

You are allowed to change the values of dcph, and weight.

For My\_own\_scoring additional input is: betas: the coefficients

You should return: score: whatever score you assign to your model small should be good (i.e. deviance or -log.likelihood). If the model "crashes", you should simply return a very large number.

While we try to prevent that models are singular, it is possible that for your model a single or degenerate model is passed on for evaluation. For My\_own\_fitting you can pass the model back with reject = 1, for My\_own\_scoring you can pass it on with a very large value for score. Currently My\_own\_scoring.f contains empty frames for the scoring functions; condlogic.ff contains an example with conditional logistic regression.

The logic regression program is written in Fortran 77.

CONDITIONAL LOGISTIC REGRESSION

A function for a conditional logistic regression score function is attached as an example function on how to write your own scoring function for Logic Regression. Obviously you can also use it if you have conditional logistic data.

Conditional logistic regression is common model fitting technique for matched case-control studies, in which each case is matched with one or more controls. (In conditional logistic regression several cases could be matched to several controls, in the implementation provided here only one case can be matched with each group of controls.) Conditional logistic regression models are parameterized like regular logistic regression models, except that the intercept is no longer identifiable. See, for example, Breslow and Day - Volume 1 (1990, Statistical Methods in Cancer Research, International Agency for Research on Cancer, Lyon) for details. Conditional logistic regression models are most easily fit using a stratified proportional hazards model (if there is one-to-one case-control matching it can also be fit using logistic regression, but that method breaks down if there is more than one control per case). Each group of a case and controls is one stratum. All cases get an arbitrarily event time of 1.00, and all controls get a censoring time of 2.00.

In our implementation we use the response column to indicate the matching. For all controls this column is 0, for a case it is k, indicating that the next k records are the matched controls for the current case. Thus, we order our cases so that each case is followed by its controls. Cases with a negative response are put in a stratum -1, which is not used in any computations. This has implications for cross-validation. See below.

In My\_own\_fitting and My\_own\_scoring we first allocate various vectors (strata, index, censoring variable) that are local, as well as some work arrays that are used by our fitting routines. (We need to set some of the parameters for that, see the help page of logreg for details.) We then define idx(j)=j for j=1,n1, and we define the strata and delta vectors. We use slightly modified versions of the proportional hazards routines that are already used otherwise in the Logic Regression program, to include stratification. After the model is fitted, we assign minus the partial likelihood to score(1) and (for My\_own\_fitting) we pass on the betas.

Recompile after replacing My\_own\_scoring.f by condlogic.ff

The permutation and null model versions are not directly usable (we could do some permutation tests, but they require more programming), but we can use cross-validation. Obviously we should keep cases and controls match. To that extend, we would run permutation with a negative seed (see logreg) and we would take care ourselves that case-control groups are in a random order, and that every block has the same number of records. We achieve the later by adding some records with response -1. In particular, suppose that we have 19 pairs of case- (single) control data, and that we want to do 3-fold cross validation. We would permute the sequence of the 19 pairs, and add two records with response -1 after the 13th pair, and two records with -1 at the end of the file, so that the total data file would have 42 records.

References

Ruczinski I, Kooperberg C, LeBlanc ML (2003). Logic Regression, Journal of Computational and Graphical Statistics, 12, 475-511.

Ruczinski I, Kooperberg C, LeBlanc ML (2002). Logic Regression - methods and software. Proceedings of the MSRI workshop on Nonlinear Estimation and Classification (Eds: D. Denison, M. Hansen, C. Holmes, B. Mallick, B. Yu), Springer: New York, 333-344.

Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L (2001). Sequence Analysis using Logic Regression, Genetic Epidemiology, 21, S626-S631.

Selected chapters from the dissertation of Ingo Ruczinski, available from http://kooperberg.fhcrc.org/logic/documents/ingophd-logic.pdf

Examples

Run this code

# NOT RUN {
logreg.myown()      # displays this help file
help(logreg.myown)  # equivalent
# }

Run the code above in your browser using DataLab