profRegr: Profile Regression

Description

Fit a profile regression model.

Usage

profRegr(covNames, fixedEffectsNames, outcome="outcome", 
        outcomeT=NA, data, output="output", hyper, predict, 
        nSweeps=1000, nBurn=1000, nProgress=500, nFilter=1, 
        nClusInit, seed, yModel="Bernoulli", xModel="Discrete", 
        sampler="SliceDependent", alpha=-2, dPitmanYor=0, excludeY=FALSE, 
        extraYVar=FALSE, varSelectType="None", entropy, reportBurnIn=FALSE,
        run=TRUE, discreteCovs, continuousCovs, whichLabelSwitch="123")

Arguments

covNames

A vector of strings giving the covariate names, as given by the column names in the data argument.

fixedEffectsNames

A vector of strings giving the fixed effect names, as given by the column names in the data argument. Each fixed effect must be of class 'numeric'. If a fixed effect is of class 'character', an error message will appear and the fixed effect will need to be recoded a
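The numeric requirement above can be met in several ways. The following is a minimal sketch of one of them using base R only; the data frame and the column name region are hypothetical, and indicator coding is an assumption, not necessarily the recoding the package authors have in mind.

```r
# Hypothetical data: 'region' is a character column intended as a fixed effect.
dat <- data.frame(
  outcome = c(0, 1, 1, 0),
  region  = c("north", "south", "south", "east"),
  stringsAsFactors = FALSE
)

# model.matrix() builds 0/1 indicator columns with the first level ("east")
# as reference; [, -1] drops the intercept column.
indicators <- model.matrix(~ region, data = dat)[, -1, drop = FALSE]
dat <- cbind(dat[, "outcome", drop = FALSE], as.data.frame(indicators))

# These names could then be passed as fixedEffectsNames.
fixedEffectsNames <- colnames(indicators)
stopifnot(all(sapply(dat[fixedEffectsNames], is.numeric)))
```

Indicator coding keeps the categories unordered, whereas as.numeric(factor(x)) would impose an arbitrary ordering on them.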

outcome

The name of the column of the data argument that contains the outcome. The outcome cannot have missing values; consider instead predicting the value of the outcome for those subjects for which it has not been observed.

outcomeT

The name of the column of the data argument that contains the offset (for Poisson outcome) or the number of trials (for Binomial outcome).

data

A data frame which has as columns the outcome, the covariates, the fixed effects if any and the offset (for Poisson outcome) or the number of trials (for Binomial outcome) or censoring (for Survival outcome). The outcome cannot have missing values - you c

output

Path to folder to save all output files. The covariates can have missing values, which must be coded as 'NA'. There cannot be missing values in the fixed effects - if there are, use an imputation method before using profile regression.
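The missing-value rules above can be checked before calling profRegr. This is a minimal sketch in base R; the data frame and the column names x1 and age are hypothetical.

```r
# Hypothetical input; column names are illustrative only.
dat <- data.frame(
  outcome = c(1, 0, 1),
  x1      = c(0, NA, 2),    # covariate: missing values are allowed, coded as NA
  age     = c(34, 51, 47)   # fixed effect: must be complete
)
covNames <- "x1"
fixedEffectsNames <- "age"

# Count missing values in each group of columns.
nMissingOutcome <- sum(is.na(dat$outcome))
nMissingFixed   <- sum(is.na(dat[, fixedEffectsNames, drop = FALSE]))
nMissingCov     <- sum(is.na(dat[, covNames, drop = FALSE]))

if (nMissingOutcome > 0) stop("outcome has missing values")
if (nMissingFixed > 0) stop("fixed effects have missing values; impute first")
message(nMissingCov, " missing covariate value(s), which is allowed")
```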

hyper

Object of type setHyperparams with hyperparameter specifications. This is optional; default values are provided for all hyperparameters. See ?setHyperparams for details.

predict

Data frame containing the predictive scenarios. This is only required if predictions are requested.

At each iteration the predictive subjects are assigned to one of the current clusters according to their covariate profiles (ignoring missing values).

nSweeps

Number of iterations of the MCMC after the burn-in period. By default this is 1000.

nBurn

Number of initial iterations of the MCMC to be discarded. By default this is 1000.

reportBurnIn

If TRUE, the burn-in iterations are reported in the output files; if FALSE, they are not. It is set to FALSE by default.

nProgress

The interval (in sweeps) at which to print a progress update. By default this is 500.

nFilter

The frequency (in sweeps) with which to write the output to file. The default value is 1.
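As a rough guide to how nSweeps, nBurn, reportBurnIn and nFilter interact, the sketch below estimates how many sweeps end up in each trace file. It assumes output is written on every nFilter-th sweep, which is an interpretation of the description above and not a statement about the package internals.

```r
nSweeps      <- 1000
nBurn        <- 1000
nFilter      <- 10
reportBurnIn <- FALSE

# Sweeps eligible for output: burn-in is included only if reportBurnIn is TRUE.
sweepsWritten <- if (reportBurnIn) nBurn + nSweeps else nSweeps

# Approximate number of stored samples per trace file (assumption, see above).
approxRows <- sweepsWritten %/% nFilter
```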

nClusInit

The number of clusters to which individuals are initially randomly assigned. If not supplied, the number is drawn from Unif[50,60].

seed

The value for the seed for the random number generator. The default value is the current time.

yModel

The model type for the outcome variable. The options currently available are "Bernoulli", "Poisson", "Binomial", "Categorical", "Normal" and "Survival". The default value is "Bernoulli".

xModel

The model type for the covariates. The options currently available are "Discrete", "Normal" and "Mixed". The default value is "Discrete".

sampler

The sampler type to be used. Options are "SliceDependent", "SliceIndependent" and "Truncated". The default value is "SliceDependent".

alpha

The value to be used if alpha is fixed. If a value smaller than or equal to -1 is used then alpha is random, if dPitmanYor is equal to zero (the random alpha option is available for Dirichlet process prior only). The default value is -2 (random alpha). Fo

dPitmanYor

The discount parameter for the Pitman-Yor process prior. The default value is 0, which is equivalent to a Dirichlet process prior. This parameter must belong to the interval [0,1) and it must be provided together with a non-negative value for alpha. The P
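The rules stated for alpha and dPitmanYor can be summarised in a small check. The function describePrior below is hypothetical and not part of the package; it encodes only the conditions described above and treats the combinations the text does not cover as errors, which is an assumption.

```r
# Hypothetical helper: reports which prior a given (alpha, dPitmanYor) pair
# selects according to the argument descriptions.
describePrior <- function(alpha = -2, dPitmanYor = 0) {
  if (dPitmanYor < 0 || dPitmanYor >= 1) {
    stop("dPitmanYor must lie in [0, 1)")
  }
  if (dPitmanYor == 0) {
    if (alpha <= -1) return("Dirichlet process, random alpha")
    if (alpha > 0) return("Dirichlet process, fixed alpha")
    # Values in (-1, 0] are not covered by the description (assumption: reject).
    stop("alpha in (-1, 0] is not covered by the documentation")
  }
  if (alpha < 0) {
    stop("a non-zero dPitmanYor requires a non-negative alpha")
  }
  "Pitman-Yor process, fixed alpha"
}

describePrior()            # the defaults select a random alpha
describePrior(1, 0.5)      # fixed alpha with a Pitman-Yor discount
```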

excludeY

If TRUE only the covariate data X is modelled. By default this is set to FALSE.

extraYVar

If set equal to TRUE extra Gaussian variance is included in the response model. This option is available only for Bernoulli, Binomial and Poisson response. By default the extra Gaussian variance is not included, so extraYVar=FALSE.

varSelectType

The type of variable selection to be used "None", "BinaryCluster" or "Continuous". The "BinaryCluster" variable selection is the implementation of the novel variable selection formulation proposed by Papathomas, Molitor, Hoggart, Hastie, Richardson (2012

entropy

If included then we compute allocation entropy. By default the allocation entropy is not included.

run

Logical. If TRUE then the MCMC is run. Set run=FALSE if the MCMC has been run already and it is only required to collect information about the run.

discreteCovs

The names of the discrete covariates among the covariate names, if xModel="Mixed". This and continuousCovs must be defined if xModel="Mixed", while covNames is ignored.

continuousCovs

The names of the continuous covariates among the covariate names, if xModel="Mixed". This and discreteCovs must be defined if xModel="Mixed", while covNames is ignored.
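When xModel="Mixed", the two name vectors have to be built by the user. A minimal sketch of one way to do this in base R follows; the data frame is hypothetical, and the rule that integer columns are discrete and all others continuous is an assumption that will not suit every dataset.

```r
# Hypothetical covariate data.
dat <- data.frame(
  smoker = c(0L, 1L, 1L),         # integer codes: treated here as discrete
  bmi    = c(22.5, 27.1, 30.4),   # double: treated here as continuous
  group  = c(2L, 0L, 1L)
)

# Assumed rule: integer columns are discrete, everything else is continuous.
isDiscrete     <- vapply(dat, is.integer, logical(1))
discreteCovs   <- names(dat)[isDiscrete]
continuousCovs <- names(dat)[!isDiscrete]
```

The resulting vectors can be passed as the discreteCovs and continuousCovs arguments.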

whichLabelSwitch

The label switching moves to run. The options available are moves 1, 2 and 3 ("123"), moves 1 and 2 ("12") and move 3 only ("3"). The moves are described in Hastie et al. (2013). Note that the third label switching move is only available for Dirichlet pro

Value

Once the C++ code has completed, the output from fitting the regression is stored in a number of text files in the directory specified. Files are produced containing the MCMC traces for all of the values of interest, along with a log file and files for monitoring the acceptance rates of the adaptive Metropolis-Hastings moves.
The function writes a number of files to the output directory and returns a list with the following elements. This is an object of type runInfoObj.
directoryPath: String. Directory path of the output files.
fileStem: String. The
inputFileName: String. Location and file name of input dataset as created by this function for the C++ routines.
nSweeps: Integer. The number of sweeps of the MCMC after the burn-in.
nBurn: Integer. The number of iterations in the burn-in period of the MCMC.
reportBurnIn: Logical. Whether the output of the burn-in report should be included.
nFilter: Integer. The frequency (in sweeps) with which to write the output to file.
nProgress: The number of sweeps at which to print a progress update.
nSubjects: Integer. The number of subjects.
nPredictSubjects: Integer. The number of subjects for which to run predictions.
fullPredictFile: Logical. It is FALSE by default. It is TRUE if the outcome, or the outcome and the fixed effects, were included in the data frame provided in the input predict. If TRUE, the function will have produced a file ending in "_predictFull.txt" which contains the values of the outcome and fixed effects for the computation of measures of fit in the function calcPredictions.
covNames: A vector of strings with the names of the covariates.
xModel: String. The model type for the covariates.
includeResponse: Logical. If FALSE only the covariate data X is modelled.
yModel: String. The model type for the outcome.
varSelect: Logical. If FALSE no variable selection is performed.
varSelectType: String. It specifies what type of variable selection has been performed, if any.
nCovariates: Integer. The number of covariates.
nFixedEffects: Integer. The number of fixed effects.
nCategoriesY: Integer. The number of categories of the outcome, if yModel = "Categorical". It is 1 otherwise.
nCategories: Vector of integers. The number of categories of each covariate, if xModel = "Discrete". It is 1 otherwise.
extraYVar: TRUE if extra Gaussian variance is included in the response model.
xMat: A matrix of the covariate data.
yMat: A matrix of the outcome data, including the offset if the outcome is Poisson, the number of trials if the outcome is Binomial and 0 or 1 for Survival outcome (1 for censored individuals, 0 otherwise).
wMat: A matrix of the fixed effect data.
whichLabelSwitch: The label switching moves that have been run. The options available are moves 1, 2 and 3 ("123"), moves 1 and 2 ("12") and move 3 only ("3"). The moves are described in Hastie et al. (2013).

Authors

David Hastie, Department of Epidemiology and Biostatistics, Imperial College London, UK

Silvia Liverani, Department of Epidemiology and Biostatistics, Imperial College London and MRC Biostatistics Unit, Cambridge, UK

and a contribution for mixed covariates by Lamiae Azizi, MRC Biostatistics Unit, Cambridge, UK

Maintainer: Silvia Liverani

References

Liverani, S., Hastie, D. I., Azizi, L., Papathomas, M. and Richardson, S. (2013) PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. Submitted. Available at http://uk.arxiv.org/abs/1303.2836

Hastie, D. I., Liverani, S. and Richardson, S. (2014) Sampling from Dirichlet process mixture models with unknown concentration parameter: Mixing issues in large data implementations. Submitted. Available at http://uk.arxiv.org/abs/1304.1778

Examples

library(PReMiuM)

# example for Poisson outcome and Discrete covariates
inputs <- generateSampleDataFile(clusSummaryPoissonDiscrete())
runInfoObj <- profRegr(yModel=inputs$yModel, 
    xModel=inputs$xModel, nSweeps=10, nClusInit=20,
    nBurn=20, data=inputs$inputData, output="output", 
    covNames = inputs$covNames, outcomeT = inputs$outcomeT,
    fixedEffectsNames = inputs$fixedEffectNames)


# example with Bernoulli outcome and Mixed covariates
inputs <- generateSampleDataFile(clusSummaryBernoulliMixed())
runInfoObj <- profRegr(yModel=inputs$yModel, 
    xModel=inputs$xModel, nSweeps=10, nClusInit=15,
    nBurn=20, data=inputs$inputData, output="output", 
    discreteCovs = inputs$discreteCovs,
    continuousCovs = inputs$continuousCovs)