penmodelEM: EM algorithm for estimating the penetrance model with missing genotypes

Description

Fits a penetrance model for family data with missing genotypes via the EM algorithm and provides model parameter estimates.

Usage

penmodelEM(formula, cluster = "famID", gvar = "mgene", parms, cuts = NULL, data, 
design = "pop", base.dist = "Weibull", agemin = NULL, robust = FALSE, method = "data", 
mode = "dominant", q = 0.02)

Arguments

formula

A formula expression as for other regression models. The response should be a survival object as returned by the Surv function. See the documentation for Surv, lm and formula for details.

cluster

Name of cluster variable. Default is "famID".

gvar

Name of genetic variable. Default is "mgene".

parms

Vector of initial values for the parameters in the model including baseline parameters and regression coefficients. parms = c(baseparm, coef), where baseparm includes the initial values for baseline parameters used for base.dist, and coef includes the initial values for regression coefficients for the variables specified in formula. See details for the baseline parameters.

cuts

Vector of cuts that define the intervals where the hazard function is constant. The cuts should be specified base.dist="piecewise" and must be strictly positive and finite. Default is NULL.

data

Data frame generated from simfam or data frame containing specific variables: famID, indID, gender, currentage, mgene, time, status and weight with attr(data,"agemin") specified.

design

Study design of the family data. Possible choices are: "pop", "pop+", "cli", "cli+" or "twostage", where "pop" is for the population-based design with affected probands whose mutation status can be either carrier or non-carrier, "pop+" is similar to "pop" but with mutation carrier probands, "cli" is for the clinic-based design that includes affected probands with at least one parent and one sib affected, "cli+" is similar to "cli" but with mutation carrier probands, and "twostage" is for the two-stage design with oversampling of high risks families. Default is "pop".

base.dist

Choice of baseline hazard distributions to fit. Possible choices are: "Weibull", "loglogistic", "Gompertz", "lognormal", "gamma", "logBurr", or "piecewise". Default is "Weibull".

agemin

Minimum age of disease onset or minimum age. Default is NULL.

robust

Logical; if TRUE, the robust `sandwich' standard errors and variance-covariance matrix are provided, otherwise the conventional standard errors and variance-covariance matrix are provided.

method

Choice of methods for calculating the carrier probabilities for individuals with missing mutation status. Possible choices are "data" for empirical calculation of the carrier probabilities based on the observed carriers' statuses in the entire sample, specific to generation and proband's mutation status or "mendelian" for calculating carrier probabilities based on Mendelian transmission probabilies with the given allele frequency and mutation statuses observed in the family. Default is "data".

If method = "mendelian", specify both mode of the inheritance and the allele frequency q.

mode

Choice of modes of inheritance for calculating carrier probabilies for individuals with missing mutation status. Possible choices are "dominant" for dominant model or "recessive" for recessive model. Default is "dominant".

Frequency of the disease causing allele used for calculating carrier pobabilities. The value should be between 0 and 1. If NULL, the estimated allele frequency from data will be used. Default value is 0.02.

Value

Returns an object of class 'penmodel', including the following elements:

estimates

Parameter estimates of transformed baseline parameters and regression coefficients.

varcov

Variance-covariance matrix of parameter estimates obtained from the inverse of Hessian matrix.

varcov.robust

Robust `sandwich' variance-covariance matrix of parameter estimates when robust=TRUE.

Standard errors of parameter estimates obtained from the inverse of Hessian matrix.

se.robust

Robust `sandwich' standard errors of parameter estimates when robust=TRUE.

logLik

Loglikelihood value for the fitted penetrance model.

AIC

Akaike information criterion (AIC) value of the model; AIC = 2*k - 2*logLik, where k is the number of parameters used in the model.

Details

The expectation and maximization (EM) algorithm is applied for making inference about the missing genotypes. In the expectation step, for individuals with unknown carrier status, we first compute their carrier probabilities given their family's observed phenotype and genotype information based on current estimates of parameters as follows,

where represents the mutation carrier status and represents the phenotype in terms of age at onset and disease status for individual in family \(f, f = 1, ..., n,\) and represents the observed genotypes in family \(f\).

Then, we obtain the conditional expectation of the log-likelihood function (\(\ell\)) of the complete data given the observed data as a weighted log-likelihood, which has the form

In the maximization step, the updated parameter estimates are obtained by maximizing the weighted log likelihood computed in the E-step. These expectation and maximization steps iterate until convergence to obtain the maximum likelihood estimates. See more details in Choi and Briollais (2011) or Choi et al. (2014).

Note that the baseline parameters include lambda and rho, which represent the scale and shape parameters, respectively, and eta, additional parameter to specify for "logBurr" distribution. For the "lognormal" baseline distribution, lambda and rho represent the location and scale parameters for the normally distributed logarithm, where lambda can take any real values and rho > 0. For the other baselinse distributions, lambda > 0, rho > 0, and eta > 0. When a piecewise constant distribution is specified for the baseline hazards, base.dist="piecewise", baseparm should specify the initial interval-constant values, one more than the cut points specified bycuts.

Transformed baseline parameters are used for estimation; log transformation is applied to both scale and shape parameters (\(\lambda, \rho\)) for "Weibull", "loglogistic", "Gompertz" and "gamma" baselines, to (\(\lambda, \rho, \eta\)) for "logBurr" and to the piecewise constant parameters for a piecewise baseline hazard. For "lognormal" baseline distribution, the log transformation is applied only to \(\rho\), not to \(\lambda\), which represents the location parameter for the normally distributed logarithm.

Calculations of penetrance estimates and their standard errors and 95% confidence intervals at given ages can be obtained by penetrance function via Monte-Carlo simulations of the estimated penetrance model.

References

Choi, Y.-H., Briollais, L., He, W. and Kopciuk, K. (2021) FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs, Journal of Statistical Software 97 (7), 1-30. doi:10.18637/jss.v097.i07

Choi, Y.-H. and Briollais, L. (2011) An EM composite likelihood approach for multistage sampling of family data with missing genetic covariates, Statistica Sinica 21, 231-253.

Choi, Y.-H., Briollais, L., Green, J., Parfrey, P., and Kopciuk, K. (2014) Estimating successive cancer risks in Lynch Syndrome families using a progressive three-state model, Statistics in Medicine 33, 618-638.

Examples

Run this code

# NOT RUN {
# Family data simulated with 20% of members missing their genetic information.

set.seed(4321)
fam <- simfam(N.fam = 100, design = "pop+", base.dist = "Weibull", base.parms = c(0.01, 3),
       vbeta = c(1, 2), agemin = 20, allelefreq = 0.02, mrate = 0.2)
 
# EM algorithm for fitting family data with missing genotypes

fit <- penmodelEM(Surv(time, status) ~ gender + mgene, cluster = "famID", gvar = "mgene", 
       parms = c(0.01, 3, 1, 2), data = fam, design="pop+", robust = TRUE, 
       base.dist = "Weibull", method = "mendelian", mode = "dominant", q = 0.02)

# Summary of the model parameter estimates from the model fit by penmodelEM 

summary(fit)

# Plot the lifetime penetrance curves from model fit for gender and 
# mutation status groups along with their nonparametric penetrance curves 
# based on observed data excluding probands.
 
plot(fit)

# }

Run the code above in your browser using DataLab