Fits a penetrance model for family data with missing genotypes via the EM algorithm and provides model parameter estimates.
penmodelEM(formula, cluster = "famID", gvar = "mgene", parms, cuts = NULL, data,
design = "pop", base.dist = "Weibull", agemin = NULL, robust = FALSE, method = "data",
mode = "dominant", q = 0.02)
A formula expression as for other regression models. The response should be a survival object as returned by the Surv
function. See the documentation for Surv
, lm
and formula
for details.
Name of cluster variable. Default is "famID"
.
Name of genetic variable. Default is "mgene"
.
Vector of initial values for the parameters in the model including baseline parameters and regression coefficients. parms = c(baseparm, coef)
, where baseparm
includes the initial values for baseline parameters used for base.dist
, and coef
includes the initial values for regression coefficients for the variables specified in formula
. See details for the baseline parameters.
Vector of cuts that define the intervals where the hazard function is constant. The cuts
should be specified base.dist="piecewise"
and must be strictly positive and finite. Default is NULL
.
Data frame generated from simfam
or data frame containing specific variables: famID
, indID
, gender
, currentage
, mgene
, time
, status
and weight
with attr(data,"agemin")
specified.
Study design of the family data. Possible choices are: "pop"
, "pop+"
, "cli"
, "cli+"
or "twostage"
, where "pop"
is for the population-based design with affected probands whose mutation status can be either carrier or non-carrier, "pop+"
is similar to "pop"
but with mutation carrier probands, "cli"
is for the clinic-based design that includes affected probands with at least one parent and one sib affected, "cli+"
is similar to "cli"
but with mutation carrier probands, and "twostage"
is for the two-stage design with oversampling of high risks families. Default is "pop"
.
Choice of baseline hazard distributions to fit. Possible choices are: "Weibull"
, "loglogistic"
, "Gompertz"
, "lognormal"
, "gamma"
, "logBurr"
, or "piecewise"
. Default is "Weibull"
.
Minimum age of disease onset or minimum age. Default is NULL
.
Logical; if TRUE
, the robust `sandwich' standard errors and variance-covariance matrix are provided, otherwise the conventional standard errors and variance-covariance matrix are provided.
Choice of methods for calculating the carrier probabilities for individuals with missing mutation status. Possible choices are "data"
for empirical calculation of the carrier probabilities based on the observed carriers' statuses in the entire sample, specific to generation and proband's mutation status or "mendelian"
for calculating carrier probabilities based on Mendelian transmission probabilies with the given allele frequency and mutation statuses observed in the family. Default is "data"
.
If method = "mendelian"
, specify both mode
of the inheritance and the allele frequency q
.
Choice of modes of inheritance for calculating carrier probabilies for individuals with missing mutation status. Possible choices are "dominant"
for dominant model or "recessive"
for recessive model. Default is "dominant"
.
Frequency of the disease causing allele used for calculating carrier pobabilities. The value should be between 0 and 1. If NULL
, the estimated allele frequency from data will be used. Default value is 0.02
.
Returns an object of class 'penmodel'
, including the following elements:
Parameter estimates of transformed baseline parameters and regression coefficients.
Variance-covariance matrix of parameter estimates obtained from the inverse of Hessian matrix.
Robust `sandwich' variance-covariance matrix of parameter estimates when robust=TRUE
.
Standard errors of parameter estimates obtained from the inverse of Hessian matrix.
Robust `sandwich' standard errors of parameter estimates when robust=TRUE
.
Loglikelihood value for the fitted penetrance model.
Akaike information criterion (AIC) value of the model; AIC = 2*k - 2*logLik, where k is the number of parameters used in the model.
The expectation and maximization (EM) algorithm is applied for making inference about the missing genotypes. In the expectation step, for individuals with unknown carrier status, we first compute their carrier probabilities given their family's observed phenotype and genotype information based on current estimates of parameters as follows,
where represents the mutation carrier status and represents the phenotype in terms of age at onset and disease status for individual in family \(f, f = 1, ..., n,\) and represents the observed genotypes in family \(f\).
Then, we obtain the conditional expectation of the log-likelihood function (\(\ell\)) of the complete data given the observed data as a weighted log-likelihood, which has the form
In the maximization step, the updated parameter estimates are obtained by maximizing the weighted log likelihood computed in the E-step. These expectation and maximization steps iterate until convergence to obtain the maximum likelihood estimates. See more details in Choi and Briollais (2011) or Choi et al. (2014).
Note that the baseline parameters include lambda
and rho
, which represent the scale and shape parameters, respectively, and eta
, additional parameter to specify for "logBurr"
distribution. For the "lognormal"
baseline distribution, lambda
and rho
represent the location and scale parameters for the normally distributed logarithm, where lambda
can take any real values and rho
> 0. For the other baselinse distributions, lambda
> 0, rho
> 0, and eta
> 0. When a piecewise constant distribution is specified for the baseline hazards, base.dist="piecewise"
, baseparm
should specify the initial interval-constant values, one more than the cut points specified bycuts
.
Transformed baseline parameters are used for estimation; log transformation is applied to both scale and shape parameters (\(\lambda, \rho\)) for "Weibull"
, "loglogistic"
, "Gompertz"
and "gamma"
baselines, to (\(\lambda, \rho, \eta\)) for "logBurr"
and to the piecewise constant parameters for a piecewise
baseline hazard. For "lognormal"
baseline distribution, the log transformation is applied only to \(\rho\), not to \(\lambda\), which represents the location parameter for the normally distributed logarithm.
Calculations of penetrance estimates and their standard errors and 95% confidence intervals at given ages can be obtained by penetrance
function via Monte-Carlo simulations of the estimated penetrance model.
Choi, Y.-H., Briollais, L., He, W. and Kopciuk, K. (2021) FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs, Journal of Statistical Software 97 (7), 1-30. doi:10.18637/jss.v097.i07
Choi, Y.-H. and Briollais, L. (2011) An EM composite likelihood approach for multistage sampling of family data with missing genetic covariates, Statistica Sinica 21, 231-253.
Choi, Y.-H., Briollais, L., Green, J., Parfrey, P., and Kopciuk, K. (2014) Estimating successive cancer risks in Lynch Syndrome families using a progressive three-state model, Statistics in Medicine 33, 618-638.
simfam
, penmodel
, print.penmodel
, summary.penmodel
,
print.summary.penmodel
, plot.penmodel
, carrierprob
# NOT RUN {
# Family data simulated with 20% of members missing their genetic information.
set.seed(4321)
fam <- simfam(N.fam = 100, design = "pop+", base.dist = "Weibull", base.parms = c(0.01, 3),
vbeta = c(1, 2), agemin = 20, allelefreq = 0.02, mrate = 0.2)
# EM algorithm for fitting family data with missing genotypes
fit <- penmodelEM(Surv(time, status) ~ gender + mgene, cluster = "famID", gvar = "mgene",
parms = c(0.01, 3, 1, 2), data = fam, design="pop+", robust = TRUE,
base.dist = "Weibull", method = "mendelian", mode = "dominant", q = 0.02)
# Summary of the model parameter estimates from the model fit by penmodelEM
summary(fit)
# Plot the lifetime penetrance curves from model fit for gender and
# mutation status groups along with their nonparametric penetrance curves
# based on observed data excluding probands.
plot(fit)
# }
Run the code above in your browser using DataLab