average_paf: Calculation of average and sequential PAF taking into account risk factor sequencing

Description

Calculation of average and sequential PAF taking into account risk factor sequencing

Usage

average_paf(
  data,
  model_list,
  parent_list,
  node_vec,
  prev = NULL,
  exact = TRUE,
  nperm = NULL,
  correct_order = 2,
  riskfactor_vec = NULL,
  ci = FALSE,
  boot_rep = 50,
  ci_type = c("norm"),
  ci_level = 0.95,
  ci_level_ME = 0.95,
  weight_vec = NULL,
  verbose = TRUE
)

Value

A SAF_summary object with average joint and sequential PAF for all risk factors in node_vec (or alternatively a subset of those risk factors if specified in riskfactor_vec).

Arguments

data: Data frame. A dataframe containing the variables used for fitting the models. Must contain all variables used in fitting.
model_list: List. A list of fitted models corresponding to the outcome variables in node_vec, with parents as described in parent_list. This list must be in the same order as node_vec and parent_list. Models can be linear (lm), logistic (glm) or ordinal logistic (polr). Non-linear effects of variables (if necessary) should be specified via ns(x, df=y), where ns is the natural spline function from the splines library.
parent_list: A list. The ith element is the vector of variable names that are direct causes of the ith variable in node_vec. (Note that the variable names should be columns in data.)
node_vec: A character vector corresponding to the nodes in the Bayesian network. (The variable names should be column names in data.) This must be specified from root to leaves, that is, ancestors in the causal graph for a particular node are positioned before their descendants. If this condition is not met, the function will return an error.
prev: numeric. Prevalence of disease. Only relevant to set for case control datasets.
exact: logical. Default TRUE. If TRUE, an efficient calculation is used to calculate average PAF, which enables the average PAF from N! permutations, over all N risk factors, to be calculated with only 2^N-1 operations. If FALSE, permutations are sampled.
nperm: Default NULL. Number of random permutations used to calculate average and sequential PAF. If correct_order is set to an integer value, nperm is reset to an integer multiple of factorial(N)/factorial(N-correct_order) depending on the size of nperm. If nperm is NULL or less than factorial(N)/factorial(N-correct_order), factorial(N)/factorial(N-correct_order) permutations will be sampled. If nperm is larger than factorial(N)/factorial(N-correct_order), nperm will be reset to the smallest integer multiple of factorial(N)/factorial(N-correct_order) less than the input value of nperm.
correct_order: Default 2. This enforces stratified sampling of permutations where the first correct_order positions of the sampled permutations are evenly distributed over the integers 1 ... N, N being the number of risk factors of interest, over the sampled permutations. The other positions are randomly sampled. This automatically sets the number of simulations when nperm=NULL. For instance, if N=10 and correct_order=3, nperm is set to factorial(10)/factorial(10-3) = 720. This special resampling reduces Monte Carlo variation in estimated average and sequential PAFs.
riskfactor_vec: A subset of risk factors for which we want to calculate average, sequential and joint PAF
ci: Logical. If TRUE, a bootstrap confidence interval is computed along with a point estimate (default FALSE). If ci=FALSE, only a point estimate is produced. A simulation procedure (sampling permutations and also simulating the effects of eliminating risk factors over the descendant nodes in a Bayesian network) is required to produce the point estimates. The point estimate will change on repeated runs of the function. The margin of error of the point estimate is given when ci=FALSE
boot_rep: Integer. Number of bootstrap replications (Only necessary to specify if ci=TRUE). Note that at least 50 replicates are recommended to achieve stable estimates of standard error. In the examples below, values of boot_rep less than 50 are sometimes used to limit run time.
ci_type: Character. Default "norm". A vector specifying the types of confidence interval desired. "norm", "basic", "perc" and "bca" are the available methods.
ci_level: Numeric. Default 0.95. A number between 0 and 1 specifying the level of the confidence interval (when ci=TRUE)
ci_level_ME: Numeric. Default 0.95. A number between 0 and 1 specifying the level of the margin of error for the point estimate (only relevant when ci=FALSE and exact=FALSE)
weight_vec: An optional vector of inverse sampling weights (note with survey data, the variance may not be calculated correctly if sampling isn't independent). Note that this vector will be ignored if prev is specified, and the weights will be calibrated so that the weighted sample prevalence of disease equals prev. This argument can be ignored if data has a column weights with correctly calibrated weights
verbose: A logical indicator for whether extended output is produced when ci=TRUE, default TRUE
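
As a rough illustration of the nperm and correct_order rules described above, the following base R sketch computes the number of permutations that would be sampled. The helper resolve_nperm is hypothetical and not part of the package. It reads the final rule as rounding nperm down to a whole multiple of factorial(N)/factorial(N-correct_order), which is an assumption about the intended behaviour rather than a statement of what the package does internally. The last lines compare the 2^N-1 operations mentioned for exact=TRUE with the N! permutations they replace.

```r
# Hypothetical helper (not part of the package): resolve the number of
# sampled permutations from N, correct_order and a requested nperm.
resolve_nperm <- function(N, correct_order, nperm = NULL) {
  # Number of distinct ways to fill the first correct_order positions
  base <- factorial(N) / factorial(N - correct_order)
  if (is.null(nperm) || nperm < base) {
    return(base)
  }
  # Assumed reading of the documentation: round down to a multiple of base
  base * floor(nperm / base)
}

resolve_nperm(10, 3)                # nperm = NULL: 720, as in the text
resolve_nperm(10, 3, nperm = 100)   # below 720, so 720 is used
resolve_nperm(10, 3, nperm = 2000)  # rounded down to 2 * 720 = 1440

# Cost comparison for exact = TRUE with N = 7 risk factors
N <- 7
c(exact_operations = 2^N - 1, permutations = factorial(N))  # 127 vs 5040
```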

References

Ferguson, J., O’Connell, M. and O’Donnell, M., 2020. Revisiting sequential attributable fractions. Archives of Public Health, 78(1), pp.1-9.

Ferguson, J., Alvarez-Iglesias, A., Newell, J., Hinde, J. and O’Donnell, M., 2018. Estimating average attributable fractions with confidence intervals for cohort and case–control studies. Statistical Methods in Medical Research, 27(4), pp.1141-1152.

Examples

library(splines)
library(survival)
library(parallel)
options(boot.parallel="snow")
options(boot.ncpus=2)
# The above could be set to the number of available cores on the machine
#  Simulated data on occupational and environmental exposure to chronic cough from Eide, 1995
# First specify the causal graph, in terms of the parents of each node.  Then put into a list
parent_urban.rural <- c()
parent_smoking.category <- c("urban.rural")
parent_occupational.exposure <- c("urban.rural")
parent_y <- c("urban.rural","smoking.category","occupational.exposure")
parent_list <- list(parent_urban.rural, parent_smoking.category,
 parent_occupational.exposure, parent_y)
# also specify nodes of graph, in order from root to leaves
node_vec <- c("urban.rural","smoking.category","occupational.exposure", "y")
# specify a model list according to parent_list
# here we use the auxiliary function automatic_fit
model_list=automatic_fit(data=Hordaland_data, parent_list=parent_list,
 node_vec=node_vec, prev=.09)
# When exact=FALSE, the function works by stratified simulation of permutations and
# subsequent simulation of the incremental interventions on the distribution of risk
# factors.  The permutations are stratified so each factor appears equally often in
# the first correct_order positions.  correct_order has a default of 2.

# The data stored with each fitted model (model_list[[i]]$data) includes the
# fitting weights.  Passing data that contains this weights column is
# necessary when bootstrapping CIs.

out <- average_paf(data=model_list[[length(model_list)]]$data,
 model_list=model_list, parent_list=parent_list,
 node_vec=node_vec, prev=.09, nperm=10,
 riskfactor_vec = c("urban.rural", "occupational.exposure"), ci=FALSE)
print(out)

# \donttest{
# More complicated example (slower to run)
parent_exercise <- c("education")
parent_diet <- c("education")
parent_smoking <- c("education")
parent_alcohol <- c("education")
parent_stress <- c("education")
parent_high_blood_pressure <- c("education","exercise","diet","smoking",
"alcohol","stress")
parent_lipids <- c("education","exercise","diet","smoking","alcohol",
"stress")
parent_waist_hip_ratio <- c("education","exercise","diet","smoking",
"alcohol","stress")
parent_early_stage_heart_disease <- c("education","exercise","diet",
"smoking","alcohol","stress","lipids","waist_hip_ratio","high_blood_pressure")
parent_diabetes <- c("education","exercise","diet","smoking","alcohol",
"stress","lipids","waist_hip_ratio","high_blood_pressure")
parent_case <- c("education","exercise","diet","smoking","alcohol","stress",
"lipids","waist_hip_ratio","high_blood_pressure",
"early_stage_heart_disease","diabetes")
parent_list <- list(parent_exercise,parent_diet,parent_smoking,
parent_alcohol,parent_stress,parent_high_blood_pressure,
parent_lipids,parent_waist_hip_ratio,parent_early_stage_heart_disease,
parent_diabetes,parent_case)
node_vec=c("exercise","diet","smoking","alcohol","stress",
"high_blood_pressure","lipids","waist_hip_ratio","early_stage_heart_disease",
"diabetes","case")
model_list=automatic_fit(data=stroke_reduced, parent_list=parent_list,
 node_vec=node_vec, prev=.0035,common="region*ns(age,df=5)+sex*ns(age,df=5)",
  spline_nodes = c("waist_hip_ratio","lipids","diet"))
out <- average_paf(data=stroke_reduced, model_list=model_list,
parent_list=parent_list, node_vec=node_vec, prev=.0035,
riskfactor_vec = c("high_blood_pressure","smoking","stress","exercise","alcohol",
"diabetes","early_stage_heart_disease"),ci=TRUE,boot_rep=10)
print(out)
plot(out,max_PAF=0.5,min_PAF=-0.1,number_rows=3)
# plot sequential and average PAFs by risk factor
# similar calculation, but now sampling permutations (stratified, so
# that each risk factor will appear equally often in the first correct_order positions)
out <- average_paf(data=stroke_reduced, model_list=model_list,
parent_list=parent_list, node_vec=node_vec, prev=.0035, exact=FALSE,
 correct_order=2, riskfactor_vec = c("high_blood_pressure","smoking","stress",
 "exercise","alcohol","diabetes","early_stage_heart_disease"),ci=TRUE,
 boot_rep=10)
 print(out)
 plot(out,max_PAF=0.5,min_PAF=-0.1,number_rows=3)
# }
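
Since node_vec must be ordered from root to leaves, it may be useful to check the ordering before fitting. The sketch below uses only base R. The helper is_root_to_leaf is hypothetical and not part of the package, and it assumes that parents which do not appear in node_vec (such as "education" in the second example) can simply be skipped.

```r
# Hypothetical helper (not part of the package): returns TRUE if every parent
# that is itself a node appears earlier in node_vec than its child.
is_root_to_leaf <- function(node_vec, parent_list) {
  stopifnot(length(node_vec) == length(parent_list))
  for (i in seq_along(node_vec)) {
    parent_pos <- match(parent_list[[i]], node_vec)
    # Parents that are not nodes give NA and are skipped (assumption)
    parent_pos <- parent_pos[!is.na(parent_pos)]
    if (any(parent_pos >= i)) return(FALSE)
  }
  TRUE
}

# Causal graph from the first example above
node_vec <- c("urban.rural", "smoking.category", "occupational.exposure", "y")
parent_list <- list(c(), c("urban.rural"), c("urban.rural"),
  c("urban.rural", "smoking.category", "occupational.exposure"))

ok <- is_root_to_leaf(node_vec, parent_list)             # correct ordering
bad <- is_root_to_leaf(rev(node_vec), rev(parent_list))  # leaves listed first
print(c(ok = ok, bad = bad))
```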
