get_candidate_covariates: Generate candidate empirical baseline covariates based on prevalence in the baseline period

Description

The get_candidate_covariates function generates the list of candidate empirical covariates based on their prevalence within each domain (dimension). This is the first step in the automated covariate selection process. See the 'Automated Covariate Selection' section below for more details on the overall process.

Usage

get_candidate_covariates(
  df,
  domainVarname,
  eventCodeVarname,
  patientIdVarname,
  patientIdVector,
  n = 200,
  min_num_patients = 100
)

Arguments

df

The input data.frame. This should contain at least 3 fields, in long format, holding the patient identifier, the covariate codes and the domain names of the covariate codes. Any other fields, such as dates or treatment group, are optional and are ignored in this analysis.

domainVarname

The variable (field) name which contains the domain of the covariate in the df. The domains are usually diagnoses, procedures and medications.

eventCodeVarname

The variable name which contains the covariate codes (e.g., CCS, ICD-9) in the df

patientIdVarname

The variable name which contains the patient identifier in the df

patientIdVector

The 1-D vector containing all patient identifiers. Its length should equal the number of distinct patients in the df. The function does not use this vector in its analysis; it is returned unchanged as part of the output. This is needed because covars_data, the filtered df that is also returned, will likely not contain every patient in the input df: patients with no records for any of the identified covars are absent from it. Each step of automated covariate selection depends on the previous steps, and information on patients without any identified covars matters for the later steps, so the full vector is carried forward as both an input and an output.
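The following base R sketch illustrates why the full identifier vector has to be carried alongside the filtered data. The data frame and its values are made up for illustration and are not taken from the package.

```r
# Toy data: hypothetical identifiers and codes, for illustration only.
all_patients <- c("p1", "p2", "p3", "p4")   # plays the role of patientIdVector

# Stand-in for the filtered covars_data output: only patients who have
# at least one selected covariate appear here.
covars_data <- data.frame(
  person_id  = c("p1", "p1", "p3"),
  event_code = c("dx_19900", "rx_123", "dx_19900"),
  stringsAsFactors = FALSE
)

# Patients in the cohort who have no record for any selected covariate.
# They cannot be recovered from covars_data alone.
patients_without_covars <- setdiff(all_patients, unique(covars_data$person_id))
print(patients_without_covars)
```

Without the separate vector, later steps would silently treat the cohort as consisting only of the patients that remain in the filtered data.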

n

The maximum number of empirical candidate baseline covariates that should be returned within each domain. By default, n is 200

min_num_patients

The minimum number of patients in whom a covariate must occur during the baseline period for it to be considered for selection. By default, min_num_patients is 100
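As a rough illustration of how n and min_num_patients could interact, the base R sketch below counts distinct patients per code, drops codes below the threshold, and keeps the most prevalent codes in each domain. It is a simplified approximation under assumed column names and toy data, ranking by raw patient count; it is not the package's implementation, which may differ in detail.

```r
# Toy long-format data; column names and values are hypothetical.
df <- data.frame(
  person_id  = c("p1", "p2", "p3", "p1", "p2", "p1", "p1", "p2"),
  domain     = c("dx", "dx", "dx", "dx", "dx", "dx", "rx", "rx"),
  event_code = c("A",  "A",  "A",  "B",  "B",  "C",  "X",  "X"),
  stringsAsFactors = FALSE
)
n <- 1                  # keep at most this many codes per domain
min_num_patients <- 2   # a code must occur in at least this many patients

# Count distinct patients for each (domain, event_code) pair.
pairs  <- unique(df[, c("person_id", "domain", "event_code")])
counts <- aggregate(person_id ~ domain + event_code, data = pairs, FUN = length)
names(counts)[names(counts) == "person_id"] <- "num_patients"

# Drop codes that occur in too few patients.
counts <- counts[counts$num_patients >= min_num_patients, ]

# Within each domain, keep up to n codes with the highest patient counts.
top_per_domain <- lapply(split(counts, counts$domain), function(d) {
  d <- d[order(-d$num_patients), ]
  head(d, n)
})
candidates <- do.call(rbind, top_per_domain)
print(candidates)
```

In this toy data, code C is removed by the threshold and code B by the per-domain cap, leaving one candidate in each domain.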

Value

A named list containing three R objects:

covars A 1-D vector containing the names of the selected baseline covariates from each domain. For each domain in the df, the number of covars is less than or equal to n
covars_data The data.frame filtered from df to contain only the selected covars. The values of the eventCodeVarname field are prefixed with the corresponding domain name. For example, if the event code is 19900 and the domain is 'dx', then the covariate name will be 'dx_19900'.
patientIds The vector of patient ids present in the original input df. This is exactly the same as the input patientIdVector
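The naming convention described for covars_data can be reproduced with a one-line sketch. The underscore separator is inferred from the 'dx_19900' example above; the variable names are hypothetical.

```r
# Values mirror the example in the text; variable names are hypothetical.
domain     <- "dx"
event_code <- 19900

# Prefix the event code with its domain, separated by an underscore.
covariate_name <- paste(domain, event_code, sep = "_")
print(covariate_name)
```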

Automated Covariate Selection

The three steps in automated covariate selection are listed below, each with the function that implements it:

Identify candidate empirical covariates: get_candidate_covariates
Assess recurrence: get_recurrence_covariates
Prioritize covariates: get_prioritised_covariates

Details

The theoretical details of the high-dimensional propensity score (HDPS) algorithm are described in the publication listed in the References section below. get_candidate_covariates implements what is described in the 'Identify candidate empirical covariates' section of that article.

References

Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522. doi:10.1097/EDE.0b013e3181a663cc

Examples

# NOT RUN {
library("autoCovariateSelection")
library("dplyr")  # provides %>%, select() and distinct() used below
data(rwd)
head(rwd, 3)
# Select the fields that are constant for each patient (treatment and outcome)
basetable <- rwd %>%
  select(person_id, treatment, outcome_date) %>%
  distinct()
head(basetable, 3)
patientIds <- basetable$person_id
step1 <- get_candidate_covariates(
  df = rwd,
  domainVarname = "domain",
  eventCodeVarname = "event_code",
  patientIdVarname = "person_id",
  patientIdVector = patientIds,
  n = 100,
  min_num_patients = 10
)
out1 <- step1$covars_data  # input to the get_recurrence_covariates() function
# }