
starling (version 0.6.5)

murmuration: Links case, hospital or vaccination datasets

Description

Machine learning data linkage. The murmuration command will link diagnostic registry data (cases or linelist) to hospitalization and immunization records (e.g. Australian Immunization Register).

Usage

murmuration(
  df1,
  df2,
  linkage_type = "c2h",
  onset_date = NULL,
  event_date = NULL,
  id_var = id_var,
  blocking_var,
  compare_vars,
  threshold_value = 12,
  days_allowed_before_event = 7,
  days_allowed_after_event = 14,
  one_row_per_person = TRUE,
  clean_eggs = TRUE,
  days_between_onset_death = 30,
  last_follow_up = NULL
)

Value

A linked dataset with some new variables.

Arguments

df1

A data frame, cleaned using clean_the_nest, that usually serves as the base, or "x", dataset (when doing left joins). Typically this is a dataset of cases that has enough identifying data to create linkages and includes onset dates.

df2

A data frame, cleaned using clean_the_nest, that usually serves as the admissions or vaccination dataset (the "y" dataset when doing left joins). Typically this has enough identifying data to create linkages and includes either admission data or vaccination event data (e.g. Australian Immunization Register).

linkage_type

Character string giving the type of linkage. One of: "c2h" (default), to link cases to hospital admissions data; "v2c", to link cases to vaccination datasets; "v2h", to link hospitalizations to vaccination history (e.g. building a dataset for test-negative case-control studies), or to link a "v2c" dataset to a hospitalization dataset; "v2e", to link event participants (flight manifest, outbreak linelist) to vaccination history and determine vaccination status at the time of the event. When linking to a vaccination dataset, that dataset must have a single row per person. If you have multiple vaccines per person, first run it through the clean_the_nest command with the "lie_nest_flat" option set to TRUE.

onset_date

Variable name for onset date (used in c2h and v2c linkage types). Should be present in df1.

event_date

A date object (e.g., ymd('2024-12-15')) representing when the event occurred. Required for v2e linkage type. All valid vaccinations must occur before this date.

id_var

Variable name (e.g. "id"). This is critical for data linkage. The base dataset is the dataset you would left join onto (e.g. the "x" dataset). The id variable cannot have missing data, or the observation will be lost in the linking process.

blocking_var

Variable name (e.g. "block1"). Choice of blocking variable. You can create your own. Up to three blocking variables are created in the earlier cleaning step.
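
As a rough illustration of what a blocking variable does in record linkage generally, the base R sketch below keeps only candidate pairs that share the same block value. The toy data frames and their columns are made up for the example, and this is not a description of starling's internals.

```r
# Toy data; column names are illustrative, not produced by starling.
cases <- data.frame(
  id = 1:3,
  lettername1 = c("ann", "bob", "cy"),
  block1 = c("F", "M", "M"),
  stringsAsFactors = FALSE
)
vax <- data.frame(
  vax_id = 101:104,
  lettername1 = c("anne", "bob", "sy", "dee"),
  block1 = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Without blocking, every case would be compared with every vaccination record.
n_all_pairs <- nrow(cases) * nrow(vax)

# With blocking, only pairs that share the same block1 value are compared.
blocked_pairs <- merge(cases, vax, by = "block1", suffixes = c(".x", ".y"))
n_blocked_pairs <- nrow(blocked_pairs)

c(all = n_all_pairs, blocked = n_blocked_pairs)
```

A coarse block (such as gender) removes few comparisons, while a fine block (such as postcode) removes many but can miss true matches whose block value was recorded differently.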

compare_vars

Vector of variable names. These variables are compared between the two datasets to calculate the string score differences. Typically names, dates of birth and Medicare or social security numbers.

threshold_value

Numeric (e.g. 12), default is 12. This is the score threshold above which a candidate linkage is accepted as true. The higher the number, the higher the specificity of your linkages (compare_vars must match more exactly). The lower the threshold, the more sensitive you are to selecting matches, at the expense of specificity. The default of 12 was chosen arbitrarily.
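
To make the sensitivity and specificity trade-off concrete, here is a small base R sketch. The scoring scheme (up to 5 points per compared field, scaled by edit distance via utils::adist) and the helper names field_score and pair_score are hypothetical and chosen only for illustration; the page does not describe how starling computes its scores.

```r
# Hypothetical scoring: each compared field scores from 0 to 5 points,
# scaled by normalised edit distance. This is NOT starling's own scoring.
field_score <- function(a, b, max_points = 5) {
  d <- utils::adist(a, b)[1, 1]          # Levenshtein distance between strings
  longest <- max(nchar(a), nchar(b))
  max_points * (1 - d / longest)
}

# Total score for a candidate pair: sum of the per-field scores.
pair_score <- function(rec_x, rec_y, vars) {
  sum(vapply(vars, function(v) field_score(rec_x[[v]], rec_y[[v]]), numeric(1)))
}

vars <- c("lettername1", "lettername2", "dob")
case_rec  <- list(lettername1 = "jon",  lettername2 = "smith", dob = "1980-02-01")
close_rec <- list(lettername1 = "john", lettername2 = "smith", dob = "1980-02-01")
far_rec   <- list(lettername1 = "joan", lettername2 = "smyth", dob = "1981-03-01")

scores <- c(
  close = pair_score(case_rec, close_rec, vars),
  far   = pair_score(case_rec, far_rec, vars)
)

# A stricter threshold accepts only the near-exact pair;
# a looser threshold accepts both candidates.
strict <- scores >= 12
loose  <- scores >= 10
```

Under this toy scheme, lowering the threshold from 12 to 10 admits the second, weaker candidate as well, which is the kind of trade-off the argument description refers to.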

days_allowed_before_event

Numeric (e.g. 7), default is 7. Its meaning depends on the linkage type. For c2h linkages, this is how many days before the onset_date an admission may occur and still count as disease-related, i.e. the lower limit of the window for disease-related admissions. For example, if you choose seven days, you are allowing an admission to occur up to seven days prior to the diagnosis, which means the disease was diagnosed while the person was an inpatient. For v2h linkages, this is the minimum time between the latest vaccination date and the admission date for the dose to be considered valid. For v2c linkages, this is the minimum time between the latest vaccination date and the onset_date for the dose to be considered valid. For v2e linkages, this is the minimum time between the latest vaccination date and the event_date for the dose to be considered valid.

days_allowed_after_event

Numeric (e.g. 30), default is 14. How many days after the onset date a disease-related admission may occur. This is the upper limit of the window for disease-related admissions. For example, if you choose 30 days, you are allowing a disease-related admission to occur up to 30 days after the diagnosis, which means the disease was diagnosed close to or prior to the admission.
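
The two window arguments can be pictured with a short base R sketch for the c2h case. The dates are invented, and treating both ends of the window as inclusive is an assumption, since the page does not say how the boundaries are handled.

```r
onset_date <- as.Date("2024-05-10")
admission_dates <- as.Date(c("2024-05-01", "2024-05-05", "2024-05-20", "2024-06-15"))

days_allowed_before_event <- 7
days_allowed_after_event  <- 14

# Days from onset to admission; negative values are admissions before onset.
gap <- as.numeric(admission_dates - onset_date)

# Assumption: both ends of the window are inclusive.
related <- gap >= -days_allowed_before_event & gap <= days_allowed_after_event

data.frame(admission_dates, gap, related)
```

Only the admissions five days before and ten days after onset fall inside the window; the other two are treated as unrelated to the case.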

one_row_per_person

Logical (TRUE or FALSE), default is TRUE. When TRUE, multiple admissions per person are collapsed into a single row. A series of variables prefixed with "first_" is created, such as "first_admission_date", and all admission events are gathered into a series of variables suffixed with "s", such as "admission_dates". Also works when each person has a single admission.
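
The collapse to one row per person can be sketched in base R as follows. The toy data, and the choice to store all admissions as one comma-separated string, are assumptions made for illustration; the page does not state how starling stores the "s"-suffixed variables.

```r
admissions <- data.frame(
  id = c(1, 1, 2),
  admission_date = as.Date(c("2024-03-09", "2024-03-02", "2024-04-11"))
)

# Sort so that the earliest admission comes first within each person.
admissions <- admissions[order(admissions$id, admissions$admission_date), ]

# One element per person, each holding that person's admission dates.
by_person <- split(admissions$admission_date, admissions$id)

one_row <- data.frame(
  id = as.numeric(names(by_person)),
  # "first_" prefix: the earliest admission for the person
  first_admission_date = as.Date(unname(
    vapply(by_person, function(d) format(d[1]), character(1))
  )),
  # "s" suffix: every admission for the person, collapsed into one string
  admission_dates = unname(
    vapply(by_person, function(d) paste(format(d), collapse = ", "), character(1))
  ),
  stringsAsFactors = FALSE
)

one_row
```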

clean_eggs

Logical (TRUE or FALSE), default is TRUE. When TRUE, drops all the ".y" variables from the second dataset (df2) that duplicate variables in df1, keeps the df1 variables, and removes the ".x" suffix from them. When FALSE, many, if not most, variables will have ".x" or ".y" attached to them (e.g. gender). Keep this as TRUE by default, and set it to FALSE if you want to check that the linkages are true and working.
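
The ".x" and ".y" suffixes come from joining two data frames that share column names. The base R sketch below reproduces that effect with merge and then applies the kind of tidy-up described above; it uses toy data and is an illustration of the idea, not starling's code.

```r
cases <- data.frame(
  id = 1:2,
  gender = c("F", "M"),
  onset_date = as.Date(c("2024-01-05", "2024-01-09")),
  stringsAsFactors = FALSE
)
hosp <- data.frame(
  id = 1:2,
  gender = c("F", "M"),
  admission_date = as.Date(c("2024-01-07", "2024-01-10")),
  stringsAsFactors = FALSE
)

# Left join: the shared column "gender" is suffixed with .x (cases) and .y (hosp).
linked <- merge(cases, hosp, by = "id", all.x = TRUE)
names(linked)
# "id" "gender.x" "onset_date" "gender.y" "admission_date"

# Drop the .y duplicates and strip the .x suffix from what remains.
tidy <- linked[, !grepl("\\.y$", names(linked)), drop = FALSE]
names(tidy) <- sub("\\.x$", "", names(tidy))
names(tidy)
# "id" "gender" "onset_date" "admission_date"
```

Keeping both suffixed columns (the FALSE setting) is useful while checking links, because comparing gender.x with gender.y row by row quickly shows mismatched pairs.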

days_between_onset_death

Numeric (e.g. 30). If you have put a date of death into the clean_the_nest command (which will rename it to dod), then the command will find disease-related dates of death. This is the maximum number of days between onset and death for a death to count as disease-related. Often this may be 30 days for SARS-CoV-2, or it can be much longer for HIV. If you don't want an upper limit, use 9999.
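
A minimal base R sketch of the rule described above, using invented dates. It assumes that a death counts as disease-related when it falls on or after onset and within the chosen number of days, and that a missing dod means no recorded death; the exact boundary handling inside starling is not documented here.

```r
onset_date <- as.Date(c("2024-01-01", "2024-01-01", "2024-01-01"))
dod        <- as.Date(c("2024-01-20", "2024-03-15", NA))

# Hypothetical helper: flags deaths within `limit` days of onset.
flag_related_death <- function(onset, death, limit) {
  days_to_death <- as.numeric(death - onset)
  !is.na(days_to_death) & days_to_death >= 0 & days_to_death <= limit
}

# 30-day limit: only the first death is flagged.
related_30 <- flag_related_death(onset_date, dod, 30)

# Effectively no upper limit: both recorded deaths are flagged.
related_any <- flag_related_death(onset_date, dod, 9999)
```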

last_follow_up

A date (input as ymd("2024-11-22")) that represents the last follow-up. This could be the latest admission date of a dataset. Used for calculating survival time.
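
One common way a last follow-up date enters a survival-time calculation is as a censoring date for people with no recorded death. The base R sketch below shows that idea with invented dates; measuring time from onset, and the variable names used, are assumptions rather than a description of what starling computes.

```r
onset_date     <- as.Date(c("2024-01-01", "2024-02-01"))
dod            <- as.Date(c("2024-01-20", NA))
last_follow_up <- as.Date("2024-11-22")

# People with no recorded death are censored at the last follow-up date.
end_date <- dod
end_date[is.na(end_date)] <- last_follow_up

survival_days <- as.numeric(end_date - onset_date)
died <- !is.na(dod)

data.frame(onset_date, end_date, survival_days, died)
```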

Details

A murmuration is a shape-shifting flock of thousands of starlings all flying in sync with each other. Murmuration means that each bird must be linked (through observation of their movements) to approximately seven other birds to achieve the beautiful sky art that moves through the sky.

Make sure that you do not have the same variables (other than linkage variables, e.g. letternames, DOB, gender) in both datasets. For example, if both datasets have date of death, choose the dataset with the highest confidence, and drop the date of death from the other dataset. Always make sure your date columns are properly formatted using as_date or as.Date.

If the dataset is being linked to a hospitalization dataset, the difference in time between onset_date and admission_date will be used to identify related hospitalizations. The user can filter out unrelated hospitalizations using diagnosis-related groups or ICD-10 codes separately, prior to linkage.

Classic workflow would be:

  1. clean_the_nest to clean and prep data for linkage. Pay close attention to your linkage variables (letternames, date of birth, medicare number, gender and/or postcode), and ensure all dates are formatted as dates.

  2. murmuration with linkage_type="v2c" to link cases to vaccination data.

  3. murmuration with linkage_type="v2h" to link a v2c dataset to hospitalization data. Or skip linking to case data, and just build a v2h dataset for test-negative case-control studies.

  4. murmuration with linkage_type="v2e" to link event linelists (flight manifests, outbreak investigations) to vaccination data.

  5. preening to prettify the data frame, preparing it for exploration, analysis and presentation. Works well with gtsummary::tbl_summary().

Examples

# \donttest{
# Example 1: Link cases to vaccination history
# First, clean the datasets to standardize column names
dx_clean <- clean_the_nest(dx_data,
  data_type = "cases",
  id_var = "identity",
  lettername1 = "first_name",
  lettername2 = "surname",
  dob = "date_of_birth",
  gender = "gender",
  postcode = "postcode",
  medicare = "medicare_no",
  diagnosis = "disease_name")

vax_clean <- clean_the_nest(vax_data,
  data_type = "vaccination",
  id_var = "patient_id",
  lettername1 = "firstname",
  lettername2 = "last_name",
  dob = "birth_date",
  gender = "gender",
  postcode = "postcode",
  medicare = "medicare_number",
  vax_type = "vaccine_delivered",
  vax_date = "service_date")

# Now link cases to vaccination history
df1 <- murmuration(dx_clean, vax_clean,
  linkage_type = "v2c",
  blocking_var = "gender",
  compare_vars = c("lettername1", "lettername2", "dob"),
  clean_eggs = FALSE)

# Example 2: Link hospitalization data to vaccination history
hosp_clean <- clean_the_nest(hosp_data,
  data_type = "hospital",
  id_var = "patient_id",
  lettername1 = "firstname",
  lettername2 = "last_name",
  dob = "birth_date",
  gender = "sex",
  postcode = "zip_codes",
  medicare = "medicare_number",
  admission_date = "date_of_admission",
  discharge_date = "date_of_discharge")

df2 <- murmuration(hosp_clean, vax_clean,
  linkage_type = "v2h",
  blocking_var = "gender",
  compare_vars = c("lettername1", "lettername2", "medicare10", "dob"),
  clean_eggs = FALSE,
  one_row_per_person = TRUE)

# Example 3: Link flight manifest to vaccination history
manifest_clean <- clean_the_nest(manifest_data,
  data_type = "cases",
  id_var = "passenger_id",
  lettername1 = "first_name",
  lettername2 = "surname",
  dob = "date_of_birth",
  gender = "gender")

df_flight <- murmuration(manifest_clean, vax_clean,
  linkage_type = "v2e",
  event_date = as.Date("2024-03-15"),
  blocking_var = "gender",
  compare_vars = c("lettername1", "lettername2", "dob"),
  days_allowed_before_event = 14,
  clean_eggs = FALSE)

# Example 4: Link outbreak linelist to vaccination history
linelist_clean <- clean_the_nest(linelist_data,
  data_type = "cases",
  id_var = "case_id",
  lettername1 = "first_name",
  lettername2 = "surname",
  dob = "date_of_birth",
  gender = "gender",
  postcode = "postcode",
  medicare = "medicare_no",
  onset_date = "onset_date")

df_outbreak <- murmuration(linelist_clean, vax_clean,
  linkage_type = "v2e",
  event_date = as.Date("2024-06-01"),
  blocking_var = "postcode",
  compare_vars = c("lettername1", "lettername2", "dob", "medicare10"),
  days_allowed_before_event = 7,
  clean_eggs = FALSE)
# }
