clean_the_nest: Cleans datasets and establishes a common variable naming convention

Description

Cleans three dataset types and prepares them for data-linkage with murmuration: diagnosis datasets (linelists such as Notifiable Conditions registers), hospitalization datasets (administrative datasets), and vaccination datasets. This command is the first step in creating the datasets for analysis. Building a solid "nest" is akin to building a solid foundation for future work. Of note, Starlings are cavity nesters, meaning that they prefer to build their homes inside holes and crevices. There are no mandatory variables. However, a dataset of infections would include at minimum an onset date (date of diagnosis), a dataset of admissions would include admission dates, and a dataset of vaccinations would include dates of vaccination and vaccine types. All of the datasets should include information that allows for data-linkage, such as first name, last name, date of birth and address.

The classic workflow would be:

1. clean_the_nest to clean and prep data for linkage. Pay close attention to your linkage variables (letternames, date of birth, medicare number, gender and/or postcode), and ensure all dates are formatted as dates.
2. murmuration to link cases to vaccination data (named here "c2v").
3. murmuration to link c2v to hospitalization data (named here "c2v2h"). Of note, you can skip linking the vaccination dataset.
4. preening to prettify the dataframe, prepping it for exploration, analysis and presentation. Great to use with gtsummary::tbl_summary().
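
The first workflow step stresses that dates must be stored as dates and that linkage IDs must be complete. A minimal pre-flight check in base R might look like the sketch below. The toy data frame `raw` and its column names are hypothetical and only serve as an illustration; they are not part of the package.

```r
# Hypothetical toy data; column names are illustrative only.
raw <- data.frame(
  identity       = c("A1", "A2", "A3"),
  diagnosis_date = c("2021-03-01", "2021-03-15", "2021-04-02"),
  stringsAsFactors = FALSE
)

# Convert character dates to Date class before cleaning.
raw$diagnosis_date <- as.Date(raw$diagnosis_date, format = "%Y-%m-%d")

# Checks: dates parsed, no missing IDs, one row per person (case data).
dates_ok  <- inherits(raw$diagnosis_date, "Date") && !anyNA(raw$diagnosis_date)
ids_ok    <- !anyNA(raw$identity)
unique_ok <- anyDuplicated(raw$identity) == 0
```

If any of the three flags is FALSE, it is probably worth fixing the source data before calling clean_the_nest.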

Usage

clean_the_nest(
  data,
  id_var = NULL,
  event_id_var = NULL,
  drop_eggs = FALSE,
  data_type = NULL,
  lie_nest_flat = FALSE,
  drop_the_na_vax = TRUE,
  keep_vars = NULL,
  diagnosis = NULL,
  lettername1 = NULL,
  lettername2 = NULL,
  dob = NULL,
  age = NULL,
  medicare = NULL,
  postcode = NULL,
  gender = NULL,
  fn = NULL,
  latitude = NULL,
  longitude = NULL,
  onset_date = NULL,
  vax_type = NULL,
  vax_date = NULL,
  lag = 0,
  admission_date = NULL,
  discharge_date = NULL,
  hospital = NULL,
  icd_code = NULL,
  diagnosis_description = NULL,
  drg = NULL,
  icu_date = NULL,
  icu_hours = NULL,
  dialysis = NULL,
  genomics = NULL,
  dod = NULL,
  died = NULL
)

Value

A cleaned dataframe with common variable names, ready for machine learning data-linkage.

Arguments

data: The dataset, which can be a case notifications dataset (infections), a hospital admissions dataset or a vaccination dataset (a vaccination dataset must be specified as such, see data_type). Make sure dates are in date format.
id_var: Any format, as long as it is unique to the individual. This ID variable is critical. For case data, ensure there is only one row per person, or first infection only. It identifies the multiple rows associated with a person who has multiple vaccines, admissions or infections. Cannot have missing data, or the observation will be lost in the linking process.
event_id_var: Any format, as long as it is unique for the whole dataset. This represents the ID of the vaccination event, or the hospitalization event, which MUST be distinct. A person (id_var) can have multiple events (event_id_var). Some datasets will surprise you with multiple entries for the same admission.
drop_eggs: This effectively drops the variables that are not being used. May turn this off if you need lots of extra information, but certainly good for the early stages of an analysis. Enables a lean dataset.
data_type: Three options: "vaccination", "hospital", or "cases". For a vaccination dataset, the key information required is the linkage variables and the vaccination events. No age or age categories will be calculated if it is a vaccination dataset.
lie_nest_flat: Takes a long vaccination dataset (like the Australian Immunization Register; 1 or more rows per person) and turns it into a wide dataset with one row per person.
drop_the_na_vax: Drops (removes) vaccines that are listed as having no names.
keep_vars: Character vector of variable names to keep. Supply the names in quotation marks, as the vector will be used in a select statement.
diagnosis: Character format. The column with the infectious disease diagnosis listed. e.g. COVID-19, SARS-CoV-2, RSV, Influenza.
lettername1: Character format. First Name variable. If there is a second first name (some cases this might be a middle name), it will be removed during cleaning. All non-alphanumeric characters will be removed and everything becomes lower case.
lettername2: Character format. Last name variable. All non-alphanumeric characters will be removed and everything becomes lower case. Two part last names will be kept.
dob: Date format. The date of birth (make sure dates are in date format).
age: Numeric format. Include age only if it has been pre-specified in the dataset, and you don't want it re-calculated.
medicare: Numeric format. Medicare number. Versions of the medicare number with 9, 10 and 11 digits will be created. In Australia, the 10th digit represents the card ID, and the 11th digit represents the person ID. A family or individual will get a new card ID (10th digit) every time their card expires.
postcode: Numeric format. Post code of person with no restriction on the number of digits.
gender: Character format. Pay close attention that your genders are in a similar format for data-linkage - "F", vs "0" vs "Female". This is left up to the user to clean.
fn: Character format. First Nations Status.
latitude: Numeric format. Latitude of address. Not explicitly required for linkage.
longitude: Numeric format. Longitude of address. Not explicitly required for linkage.
onset_date: Date format. Onset date of the illness. Commonly the date of diagnosis (date of the lab test or date of the first symptom). Must be in date format.
vax_type: Character format. Variable that indicates the vaccine type, brand, or antigen.
vax_date: Date format. Variable that indicates the vaccination event date. Make sure it is in date format, and arranged in the order of dates you would like to appear when it goes to wide format. For example, if it is not in order, vax_date_1 (an output variable) may be the latest vaccination date, instead of the first.
lag: Numeric format. Number of days to add to the vaccination event date. Useful to define when a person reaches peak immunity post-vaccination. For COVID-19 this is often thought to be 14 days. Default lag is zero days.
admission_date: Date format. Admission date variable. Typically, this should be later than the date of onset, but there are times when the disease is diagnosed in hospital.
discharge_date: Date format. Discharge date variable. This date should be later than the date of admission.
hospital: Hospital identifier. Typically name of the hospital.
icd_code: Character format. ICD code variable for the admission. No pre-specified format required.
diagnosis_description: Character format. Written description of the ICD code. For ease of understanding what the ICD codes mean, not a critical variable.
drg: Character format. Diagnostic related group variable for the admission. No pre-specified format required.
icu_date: Date format. ICU admission date preferably. Typically, this should be later than the date of onset and admission, but there are times when the disease is diagnosed in ICU.
icu_hours: ICU hours. Hours spent in ICU. Should be numeric.
dialysis: Dialysis indicator (0/1).
genomics: Character format. Genomics variable. Can be the variant of SARS-CoV-2 or, similarly, of Hepatitis A.
dod: Date format. Variable representing date of death. Must only have one date of death chosen (in diagnosis dataset or hospitalization dataset, not both). If dod selected is from the hospitalization dataset, it will be deleted for persons without an admission.
died: Variable representing death, best coded as 0 and 1.
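
The name cleaning described for lettername1 and lettername2, and the date shift described for lag, can be approximated in base R. The sketch below is an illustration under stated assumptions, not the package's implementation: `normalise_name` and `shift_date` are hypothetical helpers, and the pattern assumes ASCII names.

```r
# Hypothetical helper: lower-case a name and strip non-alphanumeric characters.
# keep_first_word = TRUE drops anything after the first whitespace,
# which mimics removing a second first name.
normalise_name <- function(x, keep_first_word = FALSE) {
  x <- trimws(tolower(x))
  if (keep_first_word) x <- sub("\\s.*$", "", x)
  gsub("[^a-z0-9]", "", x)
}

# Hypothetical helper: add a lag (in days) to a vaccination date.
shift_date <- function(d, lag = 0) d + lag

first_clean <- normalise_name("Mary Anne", keep_first_word = TRUE)
last_clean  <- normalise_name("O'Brien-Smith")
immune_date <- shift_date(as.Date("2021-06-01"), lag = 14)
```

Here both parts of the two-part last name survive, only the punctuation is removed, and a 14-day lag moves the date forward by two weeks.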

Examples

# Basic usage of clean_the_nest.
# Use this to set up for datalinkage using the murmuration command and then cleaning with preening
data(dx_data)
df_diag <- clean_the_nest(dx_data, drop_eggs=TRUE, data_type = "cases",
  id_var ="identity",
  diagnosis = "disease_name",
  lettername1 = "first_name",
  lettername2 = "surname",
  dob = "date_of_birth",
  medicare = "medicare_no",
  gender = "gender",
  postcode="postcode",
  fn="indigenous_status",
  onset_date = "diagnosis_date")

data(hosp_data)
df_hosp <- clean_the_nest(hosp_data, drop_eggs=TRUE,
  data_type = "hospital",
  id_var ="patient_id",
  lettername1 = "firstname",
  lettername2 = "last_name",
  dob = "birth_date",
  medicare = "medicare_number",
  gender = "sex",
  postcode="zip_codes",
  fn="cultural_heritage",
  icd_code = "icd_codes",
  admission_date = "date_of_admission",
  discharge_date = "date_of_discharge")

data(vax_data)
df_vax <- clean_the_nest(data = vax_data,
  data_type = "vaccination",
  lie_nest_flat=TRUE,
  id_var = "patient_id",
  lettername1="firstname",
  lettername2="last_name",
  dob="birth_date",
  medicare="medicare_number",
  gender = "gender",
  postcode = "postcode",
  vax_type = "vaccine_delivered",
  vax_date = "service_date")
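
To illustrate what moving from a long to a wide vaccination dataset means, and why the order of vaccination dates matters, here is a small base R sketch. It uses stats::reshape on hypothetical toy data; the column names and the resulting wide names (for example `service_date_1`) are assumptions for illustration and will differ from the names clean_the_nest produces.

```r
# Hypothetical long vaccination data: one row per vaccination event.
long <- data.frame(
  patient_id        = c("P1", "P1", "P2"),
  vaccine_delivered = c("BrandA", "BrandB", "BrandA"),
  service_date      = as.Date(c("2021-05-01", "2021-03-01", "2021-04-10")),
  stringsAsFactors  = FALSE
)

# Sort by person and date so that dose 1 is the earliest event.
long <- long[order(long$patient_id, long$service_date), ]

# Number the doses within each person.
long$dose <- ave(seq_along(long$patient_id), long$patient_id, FUN = seq_along)

# Reshape to one row per person, with one set of columns per dose.
wide <- reshape(long, idvar = "patient_id", timevar = "dose",
                direction = "wide", sep = "_")
```

Without the sorting step, the first dose column for P1 would hold the May date rather than the March one.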
