
DIVINE (version 0.1.1)

impute_missing: Replace Missing Values

Description

Replace missing values (NA) in a data frame using a specified method or value (mean, median, mode, a constant, or a custom function). Imputation is applied column-wise.

Usage

impute_missing(
  data,
  method = list(dplyr::where(is.numeric) ~ "mean", dplyr::where(is.character) ~ "mode",
    dplyr::where(is.factor) ~ "mode"),
  filter_by = NULL,
  drop_all_na = FALSE,
  verbose = TRUE
)

Value

A tibble with missing values replaced according to the provided specifications.

Arguments

data

A data frame. The dataset in which missing values should be imputed.

method

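A list of two-sided formulas of the form <selector> ~ <value>, where <selector> is a tidyselect expression choosing the columns and <value> specifies how to impute them. Supported <value> options are: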

  • "mean": replace with the column mean (numeric columns only).

  • "median": replace with the column median (numeric columns only).

  • "mode": replace with the most frequent value (works for numeric, character, or factor).

  • A numeric constant: replace with that constant (numeric columns).

  • A character constant: replace with that value (character/factor columns).

  • A function: a function of the form function(col) that receives the column and returns a single value used to replace NA.

The default is list(dplyr::where(is.numeric) ~ "mean", dplyr::where(is.character) ~ "mode", dplyr::where(is.factor) ~ "mode").
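
To make the <value> options above concrete, here is a minimal base-R sketch of what each option does to a single column. It is an illustration only, not the DIVINE implementation: the helpers stat_mode() and impute_column() are hypothetical names introduced for this example, and tie-breaking for "mode" (first most frequent value wins) is an assumption.

```r
# Illustrative helpers only; these are NOT part of DIVINE.

# Most frequent non-missing value (ties: the first one encountered wins).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace NA in one column with a single value derived from `how`.
impute_column <- function(col, how) {
  value <- if (is.function(how)) {
    how(col)                           # custom function
  } else if (identical(how, "mean")) {
    mean(col, na.rm = TRUE)
  } else if (identical(how, "median")) {
    stats::median(col, na.rm = TRUE)
  } else if (identical(how, "mode")) {
    stat_mode(col)
  } else {
    how                                # numeric or character constant
  }
  col[is.na(col)] <- value
  col
}

x <- c(1, 2, 2, NA, 10)
impute_column(x, "mean")    # NA becomes 3.75
impute_column(x, "median")  # NA becomes 2
impute_column(x, "mode")    # NA becomes 2
impute_column(x, 0)         # NA becomes 0
impute_column(x, function(col) min(col, na.rm = TRUE))  # NA becomes 1
impute_column(c("a", NA, "b", "a"), "mode")             # NA becomes "a"
```

In impute_missing itself, the same choices are expressed declaratively, one formula per group of columns, rather than by calling a helper per column.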

filter_by

Character vector of column names. If provided, only rows with non-missing values in all of the specified columns are kept. This filter is applied before imputation.

drop_all_na

Logical; if TRUE, rows where all columns are NA are removed before imputation.

verbose

Logical; if TRUE (the default), prints a concise final summary of what was imputed. Set to FALSE to suppress messages.

Details

You can remove rows that are entirely NA before imputation using drop_all_na, or filter rows based on specific variables using filter_by.
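
The two row-level steps can be pictured with plain base R. The sketch below is an approximation of the documented behaviour under the assumption that both steps are simple row filters; the data frame df and the variable names are made up for illustration and the code is not taken from the package.

```r
# Toy data: row 2 is entirely NA, rows 3 and 4 are partially missing.
df <- data.frame(
  a = c(1, NA, NA, 4),
  b = c("x", NA, "y", NA),
  c = c(TRUE, NA, FALSE, TRUE)
)

# drop_all_na-style step: drop rows in which every column is NA.
all_na <- rowSums(is.na(df)) == ncol(df)
df_no_empty <- df[!all_na, , drop = FALSE]

# filter_by-style step: keep rows that are non-NA in all listed columns.
keep_cols <- c("a", "c")
complete <- stats::complete.cases(df[, keep_cols, drop = FALSE])
df_filtered <- df[complete, , drop = FALSE]

nrow(df_no_empty)  # 3 (rows 1, 3 and 4 remain)
nrow(df_filtered)  # 2 (rows 1 and 4 remain)
```

Because both filters run before imputation, the rows they remove do not contribute to the column means, medians, or modes.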

  • The method argument uses tidyselect helpers. For example, where(is.numeric) ~ "median" imputes all numeric columns with their medians.

  • "mode" works for numeric, character and factor columns.

  • When imputing factors with a character constant, the constant is added as a new level if needed.

  • A custom function should return a single value; if it returns more than one, only the first is used (with a warning).
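
The last two points can be sketched in base R. This is a hedged illustration of the described behaviour, not the package's code: first_value() is a hypothetical helper, and the warning text is invented for the example.

```r
# Factor with a missing entry; "unknown" is not yet one of its levels.
f <- factor(c("low", NA, "high", "low"))

# Assigning an unseen value to a factor would yield NA (with a warning),
# so the new level is registered before filling.
fill <- "unknown"
if (!fill %in% levels(f)) levels(f) <- c(levels(f), fill)
f[is.na(f)] <- fill
levels(f)  # "high" "low" "unknown"

# Hypothetical helper: keep only the first value a custom function returns.
first_value <- function(value) {
  if (length(value) > 1) {
    warning("custom function returned multiple values; using the first")
    value <- value[1]
  }
  value
}

# range() returns two values (minimum and maximum); only the minimum is used.
x <- c(1, 2, NA, 4, 5)
# suppressWarnings() only keeps this demo quiet.
x[is.na(x)] <- suppressWarnings(first_value(range(x, na.rm = TRUE)))
x  # 1 2 1 4 5
```

Returning exactly one value from a custom function avoids the warning and makes the intended replacement explicit.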

Examples

# Use the defaults: numeric columns by their means,
# character and factor columns by their modes:
impute_missing(icu)

# Impute numeric columns by median:
impute_missing(
  icu,
  method = list(where(is.numeric) ~ "median")
)

# Keep only rows where both "vent_mec_no_inv" and "vent_mec" are non-missing:
impute_missing(
  icu,
  filter_by = c("vent_mec_no_inv", "vent_mec")
)
