anonymize_dataframe: Anonymize Patient Data in a Data Frame

Description

Anonymizes patient data in a data frame or data.table. Columns are detected automatically from their data types and naming patterns, or can be specified manually. By default (dataset_specific = TRUE), different datasets receive different anonymized values, which improves privacy.

Usage

anonymize_dataframe(
  data,
  id_cols = NULL,
  name_cols = NULL,
  date_cols = NULL,
  location_cols = NULL,
  age_cols = NULL,
  auto_detect = TRUE,
  detect_by_type = TRUE,
  date_method = "shift",
  date_granularity = "month",
  location_method = "generalize",
  age_method = "10year",
  use_uuid = TRUE,
  seed = NULL,
  dataset_specific = TRUE
)

Value

A data frame with anonymized patient data. If the input was a data.table, the data.table class is preserved.

Arguments

data: A data frame or data.table containing patient data
id_cols: Character vector of column names containing patient IDs
name_cols: Character vector of column names containing patient names
date_cols: Character vector of column names containing dates
location_cols: Character vector of column names containing locations
age_cols: Character vector of column names containing ages
auto_detect: Logical, if TRUE (default), automatically detects columns based on data types and common naming patterns
detect_by_type: Logical, if TRUE (default), detects columns by their R data types (Date, character, etc.) in addition to name patterns
date_method: Method for date anonymization: "shift" or "round" (default: "shift"). Use "round" to enable granularity options including "month_year" (YYYYMM format).
date_granularity: For date rounding (when date_method = "round"): "day", "week", "month", "month_year" (returns YYYYMM format, e.g., "202005"), "quarter", or "year" (default: "month")
location_method: Method for location anonymization: "remove" or "generalize" (default: "generalize")
age_method: Method for age anonymization: "10year" (default) uses 10-year buckets (0-9, 10-19, 20-29, ..., 80-89, 90+) for better research utility, or "hipaa" for HIPAA-compliant buckets (0-17, 18-64, 65-89, 90+)
use_uuid: Logical, if TRUE uses short UUIDs for IDs, names, and locations instead of sequential identifiers (default: TRUE). Dates and ages are not affected.
seed: Optional seed for reproducible anonymization (default: NULL). With dataset_specific = TRUE, different datasets still get different anonymized values even with the same seed.
dataset_specific: Logical, if TRUE (default), generates dataset-specific seeds so different datasets get different anonymized values
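
The bucket boundaries for age_method and the YYYYMM format for date_granularity = "month_year" can be reproduced with base R alone. The sketch below is only an illustration of the documented output formats, not the package's internal implementation; the variable names (ages, ten_year, hipaa, dates, month_year) are made up for this example.

```r
# Illustrative only: base-R approximation of the documented formats,
# not the package's own code.
ages <- c(5, 17, 18, 42, 64, 65, 89, 90, 101)

# "10year" buckets: 0-9, 10-19, ..., 80-89, 90+
ten_year <- cut(
  ages,
  breaks = c(seq(0, 90, by = 10), Inf),
  right = FALSE,  # intervals are closed on the left: [0, 10), [10, 20), ...
  labels = c(paste0(seq(0, 80, by = 10), "-", seq(9, 89, by = 10)), "90+")
)

# "hipaa" buckets: 0-17, 18-64, 65-89, 90+
hipaa <- cut(
  ages,
  breaks = c(0, 18, 65, 90, Inf),
  right = FALSE,
  labels = c("0-17", "18-64", "65-89", "90+")
)

# Side-by-side comparison of the two bucketing schemes
data.frame(age = ages, ten_year = ten_year, hipaa = hipaa)

# "month_year" granularity: a date reduced to a YYYYMM string
dates <- as.Date(c("2020-05-17", "1975-03-20"))
month_year <- format(dates, "%Y%m")
month_year
```

The 10-year scheme keeps more age resolution for analysis, while the "hipaa" scheme collapses ages into four broad groups.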

Examples


# Basic usage with auto-detection
patient_data <- data.frame(
  patient_id = c("P001", "P002", "P003"),
  name = c("John Doe", "Jane Smith", "Bob Johnson"),
  dob = as.Date(c("1980-01-15", "1975-03-20", "1990-06-10")),
  location = c("New York, NY", "Los Angeles, CA", "Chicago, IL"),
  diagnosis = c("A", "B", "A")
)
anonymize_dataframe(patient_data, seed = 123)

# With month_year date granularity (YYYYMM format)
anonymize_dataframe(patient_data, date_method = "round", date_granularity = "month_year")

# Works with data.table
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(patient_data)
  anonymize_dataframe(dt)
}

# With UUID anonymization (default)
anonymize_dataframe(patient_data, use_uuid = TRUE, seed = 123)

# Without UUID (sequential IDs)
anonymize_dataframe(patient_data, use_uuid = FALSE, seed = 123)
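
The Description mentions that columns can be specified manually, but none of the examples above do so. The following is a minimal sketch of that usage, relying only on arguments listed under Usage. The visits data frame and its column names are hypothetical, and the call is guarded so the snippet still runs when the package is not loaded.

```r
# Hypothetical data for illustration; the column names are invented here.
visits <- data.frame(
  mrn = c("A17", "B42", "C03"),
  full_nm = c("John Doe", "Jane Smith", "Bob Johnson"),
  seen_on = as.Date(c("2020-05-17", "2021-01-03", "2019-11-30")),
  yrs = c(34, 67, 91),
  stringsAsFactors = FALSE
)

# Name each column explicitly and switch off automatic detection.
# The guard skips the call if anonymize_dataframe() is not available.
if (exists("anonymize_dataframe")) {
  anonymize_dataframe(
    visits,
    id_cols = "mrn",
    name_cols = "full_nm",
    date_cols = "seen_on",
    age_cols = "yrs",
    auto_detect = FALSE,
    age_method = "hipaa",
    seed = 123
  )
}
```

Listing columns explicitly may be preferable when column names are unlikely to match common naming patterns.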
