privacyR (version 1.0.1)

anonymize_dataframe: Anonymize Patient Data in a Data Frame

Description

Main function for anonymizing patient data in a data frame or data.table. Columns are detected automatically from their data types and naming patterns, or can be specified manually. By default, different datasets get different anonymized values for better privacy.

Usage

anonymize_dataframe(
  data,
  id_cols = NULL,
  name_cols = NULL,
  date_cols = NULL,
  location_cols = NULL,
  age_cols = NULL,
  auto_detect = TRUE,
  detect_by_type = TRUE,
  date_method = "shift",
  date_granularity = "month",
  location_method = "generalize",
  age_method = "10year",
  use_uuid = TRUE,
  seed = NULL,
  dataset_specific = TRUE
)

Value

A data frame with anonymized patient data. If the input is a data.table, the data.table class is preserved.

Arguments

data

A data frame or data.table containing patient data

id_cols

Character vector of column names containing patient IDs

name_cols

Character vector of column names containing patient names

date_cols

Character vector of column names containing dates

location_cols

Character vector of column names containing locations

age_cols

Character vector of column names containing ages

auto_detect

Logical, if TRUE (default), automatically detects columns based on data types and common naming patterns

detect_by_type

Logical, if TRUE (default), detects columns by their R data types (Date, character, etc.) in addition to name patterns

date_method

Method for date anonymization: "shift" or "round" (default: "shift"). Use "round" to enable granularity options including "month_year" (YYYYMM format).
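
This page does not say how "shift" is implemented. One common approach, sketched below in base R purely as an illustration, moves every date by the same random offset so that the intervals between dates are preserved. The helper name shift_dates_sketch and its max_days argument are hypothetical and are not part of privacyR.

```r
# Hypothetical helper, NOT a privacyR function: shifts all dates by one
# shared random offset (an assumed interpretation of "shift").
shift_dates_sketch <- function(x, max_days = 365, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Draw a single non-zero offset in days, in either direction
  offset <- sample(c(seq(-max_days, -1), seq(1, max_days)), 1)
  x + offset
}

dob <- as.Date(c("1980-01-15", "1975-03-20", "1990-06-10"))
shifted <- shift_dates_sketch(dob, seed = 123)

# The gaps between dates are unchanged, only the absolute dates move
all(diff(shifted) == diff(dob))
```

A shared offset keeps time-to-event calculations usable. Whether privacyR uses a shared or per-row offset is not stated here.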

date_granularity

For date rounding (when date_method = "round"): "day", "week", "month", "month_year" (returns YYYYMM format, e.g., "202005"), "quarter", or "year" (default: "month")
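
To make the "month_year" output concrete, the base R sketch below reproduces the YYYYMM format described above. The helper round_date_sketch is hypothetical. Its "month" and "year" branches (first day of the month or year) are assumptions about what rounding might look like, not documented privacyR behavior.

```r
# Hypothetical helper, NOT a privacyR function.
round_date_sketch <- function(x, granularity = c("month", "month_year", "year")) {
  granularity <- match.arg(granularity)
  switch(granularity,
    # Assumption: "month" maps each date to the first day of its month
    month = as.Date(format(x, "%Y-%m-01")),
    # Documented format: a YYYYMM character string such as "202005"
    month_year = format(x, "%Y%m"),
    # Assumption: "year" maps each date to January 1 of its year
    year = as.Date(format(x, "%Y-01-01"))
  )
}

dob <- as.Date(c("1980-01-15", "1975-03-20", "1990-06-10"))
round_date_sketch(dob, "month_year")
# [1] "198001" "197503" "199006"
```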

location_method

Method for location anonymization: "remove" or "generalize" (default: "generalize")

age_method

Method for age anonymization: "10year" (default) uses 10-year buckets (0-9, 10-19, 20-29, ..., 80-89, 90+) for better research utility, or "hipaa" for HIPAA-compliant buckets (0-17, 18-64, 65-89, 90+)
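
The two bucket schemes above can be illustrated with base R's cut(). This is a minimal sketch of the bucket boundaries as listed, not the package's implementation. bucket_age_sketch is a hypothetical name.

```r
# Hypothetical helper, NOT a privacyR function: maps numeric ages to the
# bucket labels listed in the age_method description.
bucket_age_sketch <- function(age, method = c("10year", "hipaa")) {
  method <- match.arg(method)
  if (method == "10year") {
    breaks <- c(seq(0, 90, by = 10), Inf)
    labels <- c(paste0(seq(0, 80, by = 10), "-", seq(9, 89, by = 10)), "90+")
  } else {
    breaks <- c(0, 18, 65, 90, Inf)
    labels <- c("0-17", "18-64", "65-89", "90+")
  }
  # right = FALSE makes intervals left-closed, e.g. [18, 65) for "18-64"
  as.character(cut(age, breaks = breaks, labels = labels, right = FALSE))
}

ages <- c(5, 17, 18, 64, 65, 89, 90, 101)
bucket_age_sketch(ages, "hipaa")
# [1] "0-17"  "0-17"  "18-64" "18-64" "65-89" "65-89" "90+"   "90+"
bucket_age_sketch(ages, "10year")
# [1] "0-9"   "10-19" "10-19" "60-69" "60-69" "80-89" "90+"   "90+"
```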

use_uuid

Logical, if TRUE uses short UUIDs for IDs, names, and locations instead of sequential identifiers (default: TRUE). Dates and ages are not affected.

seed

An optional seed for reproducible anonymization. Different datasets still get different anonymized values even with the same seed (see dataset_specific).

dataset_specific

Logical, if TRUE (default), generates dataset-specific seeds so different datasets get different anonymized values

Examples

# Basic usage with auto-detection
patient_data <- data.frame(
  patient_id = c("P001", "P002", "P003"),
  name = c("John Doe", "Jane Smith", "Bob Johnson"),
  dob = as.Date(c("1980-01-15", "1975-03-20", "1990-06-10")),
  location = c("New York, NY", "Los Angeles, CA", "Chicago, IL"),
  diagnosis = c("A", "B", "A")
)
anonymize_dataframe(patient_data, seed = 123)

# With month_year date granularity (YYYYMM format)
anonymize_dataframe(patient_data, date_method = "round", date_granularity = "month_year")

# Works with data.table
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(patient_data)
  anonymize_dataframe(dt)
}

# With UUID anonymization (the default, set explicitly here)
anonymize_dataframe(patient_data, use_uuid = TRUE, seed = 123)

# Without UUID (sequential IDs)
anonymize_dataframe(patient_data, use_uuid = FALSE, seed = 123)
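
The Description mentions that columns can be specified manually, but the examples above rely on auto-detection. The sketch below shows what a manual call might look like, using only arguments from the Usage section. It assumes the privacyR package is installed, and that turning off both auto_detect and detect_by_type restricts anonymization to the listed columns, which this page does not state explicitly.

```r
# Same example data as above, repeated so this block runs on its own
patient_data <- data.frame(
  patient_id = c("P001", "P002", "P003"),
  name = c("John Doe", "Jane Smith", "Bob Johnson"),
  dob = as.Date(c("1980-01-15", "1975-03-20", "1990-06-10")),
  location = c("New York, NY", "Los Angeles, CA", "Chicago, IL"),
  diagnosis = c("A", "B", "A")
)

# Only run when privacyR is available
if (requireNamespace("privacyR", quietly = TRUE)) {
  manual_result <- privacyR::anonymize_dataframe(
    patient_data,
    id_cols = "patient_id",
    name_cols = "name",
    date_cols = "dob",
    location_cols = "location",
    auto_detect = FALSE,     # assumption: disables name-pattern detection
    detect_by_type = FALSE,  # assumption: disables type-based detection
    seed = 123
  )
  print(manual_result)
}
```

Listing columns explicitly is useful when column names do not follow common patterns, or when a column such as diagnosis should be left untouched.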
