starling (version 0.6.5)

molting: Molt: De-identify a Dataset with Hash-based Relinking

Description

Like a bird molting its feathers for new plumage, this function removes identifiable information and replaces it with a unique hash for each row. It returns both the de-identified dataset and a lookup table for relinking. Age category variables (age2cat, age3cat, etc.) are automatically retained.

Usage

molting(
  data,
  id_cols = NULL,
  pii_patterns = NULL,
  additional_pii_cols = NULL,
  hash_method = "sha256",
  hash_col_name = "row_hash",
  return_lookup = TRUE,
  seed = NULL
)

Value

If return_lookup = TRUE (default), a list with two elements:

  • deidentified: The de-identified data frame with hash column

  • lookup: A data frame containing only the identifier columns and the hash for relinking

If return_lookup = FALSE, returns only the de-identified data frame.
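
As a rough illustration of what "relinking" means here, the sketch below joins a de-identified data frame back to its lookup table on the shared hash column. The two data frames are hand-built stand-ins for the `deidentified` and `lookup` elements (the hash values are placeholders, not real digests), and using `merge()` is an assumption about how one might relink, not something this page prescribes.

```r
# Hypothetical stand-ins for the two elements of the returned list;
# the hash values are placeholders, not real digests.
deidentified <- data.frame(
  row_hash  = c("h1", "h2"),
  diagnosis = c("Condition A", "Condition B"),
  lab_value = c(120, 95)
)
lookup <- data.frame(
  mrn      = c("12345", "67890"),
  row_hash = c("h1", "h2")
)

# Relink by joining on the shared hash column.
relinked <- merge(lookup, deidentified, by = "row_hash")
relinked
```

If hash_col_name was changed from its default, the `by` argument would need to use that name instead.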

Arguments

data

A data frame to be de-identified.

id_cols

An optional character vector of column names to use for creating the hash. If NULL (the default), the function will use the PII columns it automatically detects.

pii_patterns

An optional character vector of regular expression patterns used to detect PII columns for removal. The default list includes common identifiers.

additional_pii_cols

An optional character vector of specific column names to remove as PII, in addition to those detected by pattern matching. Useful for adding dataset-specific identifiers without modifying patterns.

hash_method

The hashing algorithm to use. Options include "sha256" (default), "md5", "sha1", "sha512", "crc32", "xxhash32", "xxhash64", "murmur32", "spookyhash", or "blake3". See ?digest::digest for details.

hash_col_name

A string for the name of the new hash column. Defaults to "row_hash".

return_lookup

Logical. If TRUE (default), returns a list containing both the de-identified data and a lookup table. If FALSE, returns only the de-identified data frame.

seed

An optional integer seed for reproducible hashing with certain algorithms. Defaults to NULL.

Details

The function identifies PII columns by pattern matching, creates a hash for each row from the concatenated identifier values, and returns the de-identified dataset together with a lookup table for relinking (or only the de-identified dataset if return_lookup = FALSE).
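
The hashing step described above can be sketched roughly as follows. This is a minimal illustration using the digest package, not the package's actual implementation: the `"|"` separator, the choice to hash the plain string (`serialize = FALSE`), and the variable names are all assumptions, and the sketch ignores the seed argument entirely.

```r
library(digest)  # assumed to be installed; this page points to ?digest::digest

ids <- data.frame(
  mrn = c("12345", "67890"),
  dob = c("1980-01-01", "1975-05-15"),
  stringsAsFactors = FALSE
)

# Concatenate the identifier values of each row; the "|" separator is an assumption.
keys <- do.call(paste, c(ids, sep = "|"))

# Hash each key as a plain string (serialize = FALSE hashes the characters themselves).
row_hash <- vapply(
  keys,
  function(k) digest(k, algo = "sha256", serialize = FALSE),
  character(1),
  USE.NAMES = FALSE
)

# A lookup table in the spirit of the one described: identifiers plus the hash.
lookup <- cbind(ids, row_hash = row_hash)
```

In a scheme like this, two rows with identical identifier values would receive the same hash, so the identifier columns need to distinguish rows for the hash to be unique per row.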

Age category variables (variables matching the pattern "age\d+cat" such as age2cat, age5cat, age10cat, etc.) are automatically retained in the de-identified dataset as they are not considered directly identifying.
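
To make the detection rule more concrete, here is a small base R sketch of pattern-based column selection with an exemption for age category columns. The patterns in `pii_patterns` are hypothetical (the package's default list is not shown on this page), and anchoring the age pattern with `^` and `$` is an assumption.

```r
cols <- c("patient_name", "dob", "mrn", "age5cat", "diagnosis", "lab_value")

# Hypothetical patterns; the package's actual default list is not shown on this page.
pii_patterns <- c("name", "^dob$", "^mrn$")

# A column is flagged if it matches any pattern (case-insensitive).
is_pii <- Reduce(`|`, lapply(pii_patterns, grepl, x = cols, ignore.case = TRUE))

# Age category columns such as age5cat are exempt from removal.
is_age_cat <- grepl("^age\\d+cat$", cols)

kept <- cols[!is_pii | is_age_cat]
kept
```

With these illustrative patterns, `patient_name`, `dob`, and `mrn` are flagged, while `age5cat` and the clinical columns are kept.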

Security Note: The lookup table contains sensitive information and should be stored securely with appropriate access controls. Consider encrypting this file if storing to disk.
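
One way to apply basic access controls when writing the lookup table to disk is to restrict the file's permissions, as sketched below in base R. This covers file permissions only, not encryption, which would need an additional package. The `lookup` data frame and the temporary path are placeholders, and `Sys.chmod()` has only limited effect on Windows.

```r
# Placeholder lookup table; in practice this would be result$lookup.
lookup <- data.frame(
  mrn      = c("12345", "67890"),
  row_hash = c("h1", "h2")
)

# Placeholder location; choose a directory with appropriate access controls.
path <- tempfile(fileext = ".rds")
saveRDS(lookup, path)

# Restrict the file to owner read/write only (limited effect on Windows).
Sys.chmod(path, mode = "0600", use_umask = FALSE)

# Reading the table back later for relinking.
restored <- readRDS(path)
```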

Examples

# Create sample data
patient_data <- data.frame(
  patient_name = c("John Doe", "Jane Smith"),
  dob = as.Date(c("1980-01-01", "1975-05-15")),
  mrn = c("12345", "67890"),
  age5cat = factor(c("18-64", "18-64")),
  diagnosis = c("Condition A", "Condition B"),
  lab_value = c(120, 95)
)

# Basic de-identification (age categories automatically retained)
result <- suppressMessages(molting(patient_data))
names(result$deidentified)  # Check column names
head(result$deidentified, 2)  # View de-identified data

# Use different hash method
result_md5 <- suppressMessages(
  molting(patient_data, hash_method = "md5")
)

# Return only de-identified data (no lookup table)
deidentified_only <- suppressMessages(
  molting(patient_data, return_lookup = FALSE)
)

# Add specific columns to PII removal
# (study_id is illustrative; it is not a column of patient_data)
result_custom <- suppressMessages(
  molting(patient_data, additional_pii_cols = c("study_id"))
)

# Specify custom identifier columns for hashing
result_ids <- suppressMessages(
  molting(patient_data, id_cols = c("mrn", "dob"))
)
