epitrix (version 0.4.0)

hash_names: Anonymise data using scrypt

Description

This function uses the scrypt algorithm from libsodium to anonymise data, based on user-indicated data fields. Data fields are first concatenated, then each entry is hashed. The function can return either a full, detailed output or short labels ready to use as 'anonymised data'. Before being concatenated (using "_" as a separator) to form labels, inputs are by default cleaned using [clean_labels()].
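The steps described above (clean, concatenate with "_", hash) can be sketched by hand for a single record. This is only an illustration of the idea, not the package's actual implementation; it uses SHA-256 instead of scrypt to keep it quick, and the variable names are made up for the example.

```r
library(epitrix)

# Clean each field, then join the fields with "_" to form one label
label <- paste(clean_labels("Jane"), clean_labels("Doe"), sep = "_")

# Hash the label: convert to raw bytes, hash, then hex-encode the digest
full_hash <- sodium::bin2hex(sodium::sha256(charToRaw(label)))

# Keep only the first few characters as a short anonymised label
short_hash <- substr(full_hash, 1, 6)
```
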

Usage

hash_names(
  ...,
  size = 6,
  full = TRUE,
  hashfun = "secure",
  salt = NULL,
  clean_labels = TRUE
)

Arguments

...

Data fields to be hashed.

size

The number of characters retained in the hash.

full

A logical indicating whether the full output should be returned as a data.frame, including original labels, shortened hash, and full hash.

hashfun

This defines the hashing function to be used. If you specify "secure" (default), it will use [sodium::scrypt()], which is secure but slow for large data sets. For fast hashing with no collisions, you can specify "fast", and it will use [sodium::sha256()], which is several orders of magnitude faster than [sodium::scrypt()]. You can also specify a hashing function that takes and returns a [raw][base::raw] vector of bytes that can be converted to character with [rawToChar()].

salt

An optional object that can be coerced to a character to be used to 'salt' the hashing algorithm (see details). Ignored if `NULL`.

clean_labels

A logical indicating whether labels of variables should be standardized; defaults to `TRUE`.
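As a sketch of the `hashfun` argument accepting a function rather than "secure" or "fast", the call below passes [sodium::sha512()], which takes and returns a raw vector. The choice of SHA-512 is an assumption made for illustration only; it is not a recommendation from the package authors.

```r
library(epitrix)

first_name <- c("Jane", "Joe", "Raoul")
last_name <- c("Doe", "Smith", "Dupont")

# Pass a function (raw vector in, raw vector out) instead of a keyword
ids <- hash_names(first_name, last_name,
                  size = 8, full = FALSE, hashfun = sodium::sha512)
```
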

Author

Thibaut Jombart thibautjombart@gmail.com, Dirk Schumacher mail@dirk-schumacher.net, Zhian N. Kamvar zkamvar@gmail.com

Details

The argument `salt` should be used for salting the algorithm, i.e. adding an extra input to the input fields (the 'salt') to change the resulting hash and prevent identification of individuals via pre-computed hash tables.

It is highly recommended to choose a secret, random salt to make it harder for an attacker to decode the hash.
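One possible way to obtain a random salt is to draw bytes from libsodium's random generator and hex-encode them. This is a minimal sketch, assuming 16 bytes is long enough for your purposes; the name `secret_salt` is made up for the example.

```r
library(epitrix)

# 16 random bytes, hex-encoded into a 32-character string
secret_salt <- sodium::bin2hex(sodium::random(16))

first_name <- c("Jane", "Joe", "Raoul")
last_name <- c("Doe", "Smith", "Dupont")

unsalted <- hash_names(first_name, last_name, full = FALSE)
salted <- hash_names(first_name, last_name, salt = secret_salt, full = FALSE)
```

The same salt must be reused to reproduce the same labels later, so it needs to be stored somewhere safe and kept separate from the anonymised data.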

See Also

[clean_labels()], used to clean labels prior to hashing
[sodium::hash()] for available hashing functions.

Examples


library(epitrix)

first_name <- c("Jane", "Joe", "Raoul")
last_name <- c("Doe", "Smith", "Dupont")
age <- c(25, 69, 36)

# secure hashing
hash_names(first_name, last_name, age, hashfun = "secure")

# fast hashing
hash_names(first_name, last_name, age,
           size = 8, full = FALSE, hashfun = "fast")


## salting the hashing (more secure!)

hash_names(first_name, last_name) # unsalted - less secure
hash_names(first_name, last_name, salt = 123) # salted with an integer
hash_names(first_name, last_name, salt = "foobar") # salted with a character string

## using a different hash algorithm if you want things to run faster

hash_names(first_name, last_name, hashfun = "fast") # use sha256 algorithm
