Background

Linkage of administrative data sources is an efficient approach for conducting research on large populations, avoiding the time and cost of traditional data collection methods. Careful development of methods for linking data where unique identifiers are not available is key to avoiding bias resulting from linkage errors. However, the development and evaluation of new methods are limited by restricted access to identifier data for these purposes. Generating synthetic datasets of personal identifiers, which replicate the frequencies and errors of identifiers observed in administrative data, could facilitate the development of new methods.

Aim

We aimed to develop the sdglinkage package for generating synthetic datasets for linkage method development, consisting of i) a gold standard file with complete and accurate information and ii) linkage files that are corrupted in the ways we often see in raw datasets.

Workflow

The package has several main types of functions:

  1. Acquire Error Flags, which extracts errors from a real linkage file and classifies them into binary flags. These flags allow us to learn how often errors occur and, consequently, to replicate the errors in the synthetic linkage files.
  2. Add Error Flags, which allows us to add random or dependent errors to our real gold standard file when we do not have access to the corrupted files but do have access to error statistics.
  3. Mask Sensitive Variables, which replaces sensitive variables with variables from another published database.
  4. Synthetic Data Generator, which learns the statistics and dependencies of variables and samples synthetic data from the learned models.
  5. Synthetic Data Evaluation, which compares the synthetic data with the real data and provides visual and predictive comparisons of the quality of the synthetic data.
  6. Custom Generation Rules, which allows us to generate synthetic variables that are not included in the real dataset.
  7. Damage Actions, which corrupts the dataset according to the types of errors that occurred.
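
The flag-then-damage idea behind steps 1, 2 and 7 can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper names `flag_mismatch`, `add_random_flag` and `transpose_chars`, the column names and the toy records are all hypothetical, and the corresponding sdglinkage functions have their own arguments, which are documented in the package reference.

```r
# Toy gold standard file and a corrupted linkage file (made-up values).
gold <- data.frame(
  firstname = c("anna", "james", "sofia"),
  lastname  = c("smith", "brown", "jones"),
  stringsAsFactors = FALSE
)
linkage <- gold
linkage$firstname[2] <- "jmaes"  # a transposition error in one record

# Hypothetical helper: binary flag, 1 where the two files disagree on `var`.
flag_mismatch <- function(clean, dirty, var) {
  as.integer(clean[[var]] != dirty[[var]])
}

# Hypothetical helper: draw an independent error flag for each row with
# probability `prob`, for the case where only error rates are known.
add_random_flag <- function(df, flag_name, prob) {
  df[[flag_name]] <- rbinom(nrow(df), size = 1, prob = prob)
  df
}

# Hypothetical damage action: swap the characters at positions pos and pos + 1.
# Strings too short for the swap are returned unchanged.
transpose_chars <- function(x, pos) {
  chars <- strsplit(x, "")[[1]]
  if (pos < 1 || pos >= length(chars)) return(x)
  chars[c(pos, pos + 1)] <- chars[c(pos + 1, pos)]
  paste(chars, collapse = "")
}

# Step 1: learn where errors occurred by comparing the two files.
linkage$firstname_flag <- flag_mismatch(gold, linkage, "firstname")

# Step 2: add random error flags to the gold standard file.
set.seed(1)
gold_flagged <- add_random_flag(gold, "lastname_flag", prob = 0.3)

# Step 7: apply the damage only to the rows whose flag is 1.
damaged <- gold_flagged
hit <- damaged$lastname_flag == 1
damaged$lastname[hit] <- vapply(damaged$lastname[hit], transpose_chars,
                                character(1), pos = 1, USE.NAMES = FALSE)
```

Keeping the flags separate from the damage step means the same flagged file can be corrupted with different error types without redrawing where the errors fall.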

These functions can be organised as:

[Figure: workflow diagram of the package functions]

Vignette

We also provide three vignettes to show how we can use the package:

  • Vignette Synthetic_Data_Generation_and_Evaluation shows how to generate synthetic data and to evaluate the quality of the synthetic data.
  • Vignette Generation_of_Gold_Standard_File_and_Linkage_Files shows how to generate synthetic gold standard and linkage files when we have access to non-sensitive predictor variables.
  • Vignette From_Sensitive_Real_Identifiers_to_Synthetic_Identifiers shows how to generate synthetic identifiers when we have access to sensitive identifiers.

Install

install.packages('sdglinkage')

Version

0.1.0

License

MIT + file LICENSE

Maintainer

Haoyuan Zhang

Last Published

April 27th, 2020

Functions in sdglinkage (0.1.0)

acquire_error_flag

Add a column of error flags given two data frames.
do_typo_replacement

Replace a string with its typo error.
gen_cart

Generate synthetic data using CART.
add_dependent_error

Add two dependent error flags to a data frame.
gen_bn_learn

Generate synthetic data using BN learning.
compare_sdg

Compare the performance of generators.
get_transformation_trans_char

Randomly transpose two neighbouring characters.
get_transformation_trans_date

Transpose the position of day and month.
do_pho_replacement

Replace a string with its phonetic error.
adult

Adult dataset.
replace_firstname

Replace the firstnames with values from another database.
do_ocr_replacement

Replace a string with its OCR error.
get_transformation_typo

Encode typographic error to a string.
slavo_germanic

Detect whether a string has a Slavo-Germanic transformation.
get_transformation_name_variant

Randomly assign a name to its variant.
plot_compared_sdg

Plot the distribution of a variable in the synthetic data against that in the real data.
plot_bn

Plot the BN structure.
get_address

Get an address.
firstname_uk_variant

First name variants in the UK.
get_transformation_insert

Insert a character/digit/space/symbol randomly.
get_transformation_del

Delete a character randomly.
replace_lastname

Replace the lastnames with values from another database.
firstname_us

First names in the US census.
gen_address

Generate an address.
gen_dob

Generate a record of date of birth.
get_transformation_ocr

Encode OCR error to a string.
gen_bn_elicit

Generate synthetic data using BN parameter learning with an elicited structure.
gen_lastname

Randomly generate a lastname.
split_data

Split the data into a training_set and a testing_set.
get_transformation_pho

Encode phonetic error to a string.
compare_two_df

Compare two data frames.
gen_firstname

Randomly generate a firstname.
ocr_rules

Look up table of Optical Character Recognition (OCR) errors.
damage_gold_standard

Generate a linkage file by damaging the gold standard file.
replace_nhsid

Replace nhsid with another random nhsid.
gen_nhsid

Generate a random nhsid.
pho_rules

Look up table of phonetic errors.
lastname_uk

Last names in the UK.
lastname_uk_variant

Last name variants in the UK.
lastname_us

Last names in the US census.
bn_flag_inference

Bayesian inference for error prediction.
compare_cart

Compare the synthetic data generated by CART with the real data.
add_variable

Add a synthetic but realistic variable to a dataset following some rules.
check_swap_char

Check if two strings are the same after swapping the position of two letters.
add_random_error

Add random error flags to a data frame.
address_uk

UK addresses.
diff_two_strings

Find all letters in string1 which are not in string2. The function is adapted from vsetdiff in the vecsets package.
extract_address

Extract addresses.
firstname_uk

Baby birth first names in England and Wales.
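
To make a couple of the one-line descriptions above concrete, here is a base R sketch of two transformations in the spirit of get_transformation_trans_date and get_transformation_del. The names `swap_day_month` and `delete_random_char` are hypothetical stand-ins, the "YYYY-MM-DD" input format is an assumption, and so is the choice to leave a date untouched when the swap would not give a valid date. The package functions may take different arguments and treat these cases differently.

```r
# Hypothetical stand-in: swap the day and month of a "YYYY-MM-DD" string.
# Assumption: if the swapped value is not a valid date (for example the
# day is greater than 12), the input is returned unchanged.
swap_day_month <- function(date_string) {
  parts <- strsplit(date_string, "-", fixed = TRUE)[[1]]
  if (length(parts) != 3) return(date_string)
  swapped <- paste(parts[1], parts[3], parts[2], sep = "-")
  parsed <- as.Date(swapped, format = "%Y-%m-%d")
  if (is.na(parsed)) date_string else swapped
}

# Hypothetical stand-in: delete one character at a random position.
# Empty strings are returned unchanged.
delete_random_char <- function(x) {
  n <- nchar(x)
  if (n == 0) return(x)
  pos <- sample.int(n, 1)
  paste0(substr(x, 1, pos - 1), substr(x, pos + 1, n))
}

swap_day_month("2001-03-05")   # day and month exchanged
swap_day_month("2001-03-25")   # unchanged, since 25 is not a valid month
set.seed(42)
delete_random_char("smith")    # one character shorter
```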