OTrecod package

A package dedicated to data fusion

Introduction

The OTrecod package gives access to a set of original functions dedicated to data fusion.

 

Package installation

If OTrecod is not yet installed in the current R installation, it can be installed with the standard command:

install.packages("OTrecod")

Each time a new R session is opened, the OTrecod library must be loaded with:

library(OTrecod)

The development version of OTrecod can currently be installed from GitHub with:

# Install development version from GitHub
devtools::install_github("otrecoding/OTrecod")

 

Database examples and expected structure before data fusion

data(simu_data)
dim(simu_data)
[1] 700   8
simu_data[c(1:5,301:305),]
    DB     Yb1 Yb2 Gender Treatment Dosage Smoking      Age
1    A [40-60[  NA Female     Trt A  Dos 3     YES 65.44273
2    A [20-40]  NA   Male      <NA>  Dos 2      NO 51.78596
3    A [40-60[  NA Female   Placebo  Dos 2     YES 49.10844
4    A [40-60[  NA Female     Trt B  Dos 4    <NA> 56.43524
5    A [40-60[  NA Female     Trt A  Dos 4     YES 44.77365
301  B    <NA>   5 Female   Placebo  Dos 2     YES 44.58233
302  B    <NA>   1 Female     Trt B  Dos 4    <NA> 65.23921
303  B    <NA>   2 Female   Placebo   <NA>      NO 51.64228
304  B    <NA>   2 Female     Trt A   <NA>      NO 50.15125
305  B    <NA>   1 Female     Trt B  Dos 4     YES 61.53242
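
The printout above suggests the layout the package works from: the two databases are stacked in a single data frame, one column identifies the source database, and each target variable is observed in only one source. The following base-R sketch checks that layout on a small made-up data frame. The object name `toy` and all of its values are hypothetical; only the column names mirror simu_data.

```r
# Hypothetical toy data frame mimicking the layout of simu_data:
# DB identifies the source, Yb1 is observed only in A, Yb2 only in B.
toy <- data.frame(
  DB     = c("A", "A", "A", "B", "B", "B"),
  Yb1    = factor(c("[20-40]", "[40-60[", "[40-60[", NA, NA, NA)),
  Yb2    = c(NA, NA, NA, 5, 1, 2),
  Gender = factor(c("Female", "Male", "Female", "Female", "Female", "Male")),
  Age    = c(65.4, 51.8, 49.1, 44.6, 65.2, 51.6)
)

# Number of missing values per column within each source database
na_by_db <- sapply(toy[, -1], function(col) tapply(is.na(col), toy$DB, sum))
print(na_by_db)

# In this toy example each target is fully observed in its own database
# and fully missing in the other one.
yb1_only_in_A <- all(is.na(toy$Yb1[toy$DB == "B"])) && !anyNA(toy$Yb1[toy$DB == "A"])
yb2_only_in_B <- all(is.na(toy$Yb2[toy$DB == "A"])) && !anyNA(toy$Yb2[toy$DB == "B"])
```

The same two checks can be run on a real stacked data frame before calling any of the functions described below.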

 

Support functions

merge_dbs

The merge_dbs function is a preprocessing function dedicated to harmonizing two data sources before data fusion. By default, all variables other than the target variables that carry the same label in both databases are considered shared. Before merging, merge_dbs detects potential discrepancies between the variables by:

  • first excluding variables whose label appears in only one of the two databases;
  • excluding a priori shared variables whose types differ;
  • excluding a priori shared factors whose levels differ.
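
The three exclusion rules listed above can be illustrated in base R. This is only a sketch of the kind of comparison involved, not the package's implementation; the data frames `db1` and `db2`, their columns and their values are invented for the example.

```r
# Two hypothetical sources; names and values are invented for illustration.
db1 <- data.frame(
  Y      = factor(c("low", "high", "low")),
  Gender = factor(c("F", "M", "F")),
  Dosage = factor(c("Dos 1", "Dos 2", "Dos 3")),
  Age    = c(45, 52, 61),
  Weight = c(70, 82, 65)                   # present in db1 only
)
db2 <- data.frame(
  Z      = factor(c("1", "3", "2")),
  Gender = factor(c("M", "F", "F")),
  Dosage = factor(c("D1", "D3", "D1")),    # same label, different levels
  Age    = c("40-50", "50-60", "60-70"),   # same label, different type
  stringsAsFactors = FALSE
)

targets <- c("Y", "Z")
cand1 <- setdiff(names(db1), targets)
cand2 <- setdiff(names(db2), targets)

# 1. Labels present in only one of the two databases
only_one_db <- union(setdiff(cand1, cand2), setdiff(cand2, cand1))

# 2. Same label but different types
same_label <- intersect(cand1, cand2)
type_differs <- same_label[vapply(same_label, function(v) {
  !identical(class(db1[[v]]), class(db2[[v]]))
}, logical(1))]

# 3. Factors with the same label but different levels
same_type <- setdiff(same_label, type_differs)
levels_differ <- same_type[vapply(same_type, function(v) {
  is.factor(db1[[v]]) && !identical(levels(db1[[v]]), levels(db2[[v]]))
}, logical(1))]

# Variables that survive all three checks
shared <- setdiff(same_type, levels_differ)
```

In this toy case only Gender survives; Weight, Age and Dosage are each flagged by one rule, which is the information a user would need to recode and reintroduce them manually.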

The current version of the function does not attempt to reconcile and reintroduce the problematic variables automatically, but its output gives users enough information to do so themselves if necessary. The merge_dbs function is currently defined as:

merge_dbs = function(DB1, DB2, row_ID1 = NULL, row_ID2 = NULL, NAME_Y, NAME_Z,
                     order_levels_Y = levels(DB1[, NAME_Y]), order_levels_Z = levels(DB2[, NAME_Z]),
                     ordinal_DB1 = NULL, ordinal_DB2 = NULL,
                     impute = "NO", R_MICE = 5, NCP_FAMD = 3, seed_choice = sample(1:1000000, 1))
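
A possible call, sketched under several assumptions: that simu_data can be split back into its two sources by the DB column, that the target variables should be factors (the defaults `levels(DB1[, NAME_Y])` and `levels(DB2[, NAME_Z])` in the signature suggest so), and that `ordinal_DB1` and `ordinal_DB2` take column indices. The object names `data_A`, `data_B` and `harmonized` are arbitrary, and the block does nothing if OTrecod is not installed.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  # Split the stacked example data into its two sources, keeping each
  # source's own target variable plus the covariates (columns 4 to 8).
  data_A <- simu_data[simu_data$DB == "A", c(2, 4:8)]  # Yb1 + covariates
  data_B <- simu_data[simu_data$DB == "B", c(3, 4:8)]  # Yb2 + covariates

  # Assumption: target variables are expected to be factors.
  data_B$Yb2 <- as.factor(data_B$Yb2)

  harmonized <- merge_dbs(
    data_A, data_B,
    NAME_Y = "Yb1", NAME_Z = "Yb2",
    ordinal_DB1 = c(1, 4),  # assumed: column indices of Yb1 and Dosage
    ordinal_DB2 = c(1, 4),  # assumed: column indices of Yb2 and Dosage
    seed_choice = 3011      # arbitrary fixed seed for reproducibility
  )

  # Inspect the returned object rather than assuming its element names.
  str(harmonized, max.level = 1)
}
```

Printing the top-level structure is a cautious way to find which variables were dropped and why before attempting any manual reconciliation.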

 

select_pred

The select_pred function is a preprocessing function dedicated to the selection of matching variables. This selection is essential when the initial set of shared variables is large, and also because the choice of predictors greatly influences the quality of the data fusion, whichever optimal transportation algorithm is chosen afterwards.

The select_pred function is currently defined as:

select_pred = function(databa, Y = NULL, Z = NULL, ID = 1, OUT = "Y", quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL,
                       convert_num = NULL, convert_class = NULL, thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
                       RF = TRUE, RF_ntree = 500, RF_condi = FALSE, RF_condi_thr = 0.20, RF_SEED = sample(1:1000000, 1))
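
A possible call on simu_data, given as a sketch rather than a reference usage. The column indices are read off the printout of simu_data shown earlier (DB, Yb1, Yb2, Gender, Treatment, Dosage, Smoking, Age); treating Yb2 and Dosage as ordinal is an assumption, as is the meaning attached to `ID` and `OUT` in the comments. The object name `sel` is arbitrary, and the block does nothing if OTrecod is not installed.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  sel <- select_pred(
    simu_data,
    Y = "Yb1", Z = "Yb2",
    ID = 1,                   # assumed: index of the database identifier (DB)
    OUT = "Y",                # assumed: select predictors for the Y target
    quanti  = 8,              # Age
    nominal = c(1, 4:5, 7),   # DB, Gender, Treatment, Smoking
    ordinal = c(2:3, 6),      # Yb1, Yb2, Dosage (assumed ordinal)
    RF = TRUE, RF_ntree = 500,
    RF_SEED = 3017            # arbitrary fixed seed for reproducibility
  )

  # Inspect the returned object rather than assuming its element names.
  str(sel, max.level = 1)
}
```

Fixing `RF_SEED` matters because its default draws a new random seed on every call, so two runs would otherwise not be comparable.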

 

verif_OT

The verif_OT function is currently defined as:

verif_OT = function(ot_out, group.class = FALSE, ordinal = TRUE, stab.prob = FALSE, min.neigb = 1, R = 10, seed.stab = sample(1:1000000, 1))

 

Optimal transportation functions

The OTrecod package provides two algorithms that use optimal transportation theory to solve recoding problems in data fusion contexts (see (1) and (2) for more details). Each algorithm is implemented in its own function, and each function returns a single synthetic database in which the two initial data sources are stacked and the missing information for one or both target variables is fully completed.

Each of the two algorithms also offers enrichments that relax the initial distributional constraints and add regularization terms, as described in (2).

 

OT_outcome

The OT_outcome function provides individual predictions of the incomplete target variables by treating recoding as an optimal transportation problem on the outcomes only (see (1) and (2) for more details).

The OT_outcome function is defined as:

OT_outcome = function(datab, index_DB_Y_Z = 1:3, quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL,
                      convert.num = NULL, convert.class = NULL, FAMD.coord = "NO", FAMD.perc = 0.8,
                      dist.choice = "E", percent.knn = 1, maxrelax = 0, indiv.method = "sequential",
                      prox.dist = 0.30, solvR = "glpk", which.DB = "BOTH")
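
A possible call on a subsample of simu_data, sketched with the default distance and solver from the signature above. It assumes that the default `index_DB_Y_Z = 1:3` matches simu_data, whose first three columns are DB, Yb1 and Yb2, and that Yb2 and Dosage can be declared ordinal. The object names `simu_sub` and `outc_fit` are arbitrary, and the block does nothing if OTrecod is not installed.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  # Keep 200 rows from each source to limit computation time.
  simu_sub <- simu_data[c(1:200, 301:500), ]

  outc_fit <- OT_outcome(
    simu_sub,
    quanti  = 8,              # Age
    nominal = c(1, 4:5, 7),   # DB, Gender, Treatment, Smoking
    ordinal = c(2:3, 6),      # Yb1, Yb2, Dosage (assumed ordinal)
    maxrelax = 0,             # no relaxation of the distributional constraints
    indiv.method = "sequential",
    which.DB = "BOTH"         # complete the targets in both databases
  )

  # List the components of the result rather than assuming their names.
  names(outc_fit)
}
```

Setting `maxrelax` above 0 would correspond to the relaxed variant mentioned earlier; the value to use is problem dependent and is not suggested here.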

 

OT_joint

The OT_joint function provides individual predictions of the incomplete target variables by treating recoding as an optimal transportation problem on the shared variables and the outcomes jointly (see (2) for more details).

The OT_joint function is defined as:

OT_joint = function(datab, index_DB_Y_Z = 1:3, nominal = NULL, ordinal = NULL, logic = NULL,
                    convert.num = NULL, convert.class = NULL, dist.choice = "E", percent.knn = 1,
                    maxrelax = 0, lambda.reg = 0.0, prox.X = 0.10, solvR = "glpk", which.DB = "BOTH")
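
A possible call, sketched under conservative assumptions. Because the signature above has no `quanti` argument, the continuous Age column is simply dropped here instead of being discretized with `convert.num` and `convert.class`; rows with missing covariates are also removed as a precaution, which may not be required by the function. The object names `simu_sub` and `joint_fit` are arbitrary, and the block does nothing if OTrecod is not installed.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  # 100 rows from each source, without the continuous Age column (column 8).
  simu_sub <- simu_data[c(1:100, 301:400), 1:7]

  # Conservative choice: keep only rows with complete shared covariates.
  simu_sub <- simu_sub[complete.cases(simu_sub[, 4:7]), ]

  joint_fit <- OT_joint(
    simu_sub,
    nominal = c(1, 4:5, 7),   # DB, Gender, Treatment, Smoking
    ordinal = c(2:3, 6),      # Yb1, Yb2, Dosage (assumed ordinal)
    maxrelax = 0,             # no relaxation of the distributional constraints
    lambda.reg = 0.0,         # no regularization term
    which.DB = "BOTH"         # complete the targets in both databases
  )

  # List the components of the result rather than assuming their names.
  names(joint_fit)
}
```

Nonzero values of `maxrelax` and `lambda.reg` would switch on the relaxation and regularization enrichments described in (2).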

 

References

  1. Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics, Volume 16, Issue 1, 20180106, eISSN 1557-4679.

  2. Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association.

Version: 0.1.2
License: GPL-3
Maintainer: Gregory Guernec
Last Published: October 5th, 2022

Functions in OTrecod (0.1.2)

  • ham()
  • power_set()
  • tab_test: A simulated dataset to test the library
  • indiv_grp_optimal()
  • simu_data: A simulated dataset to test the functions of the OTrecod package
  • select_pred()
  • merge_dbs()
  • transfo_dist()
  • proxim_dist()
  • ncds_5: National Child Development Study: a sample of the fifth wave of data collection
  • ncds_14: National Child Development Study: a sample of the first four waves of data collection
  • verif_OT()
  • transfo_quali()
  • transfo_target()
  • error_group()
  • avg_dist_closest()
  • imput_cov()
  • compare_lists()
  • OT_outcome()
  • indiv_grp_closest()
  • OT_joint()
  • api35: Student performance in California schools: the results of the county 35
  • api29: Student performance in California schools: the results of the county 29