Joins selected variables and/or whole tables from the tabulated data/shadow
files into a single data frame. Expects the data files to be stored in one
directory in .parquet or .tsv format, with one file per table following
the naming convention of the respective NBDC dataset (from the ABCD or HBCD
studies). Typically, this will be the rawdata/phenotype/ directory within
a BIDS dataset downloaded from the NBDC Data Hub.
Variables specified in vars and tables will be full-joined together,
i.e., all rows will be kept, even if they do not have a value for all
columns. Variables specified in vars_add will be left-joined to the
variables selected in vars and tables, i.e., only the values for already
existing rows will be added and no new rows will be created. This is useful
for adding variables to the dataset that are important for a given analysis
but are not the main variables of interest (e.g., design/nesting or
demographic information). By left-joining these variables, one avoids
creating new rows that contain only missing values for the main variables of
interest selected using vars and tables. If the same variables are
specified in vars/tables and vars_add/tables_add, the variables in
vars_add/tables_add will be ignored.
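The difference between the two join types can be illustrated with a small base R sketch. This is not the package's implementation; the data frames, the `id` column, and the variable names are made up for illustration, and base `merge()` stands in for whatever join routine the package uses internally.

```r
# Toy tables keyed by a hypothetical ID column (not real NBDC data).
main_a <- data.frame(id = c("p1", "p2"), var_1 = c(10, 20))
main_b <- data.frame(id = c("p2", "p3"), var_2 = c("x", "y"))
extra  <- data.frame(id = c("p1", "p4"), site = c("s01", "s04"))

# Full join of the main variables: every id found in either table is kept,
# with NA where a table has no value for that id.
main <- merge(main_a, main_b, by = "id", all = TRUE)

# Left join of the additional variable: values are added for ids that already
# exist in `main`; "p4" occurs only in `extra` and does not create a new row.
joined <- merge(main, extra, by = "id", all.x = TRUE)

print(joined)
```

In this sketch, `main_a` and `main_b` play the role of variables selected through vars/tables, and `extra` plays the role of a variable selected through vars_add.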
In addition to the main join_tabulated() function, there are two
study-specific variations:
join_tabulated_abcd(): for the ABCD study.
join_tabulated_hbcd(): for the HBCD study.
They have the same arguments as the join_tabulated() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
join_tabulated(
  dir_data,
  study,
  vars = NULL,
  tables = NULL,
  vars_add = NULL,
  tables_add = NULL,
  release = "latest",
  format = "parquet",
  shadow = FALSE,
  remove_empty_rows = TRUE,
  bypass_ram_check = FALSE
)
join_tabulated_abcd(...)
join_tabulated_hbcd(...)
A tibble of data or shadow matrix with the joined variables.
dir_data: character. Path to the directory with the data files in
.parquet or .tsv format.
study: character. NBDC study (one of "abcd" or "hbcd").
vars: character (vector). Name(s) of variable(s) to be joined
(default: NULL, i.e., no variables are selected; at least one of vars or
tables has to be provided).
tables: character (vector). Name(s) of table(s) to be joined (default:
NULL, i.e., no tables are selected; at least one of vars or tables has to
be provided).
vars_add: character (vector). Name(s) of additional variable(s) to be
left-joined to the variables selected in vars and tables (default:
NULL, i.e., no additional variables are selected).
tables_add: character (vector). Name(s) of additional table(s) to be
left-joined to the variables selected in vars and tables (default:
NULL, i.e., no additional tables are selected).
release: character. Release version (default: "latest").
format: character. Data format (one of "parquet" or "tsv"; default:
"parquet").
shadow: logical. Whether to join the shadow matrix
instead of the data table (default: FALSE).
remove_empty_rows: logical. Whether to filter out rows that have
all values missing in the joined variables, except for the
ID columns (default: TRUE).
bypass_ram_check: logical. If TRUE, the function will not abort
when the number of variables exceeds 10000 and the currently available RAM
is less than 75% of the estimated RAM usage (default: FALSE). The check is
meant to avoid a long data load that then fails partway through because of
insufficient RAM. For large datasets, it is recommended to have at least
twice the estimated RAM available before running this function.
This argument is only used for the ABCD study, because the HBCD data is
currently small enough to load without RAM issues on most personal
computers. This may change as the HBCD data grows.
...: Additional arguments passed to the underlying function
join_tabulated().
Note: Setting remove_empty_rows to FALSE is useful when processing shadow
matrices. Some shadow-related functions expect the shadow matrix to
have the same dimensions as the original data in order to work correctly.
See shadow_bind_data() and shadow_replace_binding_missing().
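As a rough illustration of what filtering empty rows means, the following base R sketch drops rows in which every non-ID column is missing. It is an assumption-laden approximation, not the package's code: the data frame, the `id` column, and the `id_cols` helper are hypothetical, and the actual ID columns depend on the study.

```r
# Toy joined result; "p2" has no value in any of the joined variables.
df <- data.frame(
  id    = c("p1", "p2", "p3"),
  var_1 = c(1, NA, NA),
  var_2 = c(NA, NA, "b")
)

# Hypothetical ID column(s); these are excluded from the emptiness check.
id_cols <- "id"
value_cols <- setdiff(names(df), id_cols)

# A row is "empty" when none of its non-ID columns holds a value.
all_missing <- rowSums(!is.na(df[value_cols])) == 0

# Roughly what remove_empty_rows = TRUE is described as doing.
kept <- df[!all_missing, , drop = FALSE]

print(kept)
```

Keeping the empty rows instead (the equivalent of remove_empty_rows = FALSE) leaves `df` with its original number of rows, which is the property the shadow-related functions mentioned above rely on.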
if (FALSE) {
  join_tabulated(
    dir_data = "path/to/data/",
    vars = c("var_1", "var_2", "var_3"),
    tables = c("table_1", "table_2"),
    study = "abcd",
    release = "6.0"
  )
}
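The rule described for bypass_ram_check can be written as a small decision function. This is only a sketch of the rule as worded in the argument description; `would_abort()` and its parameters are hypothetical names, and how the package actually estimates RAM usage or measures available RAM is not shown here.

```r
# Hypothetical helper: returns TRUE when, under the documented rule, the
# function would abort before loading the data.
would_abort <- function(n_vars, ram_available, ram_estimated,
                        bypass_ram_check = FALSE) {
  # The check applies only to large requests (more than 10000 variables).
  too_many_vars <- n_vars > 10000
  # Available RAM below 75% of the estimated usage counts as insufficient.
  too_little_ram <- ram_available < 0.75 * ram_estimated
  too_many_vars && too_little_ram && !bypass_ram_check
}

# 12000 variables, 8 GB available, 16 GB estimated: the check would abort.
would_abort(12000, ram_available = 8, ram_estimated = 16)

# Same request with the check bypassed: loading would proceed.
would_abort(12000, ram_available = 8, ram_estimated = 16,
            bypass_ram_check = TRUE)
```

The units of the two RAM arguments do not matter in this sketch as long as they are the same, since only their ratio is compared.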