pharmaversesdtm
Test data (SDTM) for the pharmaverse family of packages
Purpose {#purpose}
To provide a one-stop-shop for SDTM test data in the pharmaverse family of packages. This includes datasets that are therapeutic area (TA)-agnostic (DM, VS, EG, etc.) as well as TA-specific ones (RS, TR, OE, etc.).
Installation {#installation}
The package is available from CRAN and can be installed by running install.packages("pharmaversesdtm"). To install the latest development version of the package directly from GitHub, use the following code:
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
# Install the latest development version directly from GitHub.
remotes::install_github("pharmaverse/pharmaversesdtm", ref = "main")
Data Sources {#data-sources}
Some test datasets have been sourced from the CDISC pilot project, while other datasets have been constructed ad-hoc by the {admiral} team. Please check the Reference page for detailed information regarding the source of specific datasets.
Naming Conventions {#naming}
- Datasets that are TA-agnostic: same as SDTM domain name (e.g., dm, rs).
- Datasets that are TA-specific: domain_TA_others, others go from broader categories to more specific ones (e.g., oe_ophtha, rs_onco, rs_onco_irecist).
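As a rough illustration of the convention, the sketch below splits a dataset name into its domain, TA, and remaining parts. The helper `parse_dataset_name()` is hypothetical and not part of {pharmaversesdtm}; it simply assumes that the parts are separated by underscores.

```r
# Hypothetical helper (not part of the package): split a dataset name,
# assuming the pattern <domain>[_<TA>[_<others>...]].
parse_dataset_name <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  list(
    domain = parts[1],
    # TA-agnostic datasets have no second part
    ta     = if (length(parts) >= 2) parts[2] else NA_character_,
    # anything after the TA, ordered from broader to more specific
    others = if (length(parts) >= 3) parts[-(1:2)] else character(0)
  )
}

parse_dataset_name("dm")
parse_dataset_name("rs_onco_irecist")
```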
Note: If an SDTM domain is used by multiple TAs, {pharmaversesdtm} may provide multiple versions of the corresponding test dataset. For instance, the package contains ex and ex_ophtha as the latter contains ophthalmology-specific variables such as EXLAT and EXLOC, and EXROUTE is exchanged for a plausible ophthalmology value.
How To Update {#how-to-update}
Firstly, make a GitHub issue in {pharmaversesdtm} with the planned updates and tag @pharmaverse/admiral so that a member of the core development team can sanity-check the request. There are then two main ways to extend the test data: adding new datasets, or extending existing datasets with new records/variables. Whichever method you choose, note the following:
- Programs that generate test data are stored in the data-raw/ folder.
- Each of these programs is written as a standalone R script: if any packages need to be loaded for a given program, then call library() at the start of the program (but please do not call library(pharmaversesdtm)).
- When you have created a program in the data-raw/ folder, you need to run it as a standalone R script, in order to generate a test dataset that will become part of the {pharmaversesdtm} package, but you do not need to build the package.
- Following best practice, each dataset is stored as a .rda file whose name is consistent with the name of the dataset, e.g., dataset xx is stored as xx.rda. The easiest way to achieve this is to use usethis::use_data(xx).
- The programs in data-raw/ are stored within the {pharmaversesdtm} GitHub repository, but they are not part of the {pharmaversesdtm} package, because the data-raw/ folder is specified in .Rbuildignore.
- When you run a program that is in the data-raw/ folder, you generate a dataset that is written to the data/ folder, which will become part of the {pharmaversesdtm} package.
- The names and sources of test datasets are specified in R/*.R, for the purpose of generating documentation in the man/ folder.
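The points above could look roughly like the following standalone script. This is a minimal sketch, not an actual program from data-raw/: the dataset name `xx` and its contents are made up, and base R `save()` writes to a temporary folder so that the example runs outside the repository. Inside the repository, `usethis::use_data()` would normally take care of writing data/xx.rda.

```r
# Hypothetical data-raw/xx.R: builds a dataset and stores it in a file
# named after the dataset. Names and values are illustrative only.
xx <- data.frame(
  STUDYID = "STUDY01",
  USUBJID = c("STUDY01-001", "STUDY01-002"),
  stringsAsFactors = FALSE
)

# Inside the package repository one would typically call:
#   usethis::use_data(xx, overwrite = TRUE)
# which writes data/xx.rda. Here a temporary folder stands in for data/.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "xx.rda")
save(xx, file = out_file, compress = "bzip2")

# Loading the file into a fresh environment restores an object called `xx`.
check_env <- new.env()
loaded <- load(out_file, envir = check_env)
loaded
```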
Note: The documentation process in {pharmaversesdtm} is automated for consistency and ease of maintenance.
Centralized Metadata (inst/extdata/sdtms-specs.json)
{pharmaversesdtm} uses a single JSON file to store metadata for all SDTM datasets.
This file contains information such as:
- dataset name
- dataset label
- dataset description
- author
- source
- therapeutic area
- any other dataset-specific metadata.
This metadata drives the automated documentation process, and the file is read by data-raw/create_sdtms_data.R to help generate:
- Documentation .R files in R/
- .Rd files in man/
- Test Name/Test Code table inclusion (when present)
- Dataset grouping by Therapeutic Area.
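To give a feel for how metadata of this kind can drive grouping by therapeutic area, here is a small base R sketch. The field names (`name`, `therapeutic_area`) and the entries are assumptions made for illustration; the actual structure of sdtms-specs.json and the logic in data-raw/create_sdtms_data.R may differ.

```r
# Hypothetical metadata entries; field names and values are assumptions
# and may not match inst/extdata/sdtms-specs.json.
specs <- list(
  list(name = "dm",        therapeutic_area = "General"),
  list(name = "vs",        therapeutic_area = "General"),
  list(name = "oe_ophtha", therapeutic_area = "Ophthalmology"),
  list(name = "rs_onco",   therapeutic_area = "Oncology")
)

# Pull out one field per entry, then group dataset names by area.
dataset_names <- vapply(specs, function(x) x$name, character(1))
areas <- vapply(specs, function(x) x$therapeutic_area, character(1))
by_area <- split(dataset_names, areas)

by_area
```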
Adding New SDTM Datasets
- Create a program in the data-raw/ folder, named <name>.R, where <name> should follow the naming convention, to generate the test data and output <name>.rda to the data/ folder.
  - Use CDISC pilot data such as dm as input in this program in order to create realistic synthetic data that remains consistent with other domains (not mandatory).
  - Note that no personal data should be used as part of this package, even if anonymized.
- Run the program.
- Update inst/extdata/sdtms-specs.json with the new dataset metadata, including:
  - Assigning the dataset label, description, author, source, purpose, or structure.
  - Assigning or updating the dataset therapeutic area (used for reference-page grouping).
- Run data-raw/create_sdtms_data.R in order to update NAMESPACE and update the .Rd files in man/.
- Add your GitHub handle to .github/CODEOWNERS.
- Update NEWS.md.
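The first step, deriving synthetic records from an existing domain so that subjects stay consistent, might be sketched as follows. Everything here is illustrative: the toy `dm` stands in for the CDISC pilot data, and the domain `XX`, the dataset name `xx_ta`, and the variables `XXSEQ` and `XXSTRESN` are hypothetical.

```r
# Toy stand-in for the CDISC pilot `dm` dataset (values are made up).
dm <- data.frame(
  STUDYID = "STUDY01",
  USUBJID = sprintf("STUDY01-%03d", 1:3),
  stringsAsFactors = FALSE
)

set.seed(123) # keep the synthetic values reproducible

# Two visits per subject, reusing the subject identifiers from `dm`
# so that the new dataset stays consistent with other domains.
visits <- c("BASELINE", "WEEK 4")
n_visits <- length(visits)

xx_ta <- data.frame(
  STUDYID  = rep(dm$STUDYID, each = n_visits),
  DOMAIN   = "XX",
  USUBJID  = rep(dm$USUBJID, each = n_visits),
  XXSEQ    = rep(seq_len(n_visits), times = nrow(dm)),
  VISIT    = rep(visits, times = nrow(dm)),
  XXSTRESN = round(runif(nrow(dm) * n_visits, min = 10, max = 20), 1),
  stringsAsFactors = FALSE
)

head(xx_ta)
```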
Updating Existing SDTM Datasets
- Locate the existing program <name>.R in the data-raw/ folder, update it accordingly.
- Update the corresponding entry in inst/extdata/sdtms-specs.json to reflect the changes, including:
  - Changing the dataset label, description, author, or source.
  - Modifying the dataset purpose or structure.
  - Updating the dataset therapeutic area.
  - Removing a dataset (delete its entry from the JSON entirely).
- Run the program, and output updated <name>.rda to the data/ folder.
- Run data-raw/create_sdtms_data.R in order to update NAMESPACE and update the .Rd files in man/.
- Add your GitHub handle to .github/CODEOWNERS.
- Update NEWS.md.
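When extending an existing dataset, it can help to compare the regenerated version with the previous one before writing the NEWS.md entry. The sketch below is one possible check in base R and is not part of the package workflow; the data frames and the variable `XXLAT` are hypothetical.

```r
# Hypothetical previous and updated versions of a dataset.
old <- data.frame(
  USUBJID = c("STUDY01-001", "STUDY01-002"),
  XXSEQ   = c(1L, 1L),
  stringsAsFactors = FALSE
)
new <- old
new$XXLAT <- c("LEFT", "RIGHT") # newly added variable

# Summarize what changed between the two versions.
added_vars   <- setdiff(names(new), names(old))
removed_vars <- setdiff(names(old), names(new))
record_diff  <- nrow(new) - nrow(old)

added_vars
removed_vars
record_diff
```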
Acknowledgments
Along with the authors and contributors, thanks to the following people for their work on the package:
G Gayatri, Pooja Kumari, Sadchla Mascary, Kangjie Zhang and Zelos Zhu.