
phylosamp

This repository provides code for the phylosamp R package, which was designed to help users conduct and evaluate sample size calculations for phylogenetic studies. Presently, the functions can be used to calculate sample size in three types of scenarios that frequently arise when analyzing pathogen genomic data: (1) trying to determine if pathogen infections are linked by transmission (linkage scenario); (2) trying to estimate the frequency of a known pathogen lineage or variant of concern (variant tracking scenario); (3) trying to determine if pathogen transmissibility differs between groups of infected hosts (relative R scenario).

All key functions of each scenario are documented, along with realistic examples, in the associated vignettes. Vignettes are organized as follows:

  • Transmission Linkage Vignettes (L1-L4): linkage scenario vignettes and examples
  • Variant Tracking Vignettes (V1-V6): variant tracking scenario vignettes and examples
  • Differential Transmission Vignettes: relative R scenario vignettes coming soon!

Determining linkage between pathogen infections

The package includes a suite of functions that can be used to determine how well a phylogenetic criterion (e.g., genetic distance) correctly identifies pairs of pathogen infections linked by transmission, given a particular sample size or proportion. The package also includes functions that do the reverse: calculate the sample size needed to correctly identify true pairs at a specified rate.
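As a rough illustration of what "correctly identifying pairs" means quantitatively, the true discovery rate can be viewed as a positive predictive value over candidate pairs. The sketch below is not the package's formula: it assumes a fixed, known number of truly linked and unlinked pairs in the sample and ignores the transmission and linkage assumptions the package accounts for. The helper name `tdr_sketch` and its arguments are hypothetical.

```r
# Hypothetical helper (not part of phylosamp): true discovery rate as a
# positive predictive value over candidate pairs in a sample.
tdr_sketch <- function(sensitivity, specificity, n_linked, n_unlinked) {
  true_pos  <- sensitivity * n_linked          # linked pairs correctly called linked
  false_pos <- (1 - specificity) * n_unlinked  # unlinked pairs wrongly called linked
  true_pos / (true_pos + false_pos)
}

# Assumed example values: 50 truly linked pairs among 1050 candidate pairs
tdr <- tdr_sketch(sensitivity = 0.99, specificity = 0.95,
                  n_linked = 50, n_unlinked = 1000)
fdr <- 1 - tdr  # false discovery rate is the complement
```

Even with high specificity, the false discovery rate can be large when unlinked pairs greatly outnumber linked pairs, which is why the sampled proportion matters.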

The functions that calculate the sample size or false discovery rate require as input an estimate of the sensitivity and specificity of the linkage criterion used. The current implementation of the package therefore also includes functions to estimate the sensitivity and specificity of genetic distance as a linkage criterion from the mutation rate of the pathogen. Future implementations may contain guidance on how to estimate these parameters for other phylogenetic criteria.
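To make the idea of a distance-based criterion concrete, here is a minimal sketch of how sensitivity and specificity depend on a genetic distance cutoff. It assumes, purely for illustration, that the number of mutations separating linked pairs and unlinked pairs each follows a Poisson distribution with a known mean; the package derives its distance distributions differently (see the vignettes). The helper name and the means used are hypothetical.

```r
# Hypothetical helper (not part of phylosamp): a pair is called "linked" when
# its genetic distance is at or below `cutoff` mutations.
# Assumption: distances are Poisson-distributed in both groups.
sens_spec_sketch <- function(cutoff, mean_linked, mean_unlinked) {
  sensitivity <- ppois(cutoff, lambda = mean_linked)        # P(d <= cutoff | linked)
  specificity <- 1 - ppois(cutoff, lambda = mean_unlinked)  # P(d >  cutoff | unlinked)
  c(sensitivity = sensitivity, specificity = specificity)
}

# Sweep cutoffs to trace out an ROC-style trade-off (assumed means: 1 and 8)
cutoffs <- 0:10
roc <- t(sapply(cutoffs, sens_spec_sketch, mean_linked = 1, mean_unlinked = 8))
roc <- cbind(cutoff = cutoffs, roc)
```

Raising the cutoff increases sensitivity and lowers specificity, so the choice of threshold is a trade-off between the two.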

All functions require the user to specify the underlying assumptions about transmission and linkage, i.e., whether an infected individual can transmit to more than one susceptible individual (single transmission/multiple transmissions), and whether the criterion being used is capable of linking a case to more than one other case (single linkage/multiple linkage). Permitting multiple transmissions and multiple links ('mtml') is the default.

A detailed description of the linkage methods can be found in:

Sample Size Calculation for Phylogenetic Case Linkage (Wohl, Giles, and Lessler 2020)

Determining the frequency of a pathogen variant

The package includes another set of functions that can be used to determine the sample size needed to detect or estimate the frequency of a pathogen variant in a population. It also includes functions that do the reverse: calculate the confidence in a detection or frequency estimate, given a number of samples.
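The detection calculation and its reverse can be sketched with the standard "at least one success" argument. This is a simplified illustration, not the package's implementation: it assumes independent samples and an observed variant prevalence that is already known, and the helper names are hypothetical.

```r
# Hypothetical helpers (not part of phylosamp).
# Probability that at least one of n independent samples is the variant.
detect_prob_sketch <- function(n, prevalence) {
  1 - (1 - prevalence)^n
}

# Smallest n such that the detection probability reaches `prob`.
detect_samplesize_sketch <- function(prob, prevalence) {
  stopifnot(prob > 0, prob < 1, prevalence > 0, prevalence < 1)
  ceiling(log(1 - prob) / log(1 - prevalence))
}

# Assumed example: 95% probability of detecting a variant at 2% prevalence
n_detect <- detect_samplesize_sketch(prob = 0.95, prevalence = 0.02)
achieved <- detect_prob_sketch(n_detect, prevalence = 0.02)
```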

These functions require the user to specify a desired confidence in the results (either probability of detection or confidence in prevalence estimate), a desired or estimated variant prevalence, and (if applicable) a desired precision in the prevalence estimate. The user can also provide variant-specific parameters to help account for biases in variant detection, such as the probability that an infection is asymptomatic, the testing sensitivity, the testing probability, and the sequencing success rate.
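For prevalence estimation, the relationship between confidence, prevalence, and precision can be sketched with the textbook normal approximation to a binomial proportion. This is an assumption-laden illustration rather than the package's calculation: precision is expressed here as an absolute half-width of the confidence interval, and the helper name and arguments are hypothetical.

```r
# Hypothetical helper (not part of phylosamp): sample size so that a
# normal-approximation confidence interval for a proportion has the
# requested absolute half-width.
prev_samplesize_sketch <- function(conf, prevalence, half_width) {
  z <- qnorm(1 - (1 - conf) / 2)  # two-sided critical value
  ceiling(z^2 * prevalence * (1 - prevalence) / half_width^2)
}

# Assumed example: 95% confidence, 10% prevalence, +/- 2.5 percentage points
n_prev <- prev_samplesize_sketch(conf = 0.95, prevalence = 0.10,
                                 half_width = 0.025)
```

Halving the half-width roughly quadruples the required sample size, since it enters the formula squared.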

Functions are provided for sample size calculations in a cross-sectional scenario, where a single batch of samples will be collected and sequenced, and in a periodic surveillance scenario, where samples are collected and sequenced repeatedly at some regular interval. In the latter case, the user must also provide information on how the variant frequency may be changing over time (in the form of an initial frequency and estimated logistic growth rate).
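The periodic surveillance case can be sketched by combining a logistic growth curve with the per-batch detection probability. The version below is a discrete-time simplification written for illustration (the package's own treatment may differ); it assumes one batch of `n_per_step` independent samples at each of time steps 1 through `t_max`, and all names are hypothetical.

```r
# Hypothetical helpers (not part of phylosamp).
# Variant frequency at time t under logistic growth from initial frequency p0.
logistic_freq_sketch <- function(t, p0, r) {
  1 / (1 + ((1 - p0) / p0) * exp(-r * t))
}

# Probability of at least one variant detection by time t_max, sampling
# n_per_step independent sequences at each time step 1..t_max.
periodic_detect_prob_sketch <- function(n_per_step, t_max, p0, r) {
  freqs <- logistic_freq_sketch(seq_len(t_max), p0, r)
  1 - prod((1 - freqs)^n_per_step)
}

# Assumed example: initial frequency 1 in 10,000, growth rate 0.1 per step
p_by_step <- sapply(1:60, periodic_detect_prob_sketch,
                    n_per_step = 10, p0 = 1e-4, r = 0.1)
```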

All calculations assume a two-variant system; in other words, that there is a particular variant of interest that may behave differently from the rest of the pathogen population (in terms of asymptomatic rate, testing sensitivity, etc.). Detection biases due to these differences are incorporated into function calculations. That said, the framework could be easily extended to a multi-variant system in the future.
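In a two-variant system, the effect of detection bias on observed prevalence can be sketched as a reweighting of the two groups. The example assumes that each infection with the variant of interest is `detect_ratio` times as likely to end up sequenced as an infection with any other variant; `detect_ratio` and the helper name are hypothetical labels, not package arguments.

```r
# Hypothetical helper (not part of phylosamp): observed prevalence of the
# variant of interest when its infections are detected `detect_ratio` times
# as often as infections with the rest of the pathogen population.
obs_prevalence_sketch <- function(prevalence, detect_ratio) {
  weighted <- prevalence * detect_ratio
  weighted / (weighted + (1 - prevalence))
}

# Assumed example: true prevalence 10%, variant detected twice as readily
p_obs <- obs_prevalence_sketch(prevalence = 0.10, detect_ratio = 2)
bias  <- p_obs / 0.10  # multiplicative bias (observed / actual)
```

A ratio of 1 leaves the prevalence unchanged; ratios above 1 inflate the observed prevalence and ratios below 1 deflate it.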

A detailed description of the variant of concern (VOC) estimation methods can be found in:

Sample Size Calculations for Variant Surveillance in the Presence of Biological and Systematic Biases (Wohl, Lee, DiPrete, and Lessler 2022)

Estimating differential transmission between groups

Finally, the package includes an additional set of functions that can be used to determine the sample size needed to detect differential transmission between groups of potential pathogen hosts. It also includes functions that do the reverse: calculate the power given parameters such as the effect size and the number of samples.

We assume the user is interested in detecting differential transmission between two groups of individuals, denoted as A and B, which can optionally be of different sizes. The main function allows the user to specify the estimated reproductive number in group A, the estimated reproductive number in group B, the proportion of the infected population that is in group A (equivalent to 1 minus the proportion in group B), the total size of the outbreak, the desired type I error rate, and the desired power. These parameters are then used to calculate the sample size needed to detect whether the reproductive number differs between groups A and B.

The user can specify whether they are interested in a one-sided or two-sided hypothesis test and whether there is any linkage misclassification, via sensitivity and specificity parameters. Additionally, the user can optionally specify an overdispersion parameter if overdispersion is suspected in the transmission process.
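To show how these inputs interact, here is a textbook normal-approximation sample size for comparing mean secondary-case counts between two groups. It is explicitly not the derivation used by the package or the paper: it ignores the total outbreak size and linkage misclassification, assumes offspring counts are Poisson (or negative binomial when a dispersion parameter `k` is supplied), and assumes the sample is split between groups in proportion `p_a`. All names are hypothetical.

```r
# Hypothetical helper (not part of phylosamp): total number of sampled
# infectors needed to detect R_a != R_b with a z-test on mean offspring counts.
relR_samplesize_sketch <- function(R_a, R_b, p_a, alpha = 0.05, power = 0.8,
                                   two_sided = TRUE, k = Inf) {
  tail_area <- if (two_sided) alpha / 2 else alpha
  z_alpha <- qnorm(1 - tail_area)
  z_beta  <- qnorm(power)
  # Offspring variance: Poisson when k = Inf, negative binomial otherwise
  var_a <- R_a + R_a^2 / k
  var_b <- R_b + R_b^2 / k
  m <- (z_alpha + z_beta)^2 * (var_a / p_a + var_b / (1 - p_a)) / (R_a - R_b)^2
  ceiling(m)
}

# Assumed example values
m_two_sided <- relR_samplesize_sketch(R_a = 2, R_b = 1, p_a = 0.5)
m_one_sided <- relR_samplesize_sketch(R_a = 2, R_b = 1, p_a = 0.5,
                                      two_sided = FALSE)
m_overdisp  <- relR_samplesize_sketch(R_a = 2, R_b = 1, p_a = 0.5, k = 0.5)
```

In this sketch a one-sided test needs fewer samples than a two-sided one, and overdispersion (small `k`) increases the required sample size.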

A detailed description of the differential transmission methods can be found in:

Power and Sample Size Calculations for Testing the Ratio of Reproductive Values in Phylogenetic Samples (D'Agostino McGowan, Wohl, and Lessler 2023)

Installation

The phylosamp package is available for download on CRAN.

To install the development version of the phylosamp package, first install the devtools package and then install phylosamp from source via GitHub:

install.packages('devtools')
devtools::install_github('HopkinsIDD/phylosamp')

Troubleshooting

This package is maintained by Elizabeth Lee (@eclee25), Shirlee Wohl (@shwohl), and Justin Lessler (@jlessler).

For general questions, contact Shirlee Wohl (swohl@scripps.edu) or Justin Lessler (jlessler@unc.edu).

To report bugs or problems with documentation, please go to the Issues page of the GitHub repository and click "New issue".

Package details

  • Version: 1.0.1
  • License: GPL-2
  • Maintainer: Justin Lessler
  • Last published: May 23rd, 2023
  • Monthly downloads: 267
  • Install from CRAN: install.packages('phylosamp')

Functions in phylosamp (1.0.1)

  • prob_trans_mtsl: Probability of transmission assuming multiple-transmission and single-linkage
  • optim_roc_threshold: Find optimal ROC threshold
  • relR_power: Calculate power for detecting differential transmission given a sample size
  • obs_pairs_stsl: Expected number of observed pairs assuming single-transmission and single-linkage
  • relR_power_simulated: Simulate power for detecting differential transmission
  • prob_trans_stsl: Probability of transmission assuming single-transmission and single-linkage
  • relR_samplesize_basic: Calculate simple derived sample size for detecting differential transmission
  • prob_trans_mtml: Probability of transmission assuming multiple-transmission and multiple-linkage
  • phylosamp-package: phylosamp: Sample Size Calculations for Molecular and Phylogenetic Studies
  • relR_samplesize_solve: Calculate optimal sample size for detecting differential transmission with imperfect specificity
  • relR_samplesize_ci: Calculate sample size for detecting differential transmission with uncertainty bounds
  • samplesize: Calculate sample size
  • sens_spec_calc: Calculate sensitivity and specificity
  • translink_expected_links_obs: Calculate expected number of transmission links in a sample
  • translink_expected_links_obs_mtml: Calculate expected number of observed pairs assuming multiple-transmission and multiple-linkage
  • translink_prob_transmit_mtsl: Calculate probability of transmission assuming multiple-transmission and single-linkage
  • translink_prob_transmit_mtml: Calculate probability of transmission assuming multiple-transmission and multiple-linkage
  • relR_samplesize_linkerr: Calculate sample size for detecting differential transmission correcting for sensitivity and specificity
  • translink_expected_links_true: Calculate expected number of true transmission pairs
  • translink_expected_links_true_mtml: Calculate expected number of true transmission pairs assuming multiple-transmission and multiple-linkage
  • translink_expected_links_true_mtsl: Calculate expected number of true transmission pairs assuming multiple-transmission and single-linkage
  • translink_expected_links_true_stsl: Calculate expected number of true transmission pairs assuming single-transmission and single-linkage
  • sens_spec_roc: Make ROC from sensitivity and specificity
  • true_pairs_mtml: Expected number of true transmission pairs assuming multiple-transmission and multiple-linkage
  • true_pairs_mtsl: Expected number of true transmission pairs assuming multiple-transmission and single-linkage
  • varfreq_expected_mbias: Calculate multiplicative bias (observed / actual) in variant prevalence
  • varfreq_cdf_logistic: Calculate cumulative observed variant prevalence at time t given logistic growth
  • translink_prob_transmit_stsl: Calculate probability of transmission assuming single-transmission and single-linkage
  • translink_samplesize: Calculate sample size needed to identify true transmission links
  • translink_fdr: Calculate false discovery rate of identifying transmission pairs in a sample
  • translink_prob_transmit: Calculate probability of transmission
  • vartrack_prob_detect_cont: Calculate probability of detecting a variant given a per-timestep sample size assuming periodic sampling
  • vartrack_samplesize_detect_cont: Calculate sample size needed for variant detection assuming periodic sampling
  • true_pairs: Calculate expected number of true transmission pairs
  • translink_tdr: Calculate true discovery rate of identifying transmission pairs
  • vartrack_prob_detect_xsect: Calculate probability of detecting a variant assuming cross-sectional sampling
  • vartrack_samplesize_detect: Calculate sample size needed for variant detection given a desired probability of detection
  • varfreq_obs_freq: Calculate observed variant prevalence
  • varfreq_freq_logistic: Calculate observed variant prevalence at time t given logistic growth
  • vartrack_cod_ratio: Calculate the coefficient of detection ratio for two variants
  • vartrack_prob_detect: Calculate the probability of detecting a variant given a sample size
  • vartrack_samplesize_prev_xsect: Calculate sample size needed for variant prevalence estimation under cross-sectional sampling
  • relR_samplesize_opterr: Function to calculate the error in estimated sample size for use in optimize function
  • vartrack_samplesize_prev: Calculate sample size needed for estimating variant prevalence given a desired confidence
  • vartrack_samplesize_detect_xsect: Calculate sample size needed for variant detection assuming cross-sectional sampling
  • vartrack_prob_prev_xsect: Calculate confidence in a variant estimate assuming cross-sectional sampling
  • translink_expected_links_obs_stsl: Calculate expected number of observed pairs assuming single-transmission and single-linkage
  • true_pairs_stsl: Expected number of true transmission pairs assuming single-transmission and single-linkage
  • vartrack_prob_prev: Calculate confidence in a variant estimate given a sample size
  • relR_samplesize_simsolve: Calculate optimized sample size for detecting differential transmission
  • translink_expected_links_obs_mtsl: Calculate expected number of observed pairs assuming multiple-transmission and single-linkage
  • truediscoveryrate: Calculate true discovery rate of a sample
  • gendist_sensspec_cutoff: Calculate sensitivity and specificity of a genetic distance cutoff
  • genDistSim: Simulations of the genetic distance distribution
  • get_optim_roc: Find optimal ROC threshold
  • gendist_distribution: Calculate genetic distance distribution
  • falsediscoveryrate: Calculate false discovery rate of a sample
  • gen_dists: Calculate genetic distance distribution
  • exp_links: Calculate expected number of links in a sample
  • obs_pairs_mtml: Expected number of observed pairs assuming multiple-transmission and multiple-linkage
  • gendist_roc_format: Make ROC curve from sensitivity and specificity
  • obs_pairs_mtsl: Expected number of observed pairs assuming multiple-transmission and single-linkage
  • relR_samplesize: Calculate sample size needed to detect differential transmission