
Tandem Repeat Analysis from Capillary Electrophoresis (trace)

This package provides a pipeline for short tandem repeat instability analysis from capillary electrophoresis fragment analysis data (e.g., fsa files). The main function processes the data, and repeat instability metrics (such as expansion index or average repeat gain) are then calculated.

This code is not for clinical use. There are features for accurate repeat sizing if you use validated control samples, but integer repeat units are not returned.

To report bugs or request features, please visit the GitHub issue tracker here. For assistance or any other inquiries, contact Zach McLean.

If you use this package, please cite this paper for now.

How to use the package

You can either use the code described below or use our Shiny app, which provides an interactive, non-coding version.

In this package, each sample is represented by an R6 ‘fragments’ object, and these objects are organized in lists. You usually don’t need to interact with the objects directly. If you do, the attributes of the objects can be accessed with $.

There are several important factors to a successful repeat instability experiment and things to consider when using this package:

  • (required) Each sample has a unique id, usually the file name

  • (optional) Baseline control for your experiment. For example, you can specify a sample where the modal allele is the inherited repeat length (e.g., a mouse tail sample) or a sample at the start of a time-course experiment. This is indicated with a TRUE in the metrics_baseline_control column of the metadata. Samples are then grouped together with the metrics_group_id column of the metadata. Multiple samples in a group can be marked as metrics_baseline_control, which can help the average repeat gain metric by giving a more accurate representation of the average repeat length at the start of the experiment.

  • (optional) Batch or repeat length correction for systematic batch effects that occur with repeat-containing amplicons in capillary electrophoresis.

    • Repeat-containing amplicons do not run linearly with internal ladder sizes in capillary electrophoresis, resulting in an underestimation of repeat length if you simply convert from base-pair size. These differences are not always consistent across runs, which can result in batch effects in the repeat size. So, if the repeat length is to be directly compared for samples from different runs, this batch effect needs to be corrected. This is only relevant when the absolute sizes of amplicons are compared for grouping metrics as described above (otherwise instability metrics are all relative and systematic batch effects across runs do not matter), when plotting traces from different runs, or if an accurate repeat length is desired.

    • There are two main correction approaches that are somewhat related: either ‘batch’ or ‘repeat’ in call_repeats(). Batch correction is relatively simple and just requires you to link samples across batches by indicating them in the metadata. But even though the repeat size that is returned will be precise, it will not be accurate and will underestimate the real repeat length. By contrast, repeat correction can be used to accurately call repeat lengths (which also corrects the batch effects). However, the repeat correction will only be as good as the sample(s) used to call the repeat length, so this can be a challenging and advanced feature. You need to use a sample that reliably returns the same peak as the modal peak, or you need to be willing to understand the shape of the distribution and manually validate the repeat length of each control sample for each run.

  • If starting from fsa files, ladder assignment for the GeneScan™ 1200 LIZ™ dye Size Standard may not work very well because of how the ladder assignment algorithm works. It is optimized for scenarios where all peaks of the ladder are resolved, which is usually the case for GeneScan™ 500 LIZ™ or GeneScan™ 600 LIZ™. To work in this package, ladders like 1200 LIZ™ need to be run on the instrument in such a way that all of the peaks are resolved; otherwise the peaks blend together at the end. However, these ladders can be fixed by adjusting the various parameters (or supplying a truncated version of the GeneScan™ 1200 LIZ™), or manually with the built-in fix_ladders_interactive() app.
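The idea behind batch correction described above can be illustrated with a toy calculation in base R. This is only a sketch under stated assumptions, not the algorithm used by call_repeats(): it assumes a made-up table of modal repeat lengths in which one sample (batch_sample_id "ctrl") was run in both runs, and it shifts each run so that the shared sample lands on its cross-run median. The column modal_repeat and all values are hypothetical.

```r
# Toy data: the same control sample ("ctrl") was run in two batches.
# All values are made up for illustration.
peaks <- data.frame(
  unique_id = c("run1_ctrl", "run1_s1", "run2_ctrl", "run2_s2"),
  batch_run_id = c("run1", "run1", "run2", "run2"),
  batch_sample_id = c("ctrl", NA, "ctrl", NA),
  modal_repeat = c(118.2, 120.1, 119.0, 125.4),
  stringsAsFactors = FALSE
)

# Size of the shared control sample in each run
is_ctrl <- !is.na(peaks$batch_sample_id) & peaks$batch_sample_id == "ctrl"
ctrl_by_run <- setNames(peaks$modal_repeat[is_ctrl], peaks$batch_run_id[is_ctrl])

# Offset of each run from the median of the control across runs
run_offset <- ctrl_by_run - median(ctrl_by_run)

# Subtract the run-specific offset from every sample in that run
peaks$corrected_repeat <- peaks$modal_repeat - unname(run_offset[peaks$batch_run_id])
```

After the shift, the control sample has the same size in both runs, so the other samples can be compared across runs. The corrected values are precise relative to each other, but they are not validated repeat lengths.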

Installation

Install the package from CRAN:

install.packages("trace")

Import data

Load the package:

library(trace)

First, we read in the raw data. In this case we will use example data included in this package, but usually this would be fsa files that are read in using read_fsa(). The example data is also cloned since the next step modifies the object in place.

fsa_list <- lapply(cell_line_fsa_list, function(x) x$clone())

Alternatively, this is where you would use data exported from GeneMapper if you would rather use the GeneMapper bp sizing and peak identification algorithms.

fragments_list_genemapper <- genemapper_table_to_fragments(example_data,
  dye_channel = "B",
  min_size_bp = 300
)

Process Samples with trace

The trace() function streamlines the processing of fragment analysis data, from ladder assignment to repeat calling. Below is an overview of the key steps that are within trace():

1. Add Metadata

Metadata is used to enable advanced functionality, such as batch correction, repeat correction, and index peak assignment. Prepare a .csv file with the following columns:

| Column Name | Purpose | Description |
|---|---|---|
| unique_id | Required to link up the metadata file with samples | Unique identifier for each sample (e.g., file name). Must be unique across all runs. |
| metrics_group_id | Group samples for instability metrics (e.g., expansion index) | Group ID for samples sharing a common baseline (e.g., mouse ID or experiment group). |
| metrics_baseline_control | Identify baseline samples (e.g., inherited repeat length or day 0) | Set to TRUE for baseline control samples (e.g., mouse tail or starting time point). |
| batch_run_id | Group samples by run for batch or repeat correction | Identifier for each fragment analysis run (e.g., date). |
| batch_sample_id | Link samples across runs for batch or repeat correction | Unique ID for each sample across runs. |
| batch_sample_modal_repeat | Specify validated repeat lengths for repeat correction | Validated modal repeat length for samples used in repeat correction. |
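As a rough illustration, a metadata file with these columns could be put together in base R as sketched below. The sample names and values are invented. Leaving the unused batch columns as NA is an assumption made for this example, so check the package documentation for how empty optional columns should be supplied.

```r
# Build a small metadata table with the columns described above.
# Sample names and values are made up for illustration.
metadata_example <- data.frame(
  unique_id = c("mouse1_tail.fsa", "mouse1_striatum.fsa", "mouse1_liver.fsa"),
  metrics_group_id = c("mouse1", "mouse1", "mouse1"),
  metrics_baseline_control = c(TRUE, FALSE, FALSE),
  batch_run_id = c("2024-01-15", "2024-01-15", "2024-01-15"),
  batch_sample_id = c(NA, NA, NA),           # assumption: unused, left as NA
  batch_sample_modal_repeat = c(NA, NA, NA), # assumption: unused, left as NA
  stringsAsFactors = FALSE
)

# unique_id is the key that links metadata to samples, so it must be unique
stopifnot(anyDuplicated(metadata_example$unique_id) == 0)

# Write the table to a .csv file and read it back in
metadata_path <- file.path(tempdir(), "metadata_example.csv")
write.csv(metadata_example, metadata_path, row.names = FALSE)
metadata_from_csv <- read.csv(metadata_path)
```

Here the tail sample is flagged as the baseline control for the group "mouse1", so the other two tissues from that mouse would be compared against it.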

2. Assign Ladders

Ladder peaks are identified in the ladder channel, and base pair (bp) sizes are predicted for each scan.

  • Visual Inspection: Always inspect ladders to ensure correct assignment.
  • Manual Adjustment: If needed, use fix_ladders_interactive() to adjust ladder assignments interactively.
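To make the idea of predicting a bp size for each scan concrete, here is a toy sketch in base R. It uses simple linear interpolation between ladder peaks with stats::approx(), which is a stand-in chosen for illustration and not necessarily the model the package fits. The scan positions and ladder sizes are made up.

```r
# Scan positions where ladder peaks were detected (made-up values)
ladder_scans <- c(1500, 1900, 2300, 2700, 3100)
# Known sizes (bp) of the corresponding ladder fragments
ladder_sizes <- c(100, 200, 300, 400, 500)

# Predict a bp size for every scan between the first and last ladder peak
# by linear interpolation between neighbouring ladder peaks
scans <- 1500:3100
predicted_bp <- approx(x = ladder_scans, y = ladder_sizes, xout = scans)$y

# Size at the scan halfway between the 200 bp and 300 bp ladder peaks
size_at_2100 <- predicted_bp[scans == 2100]
```

If a ladder peak is assigned to the wrong scan, every size interpolated near it shifts as well, which is why inspecting the ladders matters.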

3. Find Fragments

Fragment peaks are identified in the raw trace data. This step transitions the data from a continuous trace to a peak-based representation.
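The move from a continuous trace to a peak table can be sketched with a simple local-maximum search in base R. This is a toy illustration on simulated data, and the function find_peaks_toy() and its min_height argument are hypothetical; the package's own peak finding is likely more sophisticated.

```r
# Simulated trace: two Gaussian peaks on a flat baseline (made-up data)
scan <- 1:200
signal <- 1000 * exp(-(scan - 60)^2 / 20) + 400 * exp(-(scan - 140)^2 / 20)

# A scan is called a peak if its signal is higher than both neighbours
# and at least min_height
find_peaks_toy <- function(signal, min_height = 50) {
  n <- length(signal)
  middle <- 2:(n - 1)
  is_peak <- signal[middle] > signal[middle - 1] &
    signal[middle] > signal[middle + 1] &
    signal[middle] >= min_height
  middle[is_peak]
}

# Peak-based representation: one row per detected peak
peak_scans <- find_peaks_toy(signal)
peak_table <- data.frame(scan = scan[peak_scans], signal = signal[peak_scans])
```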

4. Identify Alleles

  • Identify the main allele (modal peak) for each sample. Samples with two alleles can be handled, but metrics are only calculated for the larger of the two.
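In its simplest form, finding the modal peak means picking the tallest peak in a sample's peak table. The base R sketch below shows only that single-allele case on made-up values; it does not attempt the two-allele handling mentioned above.

```r
# Peak table for one sample (made-up values)
peaks <- data.frame(
  size = c(400.1, 403.0, 406.1, 409.0, 412.1),
  signal = c(150, 620, 1480, 900, 310)
)

# The modal peak is the row with the highest signal
modal_peak <- peaks[which.max(peaks$signal), ]
```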

5. Call Repeats

  • Base pair sizes are converted to repeat lengths. This step has a lot of additional functionality including:
    • Batch correction or accurate repeat sizing.
    • Forcing whole repeat units between peaks (usually underestimated in fragment analysis data)
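The basic conversion from bp size to repeat length is simple arithmetic, sketched below in base R. The function bp_to_repeats() and the argument names flank_bp (the non-repeat part of the amplicon) and repeat_unit_bp are hypothetical names chosen for this example, and the sizes are made up. No correction is applied here.

```r
# Convert bp size to repeat length: remove the non-repeat flanking
# sequence, then divide by the size of one repeat unit.
bp_to_repeats <- function(size_bp, flank_bp, repeat_unit_bp = 3) {
  (size_bp - flank_bp) / repeat_unit_bp
}

# Three neighbouring peaks (made-up sizes) for a trinucleotide repeat
peak_sizes_bp <- c(441.2, 444.1, 447.0)
repeats <- bp_to_repeats(peak_sizes_bp, flank_bp = 87)

# Spacing between neighbouring peaks, in repeat units
peak_spacing <- diff(repeats)
```

In this toy data the peaks are 2.9 bp apart, so the spacing comes out slightly below one repeat unit. That is the kind of underestimated spacing that forcing whole repeat units is meant to address.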

6. Assign Index Peaks

The index peak is the reference repeat length used for instability metrics like expansion index.

  • Grouped Assignment: Set grouped = TRUE to use the modal peak of baseline control samples (e.g., mouse tail or day 0).
  • Manual Override: Use index_override_dataframe to manually assign index peaks if needed.

Main function

To carry out this pipeline, call the main function:


fragments_list <- trace(
  fsa_list, 
  min_bp_size = 300,
  grouped = TRUE, 
  metadata_data.frame = metadata,
  show_progress_bar = FALSE)

We can validate that the index peaks were assigned correctly with a dotted vertical line added to the trace. This is perhaps more useful in the context of mice, where you can visually see where the inherited repeat length should be in the bimodal distribution.

plot_traces(fragments_list[1], xlim = c(110, 150))

Calculate instability metrics

All of the information we need to calculate the repeat instability metrics has been identified. We can use calculate_instability_metrics() to generate a dataframe of per-sample metrics.

metrics_grouped_df <- calculate_instability_metrics(
  fragments_list = fragments_list,
  peak_threshold = 0.05,
  window_around_index_peak = c(-40, 40)
)
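The roles of peak_threshold and window_around_index_peak can be illustrated with a toy calculation in base R. This sketch assumes the threshold is taken relative to the tallest peak and the window is expressed in repeats relative to the index peak. The signal-weighted mean computed at the end is a simplified stand-in and not the package's exact definition of any metric. All values are made up.

```r
# Peaks for one sample, expressed in repeat units (made-up values)
peaks <- data.frame(
  repeats = c(118, 119, 120, 121, 122, 123),
  signal = c(40, 300, 1000, 600, 200, 30)
)
index_repeat <- 120     # reference repeat length for this sample
peak_threshold <- 0.05  # keep peaks at least 5% as tall as the tallest peak
window <- c(-40, 40)    # repeats to keep on either side of the index peak

# Position of each peak relative to the index peak
peaks$repeat_change <- peaks$repeats - index_repeat

# Apply the height threshold and the window around the index peak
keep <- peaks$signal >= peak_threshold * max(peaks$signal) &
  peaks$repeat_change >= window[1] &
  peaks$repeat_change <= window[2]
kept <- peaks[keep, ]

# Signal-weighted mean distance from the index peak
weighted_change <- sum(kept$repeat_change * kept$signal) / sum(kept$signal)
```

Raising the threshold drops more of the small peaks at the edges of the distribution. This makes the summary less sensitive to noise but also less sensitive to rare large expansions.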

These metrics can then be used to quantify repeat instability. For example, this reproduces a subset of Figure 7e of our manuscript.

library(ggplot2)
library(dplyr)

metrics_grouped_df |>
  left_join(metadata, by = join_by(unique_id)) |>
  filter(
    day > 0,
    modal_peak_signal > 500
  ) |>
  ggplot(aes(as.factor(treatment), average_repeat_change)) +
    geom_boxplot() +
    geom_jitter() +
    labs(y = "Average repeat gain",
         x = "Branaplam (nM)") 



Version: 1.0.0

License: MIT + file LICENSE

Maintainer: Zachariah McLean

Last Published: December 22nd, 2025

Functions in trace (1.0.0)

  • plot_repeat_correction_model: Plot Repeat Correction Model
  • plot_data_channels
  • plot_ladders: Plot ladder
  • remove_fragments: Remove Samples from List
  • plot_batch_correction_samples: Plot correction samples
  • plot_traces: Plot sample traces
  • plot_fragments: Plot Peak Data
  • read_fsa: Read fsa file
  • repeat_table_to_fragments: Convert Repeat Table to Repeats Fragments
  • size_table_to_fragments: Convert Size Table to Fragments
  • trace: Main function for sample processing
  • metadata
  • trace-package: Tandem Repeat Analysis by Capillary Electrophoresis
  • extract_fragments: Extract All Fragments
  • extract_alleles: Extract Modal Peaks
  • add_metadata: Add Metadata to Fragments List
  • extract_ladder_summary: Extract ladder summary
  • calculate_instability_metrics: Calculate Repeat Instability Metrics
  • example_data_repeat_table
  • assign_index_peaks: Assign index peaks
  • call_repeats: Call Repeats for Fragments
  • genemapper_table_to_fragments: Convert Genemapper Peak Table to fragments class
  • extract_repeat_correction_summary: Extract repeat correction summary
  • fix_ladders_manual: Fix ladders manually
  • fix_ladders_interactive: Fix ladders interactively
  • fragments: fragments object
  • find_ladders: Ladder and bp sizing
  • extract_trace_table: Extract traces
  • load_config: Load Trace Config File
  • find_fragments: Find fragment peaks
  • find_alleles: Find Alleles
  • example_data
  • cell_line_fsa_list: A list of fsa files