SynDist: SynDist() function

Description

SynDist identifies and quantifies synonymous variation among aligned protein-coding DNA sequences, that is, nucleotide substitutions that do not translate to changes in the amino acid sequences, due to degeneracy of the genetic code.

Usage

SynDist(
  seq_file,
  path_out,
  input_fasta = NULL,
  codon_pos = NULL,
  analysis = "dist"
)

Value

When analysis="dist", the function produces a .csv distance matrix with the number of synonymous substitutions in each pairwise sequence comparison in the upper right triangle of the matrix and the synonymous p-distance in each pairwise sequence comparison in the lower left triangle. If a sequence occurrence table is given as input file, the function additionally produces two tables with the mean number of synonymous substitutions and the mean synonymous p-distance across all pairwise sequence comparisons for each sample in the data set. In that case, the sequences are named in the output matrix by an index number that corresponds to their column number in the input file.

When analysis="codon", the function produces two .csv summary tables: one with the total number of synonymous substitutions per nucleotide position across all pairwise sequence comparisons, and one with the number of synonymous codon variations per codon across all pairwise sequence comparisons. Note that in the codon summary table, the synonymous codon variation does not quantify the number of nucleotide differences between the synonymous codons, since that can be derived from the nucleotide summary table.

Each summary table also contains a column that specifies the proportion of the observed number of synonymous variations (per nucleotide position or codon) out of the number of pairwise sequence comparisons. For example, if three sequences are compared and a synonymous substitution is observed for a given codon once (i.e., between two of the three sequences), that gives a proportion of synonymous observations of one out of three pairwise sequence comparisons for that codon.
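The proportion in the three-sequence example can be checked by hand. The following is a minimal sketch in base R using toy values only; the variable names are hypothetical and nothing here is read from SynDist() output.

```r
# Number of sequences compared (toy value from the example above)
n_seq <- 3

# Number of unordered pairwise comparisons: n * (n - 1) / 2
n_pairs <- choose(n_seq, 2)

# Hypothetical count: the codon differs synonymously in one comparison
n_syn_obs <- 1

# Proportion of pairwise comparisons showing a synonymous variation
prop_syn <- n_syn_obs / n_pairs
```

With three sequences there are three pairwise comparisons, so a single synonymous observation corresponds to a proportion of one third.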

Arguments

seq_file: a sequence occurrence table as output by the 'dada2' pipeline, with samples in rows and nucleotide sequences in columns. Alternatively, a fasta file can be supplied as input, in the format returned by read.fasta() from the package 'seqinr'.
path_out: a user-defined path to the folder where the output files will be saved.
input_fasta: optional; a logical (TRUE/FALSE) that indicates whether the input file is a fasta file (TRUE) or a 'dada2'-style sequence table (NULL or FALSE). The default is NULL.
codon_pos: optional; a vector of integers, e.g. c(1, 2, 3), specifying which codons to include in the analysis. If omitted, all codons are used. Note: with SynDist(), codon_pos should always be specified as codons, i.e. numbered nucleotide triplets in the open reading frame.
analysis: optional; specifies the desired kind of analysis. It takes the value 'dist' for quantification of pairwise synonymous variation between sequences, or 'codon' for quantification of synonymous substitutions per nucleotide or codon position. The default is 'dist'.

Details

The SynDist() function takes a fasta file or a 'dada2'-style sequence occurrence table (with aligned sequences as column names and samples in rows) as input and identifies synonymous variation by pairwise sequence comparisons.
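To make the table layout concrete, here is a small sketch of a 'dada2'-style sequence occurrence table built by hand. The sequences, sample names, and read counts are made up for illustration, and whether SynDist() expects a matrix or a data frame is not stated here, so treat the object type as an assumption.

```r
# Two aligned toy sequences (hypothetical, 9 bases = 3 codons each)
seqs <- c("ATGGCTAAA", "ATGGCCAAA")

# Read counts per sample (rows) and sequence (columns); values are made up
toy_table <- matrix(
  c(10L, 0L,
     5L, 7L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample_1", "sample_2"), seqs)
)

# The aligned sequences are carried as the column names
colnames(toy_table)
```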

SynDist() can do qualitative or quantitative analysis of synonymous variation. If analysis="codon" is specified, the function identifies synonymous nucleotide variation and outputs tables with the number of observations of synonymous nucleotide changes per base and per codon among all pairwise sequence comparisons in the data set. These tables also specify, for each base or codon position, the proportions of the total pairwise comparisons that harbor synonymous substitutions.

If analysis="dist", the function produces a distance matrix specifying the number and proportion (p-distance) of synonymous nucleotide changes in each pairwise sequence comparison in the data set. In the distance matrix, synonymous p-distance is calculated as the number of synonymous nucleotide changes observed in each pairwise sequence comparison divided by the sequence length (number of bases). If a 'dada2'-style sequence occurrence table is provided as input, the SynDist() function furthermore produces two tables with the mean number of synonymous variations and mean synonymous p-distances among all pairwise comparisons of the sequences in each sample in the data set. (Note: The means will be NA for samples that have 0 or 1 sequence(s).)
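The p-distance definition above (synonymous nucleotide changes divided by sequence length) can be illustrated with a short base R sketch. This is not the SynDist() implementation: the helper names split_codons() and syn_p_distance() are hypothetical, and the rule used to count synonymous changes (differing bases within codon pairs that encode the same amino acid) is an assumption made for illustration. Gap handling is left out, so codons containing '-' are simply skipped here.

```r
# Standard genetic code built in base R (bases in T, C, A, G order)
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)
amino <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
  "")[[1]]
genetic_code <- setNames(amino, codons)

# Hypothetical helper: split an in-frame sequence into codons
split_codons <- function(s) {
  s <- toupper(s)
  starts <- seq(1, nchar(s), by = 3)
  substring(s, starts, starts + 2)
}

# Hypothetical helper: synonymous differences and p-distance for one pair
syn_p_distance <- function(seq1, seq2) {
  c1 <- split_codons(seq1)
  c2 <- split_codons(seq2)
  # Codon pairs that differ in nucleotides but encode the same amino acid
  synonymous <- unname(c1 != c2 & genetic_code[c1] == genetic_code[c2])
  # Assumption of this sketch: codons not in the table (e.g. with gaps) are skipped
  synonymous[is.na(synonymous)] <- FALSE
  # Count the differing bases within each synonymous codon pair
  n_syn <- sum(vapply(which(synonymous), function(i) {
    sum(strsplit(c1[i], "")[[1]] != strsplit(c2[i], "")[[1]])
  }, integer(1)))
  # p-distance: synonymous changes divided by sequence length in bases
  list(n_syn = n_syn, p_dist = n_syn / nchar(seq1))
}

res <- syn_p_distance("ATGGCTAAA", "ATGGCCAAG")
```

In this toy pair, GCT/GCC both encode alanine and AAA/AAG both encode lysine, so two synonymous changes are counted over nine bases.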

The SynDist() function includes an option for the user to specify which codons to compare. This is useful e.g. if the sequences contain gaps in some codons, which should be excluded from quantitative analysis.

SynDist() translates the supplied DNA sequences to amino acid sequences using the standard genetic code and sequences must be aligned in open reading frame. The function only accepts the following characters in the sequences: -,a,t,g,c,A,T,G,C
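Before calling the function, it may help to confirm that the input meets these requirements. The sketch below is a hypothetical pre-check, not part of the package: check_sequences() is an illustrative name, and the checks for whole codons and equal lengths are assumptions about what "aligned in open reading frame" implies.

```r
# Hypothetical pre-check for a character vector of aligned sequences
check_sequences <- function(seqs) {
  # Only gaps and unambiguous nucleotides, in upper or lower case
  valid_chars <- grepl("^[-ATGCatgc]+$", seqs)
  # Assumption: an in-frame sequence has a length divisible by three
  whole_codons <- nchar(seqs) %% 3 == 0
  # Assumption: aligned sequences all have the same length
  same_length <- length(unique(nchar(seqs))) == 1
  list(valid_chars = valid_chars,
       whole_codons = whole_codons,
       same_length = same_length)
}

chk <- check_sequences(c("ATGGCT", "atg-ct", "ATGNCT"))
```

Here the third toy sequence contains the ambiguity code N, so its entry in valid_chars is FALSE.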

Nucleotide triplets containing gaps (-) are translated to 'X', similar to stop codons. Please note that '-' are treated as unique characters in p-distance calculations. The function will give warnings if gaps or stop codons are detected. If you wish to exclude stop codons or gaps from distance calculations, please use the codon_pos option to specify which codons to compare.
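One way to build such a codon selection is to scan the alignment for codon positions that contain neither gaps nor stop codons in any sequence. The following base R sketch shows the idea; clean_codon_positions() is a hypothetical helper, not a function of the package, and it assumes the sequences are supplied as a character vector of equal-length, in-frame strings.

```r
# Hypothetical helper: codon positions free of gaps and stop codons
clean_codon_positions <- function(seqs) {
  seqs <- toupper(seqs)
  n_codons <- nchar(seqs[1]) %/% 3
  stop_codons <- c("TAA", "TAG", "TGA")  # standard genetic code
  keep <- vapply(seq_len(n_codons), function(i) {
    # Codon i from every sequence in the alignment
    codon_i <- substring(seqs, 3 * i - 2, 3 * i)
    !any(grepl("-", codon_i, fixed = TRUE)) && !any(codon_i %in% stop_codons)
  }, logical(1))
  which(keep)
}

toy_seqs <- c("ATGGCTAAATAA", "ATG---AAGTAA")
codon_keep <- clean_codon_positions(toy_seqs)
```

In the toy alignment, codon 2 contains a gap and codon 4 is a stop codon, so only codons 1 and 3 are kept. The resulting integer vector could then be supplied as the codon_pos argument.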

If you publish data or results produced with MHCtools, please cite both of the following references:
Roved, J. (2022). MHCtools: Analysis of MHC data in non-model species. CRAN.
Roved, J. (2024). MHCtools 1.5: Analysis of MHC sequencing data in R. In S. Boegel (Ed.), HLA Typing: Methods and Protocols (2nd ed., pp. 275–295). Humana Press. https://doi.org/10.1007/978-1-0716-3874-3_18

Examples

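# Example sequence occurrence table used as input
seq_file <- sequence_table_SynDist
# Write the output files to a temporary folder
path_out <- tempdir()
# Pairwise synonymous distances, restricted to codons 1 to 8
SynDist(seq_file, path_out, input_fasta = NULL,
        codon_pos = c(1, 2, 3, 4, 5, 6, 7, 8), analysis = "dist")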