utr.annotation R package

This package annotates potentially deleterious variants in the UTRs of human and mouse genes. Given a variant file in CSV or VCF format, utr.annotation reports, for each variant, whether and how it disrupts six known translational regulators: upstream open reading frames (uORFs), upstream Kozak sequences, polyA signals, the Kozak sequence at the annotated translation initiation site, the start codon, and the stop codon. It also reports conservation scores at the variant position and whether and how the variant changes ribosome loading, as predicted by a model trained on empirical data.

The package has been tested on Linux and macOS (Intel chip). On Windows, retrieving conservation scores and predicting ribosome loading may not work; you can skip those two features.

Input

Here is an example input file:

Chr	Pos	Ref	Alt
chr1	1308643	CAT	C
chr12	69693767	G	T
chr12	4685084	A	G
chr15	45691267	A	G
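
As a minimal sketch, the example above could be written to a CSV file from R as shown here. The file name `variants_sample.csv` is hypothetical, and the comma-separated layout is an assumption; check the package vignette for the exact input format it expects.

```r
# Build the example variants as a data frame (columns as in the table above)
variants <- data.frame(
  Chr = c("chr1", "chr12", "chr12", "chr15"),
  Pos = c(1308643L, 69693767L, 4685084L, 45691267L),
  Ref = c("CAT", "G", "A", "A"),
  Alt = c("C", "T", "G", "G"),
  stringsAsFactors = FALSE
)

# Hypothetical file name; written to a temporary directory for this sketch
variant_file <- file.path(tempdir(), "variants_sample.csv")
write.csv(variants, variant_file, row.names = FALSE, quote = FALSE)

# Read the file back. Column classes are set explicitly so that an allele
# column containing only "T" is not converted to a logical TRUE.
check <- read.csv(
  variant_file,
  colClasses = c(Chr = "character", Pos = "integer",
                 Ref = "character", Alt = "character")
)
print(check)
```

Setting `colClasses` is a defensive choice: base R's type conversion treats a column made up entirely of "T" or "F" values as logical, which would silently corrupt allele columns.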

Feature implementation

Here is a description of the key terms:

Term	Definition & Implementation
uAUG	An "ATG" sequence in the 5' UTR
# uAUG	Number of uAUGs in the 5' UTR
uKozak	A 7 nt sequence matching [GA]..ATGG in the 5' UTR
# uKozak	Number of uKozak sequences in the 5' UTR
PolyA signal	A 6 nt sequence, AATAAA or ATTAAA
# PolyA signal	Number of polyA signals in the 3' UTR
Stop codon	The stop codon of a transcript is the last three nucleotides of its CDS.
Lost stop codon	The tool annotates whether a variant disrupts the stop codon. If the altered stop codon sequence of a transcript is not "TAG", "TAA", or "TGA", the tool annotates lost_stop_codon as TRUE.
Start codon	The start codon of a transcript is the first three nucleotides of its CDS.
Lost start codon	The tool annotates whether a variant disrupts the start codon. If the altered start codon sequence is not "ATG", the tool annotates lost_start_codon as TRUE.
TSS Kozak	The TSS Kozak sequence of a transcript is defined in this paper as an 8 nt sequence: the last three nucleotides of its 5' UTR followed by the first five nucleotides of its coding sequence.
TSS Kozak score	We empirically defined a Position Weight Matrix (PWM) for an ideal Kozak sequence using the top 20 Kozak sequences described by Sample et al., 2019, taken from the uORFs that recruited ribosomes most efficiently. We then compare the reference and altered TSS Kozak sequences to this PWM to calculate a score for each and to determine whether a variant alters the Kozak score.
MRL	Mean ribosome load. The MRL of a transcript is predicted by a CNN model from the 100 nt of 5' UTR sequence upstream of its start codon. The CNN was trained on a massively parallel reporter assay measuring the impact of 5' UTR sequence on ribosome loading (Sample et al., 2019).
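
The sequence-based definitions in the table above can be illustrated with a short base R sketch. This is not the package's own code: the function names are hypothetical, the toy sequences are made up, counting overlapping motif matches is an assumption, and the log-odds PWM score is one common scoring choice rather than the confirmed method. The four "top" Kozak sequences used to build the PWM are placeholders, not the 20 sequences from Sample et al., 2019.

```r
# Count matches of a regular expression in a sequence, including overlapping
# ones. Wrapping the pattern in a lookahead makes gregexpr() report every
# start position. Counting overlaps is an assumption of this sketch.
count_motif <- function(seq, pattern) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), seq, perl = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

count_uAUG   <- function(utr5) count_motif(utr5, "ATG")
count_uKozak <- function(utr5) count_motif(utr5, "[GA]..ATGG")
count_polyA  <- function(utr3) count_motif(utr3, "AATAAA|ATTAAA")

# Codon checks, following the "Lost start codon" / "Lost stop codon" rows
lost_start_codon <- function(alt_codon) alt_codon != "ATG"
lost_stop_codon  <- function(alt_codon) !(alt_codon %in% c("TAG", "TAA", "TGA"))

# TSS Kozak: last 3 nt of the 5' UTR followed by the first 5 nt of the CDS
tss_kozak <- function(utr5, cds) {
  n <- nchar(utr5)
  paste0(substr(utr5, n - 2, n), substr(cds, 1, 5))
}

# Build a log-odds PWM from equal-length sequences (uniform background,
# pseudocount of 1). Rows are bases, columns are positions.
build_pwm <- function(seqs, pseudocount = 1) {
  bases <- c("A", "C", "G", "T")
  chars <- do.call(rbind, strsplit(seqs, ""))
  counts <- sapply(seq_len(ncol(chars)), function(j) {
    table(factor(chars[, j], levels = bases))
  })
  smoothed <- counts + pseudocount
  log2(sweep(smoothed, 2, colSums(smoothed), "/") / 0.25)
}

# Score a sequence as the sum of its per-position PWM entries
score_pwm <- function(pwm, seq) {
  chars <- strsplit(seq, "")[[1]]
  sum(pwm[cbind(match(chars, rownames(pwm)), seq_along(chars))])
}

# Toy reference sequences (made up for illustration)
utr5 <- "GGCATGGTTACCATGGCACC"
cds  <- "ATGGCTTAA"
utr3 <- "TTGAATAAACCATTAAAGG"

# Hypothetical T>C substitution at position 5 of the 5' UTR
utr5_alt <- utr5
substr(utr5_alt, 5, 5) <- "C"

print(c(uAUG_ref = count_uAUG(utr5), uAUG_alt = count_uAUG(utr5_alt)))
print(c(uKozak_ref = count_uKozak(utr5), uKozak_alt = count_uKozak(utr5_alt)))
print(count_polyA(utr3))
print(tss_kozak(utr5, cds))

# Placeholder "top" Kozak sequences, not the published set
pwm <- build_pwm(c("ACCATGGC", "GCCATGGC", "ACCATGGA", "AACATGGC"))
print(c(ref = score_pwm(pwm, "ACCATGGC"), alt = score_pwm(pwm, "TCCATGGC")))
```

Comparing the reference and altered values from each helper mirrors how the annotation reports a count or score before and after the variant is applied.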

UTR annotation results

Here is a description of the key columns in the UTR annotation output:

Column name	Description	Values
Transcript	Whether the variant overlaps any protein-coding transcript; if so, lists all transcript ids	NA or a list of transcript ids, separated by ";"
transcript_id	Whether any transcript in the Transcript column has a valid start codon and stop codon; if so, lists all valid transcript ids	NA or a list of transcript ids, separated by ";"
utr3_transcript_id	Whether the variant overlaps the 3' UTR of any transcript; if so, lists all transcript ids	NA / a list of transcript ids, separated by ";"
utr5_transcript_id	Whether the variant overlaps the 5' UTR of any transcript; if so, lists all transcript ids	NA / a list of transcript ids, separated by ";"
cds_transcript_id	Whether the variant overlaps the CDS of any transcript; if so, lists all transcript ids	NA / a list of transcript ids, separated by ";"
num_uAUG	NA if utr5_transcript_id is NA; otherwise, the number of uAUGs in the 5' UTR of each transcript listed in utr5_transcript_id	NA / a list of numbers, separated by ";"
num_uAUG_altered	NA if utr5_transcript_id is NA; otherwise, the number of uAUGs in the altered 5' UTR of each transcript listed in utr5_transcript_id	NA / a list of numbers, separated by ";"
utr_num_uAUG_gainedOrLost	NA if utr5_transcript_id is NA; otherwise, how the number of uAUGs changes between the reference and altered 5' UTR	NA / a list of comparison results (equal / gained / lost)
num_kozak	NA if utr5_transcript_id is NA; otherwise, the number of Kozak sequences in the 5' UTR of each transcript listed in utr5_transcript_id	NA / a list of numbers, separated by ";"
num_kozak_altered	NA if utr5_transcript_id is NA; otherwise, the number of Kozak sequences in the altered 5' UTR of each transcript listed in utr5_transcript_id	NA / a list of numbers, separated by ";"
utr_num_kozak_gainedOrLost	NA if utr5_transcript_id is NA; otherwise, how the number of Kozak sequences changes between the reference and altered 5' UTR	NA / a list of comparison results (equal / gained / lost)
mrl	NA if utr5_transcript_id is NA; otherwise, the predicted MRL for the 5' UTR sequence of each transcript listed in utr5_transcript_id	NA / a list of MRL predictions, separated by ";"
mrl_altered	NA if utr5_transcript_id is NA; otherwise, the predicted MRL for the altered 5' UTR sequence of each transcript listed in utr5_transcript_id	NA / a list of MRL predictions, separated by ";"
mrl_gainedOrLost	NA if utr5_transcript_id is NA; otherwise, whether ribosome load is gained or lost after the alteration	NA / a list of comparison results (equal / gained / lost)
num_polyA_signal	NA if utr3_transcript_id is NA; otherwise, the number of polyA signals in the 3' UTR of each transcript listed in utr3_transcript_id	NA / a list of numbers, separated by ";"
num_polyA_signal_altered	NA if utr3_transcript_id is NA; otherwise, the number of polyA signals in the altered 3' UTR of each transcript listed in utr3_transcript_id	NA / a list of numbers, separated by ";"
num_polyA_signal_gainedOrLost	NA if utr3_transcript_id is NA; otherwise, how the number of polyA signals changes between the reference and altered 3' UTR	NA / a list of comparison results (equal / gained / lost)
stopCodon_transcript_id	Whether the variant falls in the stop codon of any transcript; if so, lists all transcript ids	NA / a list of transcript ids, separated by ";"
stopCodon_positions	NA if stopCodon_transcript_id is NA; otherwise, the coordinates of the three stop codon nucleotides (separated by "|") for each transcript listed in stopCodon_transcript_id	NA / a list of three-number groups, separated by ";"
stop_codon	NA if stopCodon_transcript_id is NA; otherwise, the stop codon sequence of each transcript listed in stopCodon_transcript_id	NA / a list of sequences, separated by ";"
stop_codon_altered	NA if stopCodon_transcript_id is NA; otherwise, the altered stop codon sequence of each transcript listed in stopCodon_transcript_id	NA / a list of sequences, separated by ";"
lost_stop_codon	NA if stopCodon_transcript_id is NA; otherwise, whether the stop codon is lost after the alteration	NA / a list of booleans (TRUE/FALSE), separated by ";"
startCodon_transcript_id	Whether the variant falls in the start codon of any transcript; if so, lists all transcript ids	NA / a list of transcript ids, separated by ";"
startCodon_positions	NA if startCodon_transcript_id is NA; otherwise, the coordinates of the three start codon nucleotides (separated by "|") for each transcript listed in startCodon_transcript_id	NA / a list of three-number groups, separated by ";"
start_codon	NA if startCodon_transcript_id is NA; otherwise, the start codon sequence of each transcript listed in startCodon_transcript_id	NA / a list of sequences, separated by ";"
start_codon_altered	NA if startCodon_transcript_id is NA; otherwise, the altered start codon sequence of each transcript listed in startCodon_transcript_id	NA / a list of sequences, separated by ";"
lost_start_codon	NA if startCodon_transcript_id is NA; otherwise, whether the start codon is lost after the alteration	NA / a list of booleans (TRUE/FALSE), separated by ";"
kozak_transcript_id	Whether the variant overlaps the Kozak region at the translation initiation site of any transcript; if so, lists all transcript ids	NA / a list of transcript ids, separated by ";"
kozak_positions	NA if kozak_transcript_id is NA; otherwise, the coordinates of each nucleotide in the Kozak sequence (separated by "|") for each transcript listed in kozak_transcript_id	NA / a list of eight-number groups, separated by ";"
kozak	NA if kozak_transcript_id is NA; otherwise, the Kozak sequence of each transcript listed in kozak_transcript_id	NA / a list of sequences, separated by ";"
kozak_altered	NA if kozak_transcript_id is NA; otherwise, the altered Kozak sequence of each transcript listed in kozak_transcript_id	NA / a list of sequences, separated by ";"
kozak_score	NA if kozak_transcript_id is NA; otherwise, the score of the Kozak sequence of each transcript listed in kozak_transcript_id	NA / a list of numbers, separated by ";"
kozak_altered_score	NA if kozak_transcript_id is NA; otherwise, the score of the altered Kozak sequence of each transcript listed in kozak_transcript_id	NA / a list of numbers, separated by ";"
tss_kozak_score_gainedOrLost	NA if kozak_transcript_id is NA; otherwise, how the Kozak score changes between the reference and altered sequence	NA / a list of comparison results (equal / gained / lost), separated by ";"
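
Because most columns hold one value per transcript joined by ";", downstream analysis usually starts by splitting them. The sketch below does this in base R for a single made-up output row. The transcript ids and counts are hypothetical, the helper names are not part of the package, and treating "gained" as a higher count in the altered sequence is an assumption.

```r
# Hypothetical annotation row, using column names from the table above
row <- data.frame(
  utr5_transcript_id = "ENST00000000001;ENST00000000002",
  num_uAUG           = "2;0",
  num_uAUG_altered   = "1;0",
  stringsAsFactors   = FALSE
)

# Split one ";"-separated field; an NA field yields an empty vector
split_field <- function(x) {
  if (is.na(x)) return(character(0))
  strsplit(x, ";", fixed = TRUE)[[1]]
}

# One row per transcript, with counts converted to integers
per_tx <- data.frame(
  transcript       = split_field(row$utr5_transcript_id),
  num_uAUG         = as.integer(split_field(row$num_uAUG)),
  num_uAUG_altered = as.integer(split_field(row$num_uAUG_altered)),
  stringsAsFactors = FALSE
)

# Recompute the equal / gained / lost comparison from the two counts
compare_counts <- function(ref, alt) {
  ifelse(alt > ref, "gained", ifelse(alt < ref, "lost", "equal"))
}
per_tx$change <- compare_counts(per_tx$num_uAUG, per_tx$num_uAUG_altered)

print(per_tx)
```

The same splitting pattern applies to the polyA, start codon, stop codon, and Kozak columns, using the matching transcript id column for each group.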

Installation

Install dependencies

cran_pkgs <- c("parallel", "doParallel", "data.table", "readr", "stringr", "vcfR", "dplyr", "tidyr", "keras", "devtools", "reticulate")
bioc_pkgs <- c("biomaRt", "Biostrings", "AnnotationHub", "ensembldb")

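# Install any CRAN packages that are not already installed
for (pkg in cran_pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg)
  }
}

# BiocManager is needed to install the Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install any Bioconductor packages that are not already installed
for (pkg in bioc_pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) {
    BiocManager::install(pkg)
  }
}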

# Load keras, which is used for MRL prediction
library(keras)
# Install Miniconda and the TensorFlow backend
reticulate::install_miniconda()
keras::install_keras(version = "2.2.4", tensorflow = "1.14.0", method = "conda")

# Install the deep learning model data package
devtools::install_bitbucket("jdlabteam/mrl.dl.model")

Install utr.annotation package

# Install the release version from CRAN
install.packages("utr.annotation")

# Or install the latest version from Bitbucket
devtools::install_bitbucket("jdlabteam/utr.annotation")

Usage

See the vignette: Introduction to utr.annotation package

Citation

Y Liu, JD Dougherty. 2021. utR.annotation: a tool for annotating genomic variants that could influence post-transcriptional regulation. bioRxiv doi: 10.1101/2021.06.23.449510

Functions in utr.annotation (1.0.4)