utr.annotation R package

This package annotates potentially deleterious variants in the UTRs of human and mouse transcripts. Given a variant file in CSV or VCF format, utr.annotation reports, for each variant, whether and how it alters six known translational regulators: upstream Open Reading Frames (uORFs), upstream Kozak sequences, polyA signals, the Kozak sequence at the annotated translation initiation site, the start codon, and the stop codon. It also reports conservation scores at the variant position and whether and how the variant changes predicted ribosome loading, based on a model trained on empirical data.

The package has been tested on Linux and macOS (Intel chip). On Windows, retrieving conservation scores and predicting ribosome loading may not work; you can skip those two steps.
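As a rough illustration of the platform caveat, a script could decide up front which optional steps to attempt. The helper below is hypothetical and not part of utr.annotation; it only inspects the operating system type and assumes that anything other than Windows is supported.

```r
# Hypothetical helper (not part of utr.annotation): report which optional
# steps are worth attempting on a given operating system type.
optional_steps <- function(os_type = .Platform$OS.type) {
  # Assumption: only Windows is treated as unsupported for these two steps.
  supported <- !identical(os_type, "windows")
  c(conservation_scores = supported, mrl_prediction = supported)
}

# Inspect the current platform.
optional_steps()
```

The result is a named logical vector, so a calling script can branch on each entry separately.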

Input

Here is an example input file with the columns Chr, Pos, Ref, and Alt:

ChrPosRefAlt
chr11308643CATC
chr1269693767GT
chr124685084AG
chr1545691267AG
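Before running an annotation it can help to check that a CSV input has the expected shape. The sketch below is only an illustration: the two rows are invented, `check_variant_table` is a hypothetical helper rather than a package function, and the assumption that alleles contain only A, C, G, T, or N may be stricter than what the package accepts.

```r
# Invented two-row variant table in the Chr/Pos/Ref/Alt layout.
csv_text <- "Chr,Pos,Ref,Alt
chr1,100,A,G
chr2,200,CAT,C"

# Force column types so that an allele such as "T" is not parsed as TRUE.
variants <- read.csv(
  text = csv_text,
  colClasses = c(Chr = "character", Pos = "integer",
                 Ref = "character", Alt = "character")
)

# Hypothetical validator: stops with a message if the table looks malformed.
check_variant_table <- function(df) {
  required <- c("Chr", "Pos", "Ref", "Alt")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (!is.numeric(df$Pos)) stop("Pos must be numeric")
  # Assumption: alleles are plain nucleotide strings.
  if (!all(grepl("^[ACGTN]+$", c(df$Ref, df$Alt)))) {
    stop("Ref and Alt must contain only A, C, G, T, or N")
  }
  invisible(TRUE)
}

check_variant_table(variants)
```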

Feature implementation

Here is a description of the key terms

| Term | Definition & Implementation |
| --- | --- |
| uAUG | An "ATG" sequence in the 5' UTR |
| # uAUG | Number of uAUGs in the 5' UTR |
| uKozak | A 7nt sequence [GA]..ATGG in the 5' UTR |
| # uKozak | Number of uKozak sequences in the 5' UTR |
| PolyA signal | A 6nt sequence, AATAAA or ATTAAA |
| # PolyA signal | Number of PolyA signals in the 3' UTR |
| Stop codon | The stop codon of a transcript is the last three nucleotides of its CDS. |
| Lost stop codon | The tool annotates whether a variant disrupts the stop codon. If the altered stop codon sequence of a transcript is not "TAG", "TAA", or "TGA", the tool annotates lost_stop_codon as TRUE. |
| Start codon | The start codon of a transcript is the first three nucleotides of its CDS. |
| Lost start codon | The tool annotates whether a variant disrupts the start codon. If the altered start codon sequence is not "ATG", the tool annotates lost_start_codon as TRUE. |
| TSS Kozak | The TSS Kozak sequence of a transcript is defined in this paper as an 8nt sequence: the first three nucleotides are the last three nucleotides of its 5' UTR sequence, and the following five nucleotides are the first five nucleotides of its coding sequence. |
| TSS Kozak score | We empirically defined a Position Weight Matrix (PWM) for an ideal Kozak sequence using the top 20 Kozak sequences described by Sample et al., 2019, taken from the uORFs that recruited ribosomes most efficiently. We then compare the ref and alt TSS Kozak sequences to this PWM to calculate a score for each and to determine whether a variant alters the Kozak score. |
| MRL | Mean ribosome load. The MRL of a transcript is predicted by a CNN model from the 100nt of 5' UTR sequence upstream of its start codon. The CNN was trained on a massively parallel reporter assay measuring the impact of 5' UTR sequences on ribosome loading (Sample et al., 2019). |
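The sequence-based definitions above can be sketched in a few lines of base R. This is an illustration of the definitions, not the package's implementation: all function names are hypothetical, the sequences are invented, input is assumed to be uppercase DNA on the transcript strand, and overlapping motif occurrences are counted separately (whether the package does the same is an assumption).

```r
# Count occurrences of a regex motif, including overlapping ones,
# by matching a zero-width lookahead at every start position.
count_motif <- function(seq, pattern) {
  hits <- gregexpr(paste0("(?=(?:", pattern, "))"), seq, perl = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# A stop codon is lost if the altered codon is no longer TAG, TAA, or TGA.
is_lost_stop <- function(alt_codon) !(alt_codon %in% c("TAG", "TAA", "TGA"))

# A start codon is lost if the altered codon is no longer ATG.
is_lost_start <- function(alt_codon) alt_codon != "ATG"

# TSS Kozak: last 3 nt of the 5' UTR followed by the first 5 nt of the CDS.
tss_kozak <- function(utr5, cds) {
  n <- nchar(utr5)
  paste0(substr(utr5, n - 2, n), substr(cds, 1, 5))
}

# Invented example sequences.
utr5 <- "GCCATGGTTATGC"
utr3 <- "TTAATAAAGGATTAAAC"

count_motif(utr5, "ATG")            # uAUG count
count_motif(utr5, "[GA]..ATGG")     # uKozak count
count_motif(utr3, "AATAAA|ATTAAA")  # polyA signal count
is_lost_stop("TGG")                 # e.g. TGA altered to TGG
tss_kozak("GGAGCCACC", "ATGGCGTAA")
```

The lookahead form matters for motifs that can overlap themselves; a plain `gregexpr(pattern, seq)` would skip a second match that starts inside the first.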

UTR annotation results

Here is a description of the key annotation columns output from the UTR annotation.

| Column name | Description | Values |
| --- | --- | --- |
| Transcript | Whether the variant overlaps with any protein-coding transcript; if so, lists all transcript ids | NA / a list of transcript ids, separated by ";" |
| transcript_id | Whether any transcript in the Transcript column has a valid start codon and stop codon; if so, lists all valid transcript ids | NA / a list of transcript ids, separated by ";" |
| utr3_transcript_id | Whether the variant overlaps with the 3' UTR of any transcript; if so, lists all transcript ids | NA / a list of transcript ids, separated by ";" |
| utr5_transcript_id | Whether the variant overlaps with the 5' UTR of any transcript; if so, lists all transcript ids | NA / a list of transcript ids, separated by ";" |
| cds_transcript_id | Whether the variant overlaps with the CDS of any transcript; if so, lists all transcript ids | NA / a list of transcript ids, separated by ";" |
| num_uAUG | NA if utr5_transcript_id is NA; otherwise, the count of AUG in the 5' UTR of each transcript listed in utr5_transcript_id | NA / a list of numbers, separated by ";" |
| num_uAUG_altered | NA if utr5_transcript_id is NA; otherwise, the count of AUG in the altered 5' UTR of each transcript listed in utr5_transcript_id | NA / a list of numbers, separated by ";" |
| utr_num_uAUG_gainedOrLost | NA if utr5_transcript_id is NA; otherwise, how the count differs before and after the alteration | NA / a list of comparison results (equal / gained / lost) |
| num_kozak | NA if utr5_transcript_id is NA; otherwise, the count of Kozak sequences in the 5' UTR of each transcript listed in utr5_transcript_id | NA / a list of numbers, separated by ";" |
| num_kozak_altered | NA if utr5_transcript_id is NA; otherwise, the count of Kozak sequences in the altered 5' UTR of each transcript listed in utr5_transcript_id | NA / a list of numbers, separated by ";" |
| utr_num_kozak_gainedOrLost | NA if utr5_transcript_id is NA; otherwise, how the count differs before and after the alteration | NA / a list of comparison results (equal / gained / lost) |
| mrl | NA if utr5_transcript_id is NA; otherwise, the MRL prediction for the 5' UTR sequence of each transcript listed in utr5_transcript_id | NA / a list of MRL predictions, separated by ";" |
| mrl_altered | NA if utr5_transcript_id is NA; otherwise, the MRL prediction for the altered 5' UTR sequence of each transcript listed in utr5_transcript_id | NA / a list of MRL predictions, separated by ";" |
| mrl_gainedOrLost | NA if utr5_transcript_id is NA; otherwise, whether ribosome load is gained or lost after the alteration | NA / a list of comparison results (equal / gained / lost) |
| num_polyA_signal | NA if utr3_transcript_id is NA; otherwise, the count of polyA signals in the 3' UTR of each transcript listed in utr3_transcript_id | NA / a list of numbers, separated by ";" |
| num_polyA_signal_altered | NA if utr3_transcript_id is NA; otherwise, the count of polyA signals in the altered 3' UTR of each transcript listed in utr3_transcript_id | NA / a list of numbers, separated by ";" |
| num_polyA_signal_gainedOrLost | NA if utr3_transcript_id is NA; otherwise, how the count differs before and after the alteration | NA / a list of comparison results (equal / gained / lost) |
| stopCodon_transcript_id | Whether the variant falls in a stop codon region; if so, lists all transcript ids | NA / a list of transcript ids, separated by ";" |
| stopCodon_positions | NA if stopCodon_transcript_id is NA; otherwise, the coordinates of the three stop codon nucleotides (separated by "\|") for each transcript listed in stopCodon_transcript_id | NA / a list of three-number groups, separated by ";" |
| stop_codon | NA if stopCodon_transcript_id is NA; otherwise, the stop codon sequence of each transcript listed in stopCodon_transcript_id | NA / a list of sequences, separated by ";" |
| stop_codon_altered | NA if stopCodon_transcript_id is NA; otherwise, the altered stop codon sequence of each transcript listed in stopCodon_transcript_id | NA / a list of sequences, separated by ";" |
| lost_stop_codon | NA if stopCodon_transcript_id is NA; otherwise, whether the stop codon is lost after the alteration | NA / a list of booleans (TRUE/FALSE), separated by ";" |
| startCodon_transcript_id | Whether the variant falls in a start codon region; if so, lists all transcript ids | NA / a list of transcript ids, separated by ";" |
| startCodon_positions | NA if startCodon_transcript_id is NA; otherwise, the coordinates of the three start codon nucleotides (separated by "\|") for each transcript listed in startCodon_transcript_id | NA / a list of three-number groups, separated by ";" |
| start_codon | NA if startCodon_transcript_id is NA; otherwise, the start codon sequence of each transcript listed in startCodon_transcript_id | NA / a list of sequences, separated by ";" |
| start_codon_altered | NA if startCodon_transcript_id is NA; otherwise, the altered start codon sequence of each transcript listed in startCodon_transcript_id | NA / a list of sequences, separated by ";" |
| lost_start_codon | NA if startCodon_transcript_id is NA; otherwise, whether the start codon is lost after the alteration | NA / a list of booleans (TRUE/FALSE), separated by ";" |
| kozak_transcript_id | Whether the variant overlaps with the Kozak region at the translation initiation site of any transcript; if so, lists all transcript ids | NA / a list of transcript ids, separated by ";" |
| kozak_positions | NA if kozak_transcript_id is NA; otherwise, the coordinates of each nucleotide in the Kozak sequence (separated by "\|") for each transcript listed in kozak_transcript_id | NA / a list of eight-number groups, separated by ";" |
| kozak | NA if kozak_transcript_id is NA; otherwise, the Kozak sequence of each transcript listed in kozak_transcript_id | NA / a list of sequences, separated by ";" |
| kozak_altered | NA if kozak_transcript_id is NA; otherwise, the altered Kozak sequence of each transcript listed in kozak_transcript_id | NA / a list of sequences, separated by ";" |
| kozak_score | NA if kozak_transcript_id is NA; otherwise, the score of the Kozak sequence of each transcript listed in kozak_transcript_id | NA / a list of numbers, separated by ";" |
| kozak_altered_score | NA if kozak_transcript_id is NA; otherwise, the score of the altered Kozak sequence of each transcript listed in kozak_transcript_id | NA / a list of numbers, separated by ";" |
| tss_kozak_score_gainedOrLost | NA if kozak_transcript_id is NA; otherwise, how the Kozak score differs before and after the alteration | NA / a list of comparison results (equal / gained / lost), separated by ";" |
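Because most output cells pack one value per transcript into a ";"-separated string, downstream code usually has to split them before comparing. The following sketch shows one way to do that and to reproduce an equal / gained / lost comparison from a pair of count columns. The helper names and the example row are invented, and joining the comparison results with ";" is an assumption.

```r
# Split one ";"-separated cell into per-transcript values; NA stays NA.
split_cell <- function(cell) {
  if (is.na(cell)) return(NA_character_)
  strsplit(cell, ";", fixed = TRUE)[[1]]
}

# Compare reference and altered counts transcript by transcript.
compare_counts <- function(ref_cell, alt_cell) {
  if (is.na(ref_cell) || is.na(alt_cell)) return(NA_character_)
  ref_n <- as.numeric(split_cell(ref_cell))
  alt_n <- as.numeric(split_cell(alt_cell))
  stopifnot(length(ref_n) == length(alt_n))
  status <- ifelse(alt_n > ref_n, "gained",
                   ifelse(alt_n < ref_n, "lost", "equal"))
  paste(status, collapse = ";")
}

# Invented result row: two transcripts with their uAUG counts.
result_row <- list(
  utr5_transcript_id = "ENST_A;ENST_B",
  num_uAUG = "2;0",
  num_uAUG_altered = "1;1"
)

split_cell(result_row$utr5_transcript_id)
compare_counts(result_row$num_uAUG, result_row$num_uAUG_altered)
```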

Installation

Install dependencies

cran_pkgs <- c("parallel", "doParallel", "data.table", "readr", "stringr", "vcfR", "dplyr", "tidyr", "keras", "devtools", "reticulate")
bioc_pkgs <- c("biomaRt", "Biostrings", "AnnotationHub", "ensembldb")

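# Install any CRAN dependency that is not yet installed.
# rownames(installed.packages()) holds the installed package names.
for (pkg in cran_pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg)
  }
}

# BiocManager is needed to install the Bioconductor dependencies.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

for (pkg in bioc_pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) {
    BiocManager::install(pkg)
  }
}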

# Load keras, which is used for MRL prediction
library(keras)
# Install Miniconda and the TensorFlow backend
reticulate::install_miniconda()
keras::install_keras(version = "2.2.4", tensorflow = "1.14.0", method = "conda")

# Install the deep learning model data package
devtools::install_bitbucket("jdlabteam/mrl.dl.model")

Install utr.annotation package

# Install the release version from CRAN
install.packages("utr.annotation")

# Or install the latest version from Bitbucket
devtools::install_bitbucket("jdlabteam/utr.annotation")
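After installation, a quick check can confirm that everything is in place. This sketch uses only base R; `missing_packages` is a hypothetical helper, and the package list simply repeats the names used in the installation steps above.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
# system.file() returns "" for a package that cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(pkg) nzchar(system.file(package = pkg)),
    logical(1)
  )
  pkgs[!installed]
}

needed <- c("doParallel", "data.table", "readr", "stringr", "vcfR", "dplyr",
            "tidyr", "keras", "reticulate", "biomaRt", "Biostrings",
            "AnnotationHub", "ensembldb", "mrl.dl.model", "utr.annotation")

# An empty result means every listed package was found.
missing_packages(needed)
```

This only checks that the packages are present; it does not verify that the TensorFlow backend itself is usable.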

Usage

Introduction to utr.annotation package

Citation

Y Liu, JD Dougherty. 2021. utR.annotation: a tool for annotating genomic variants that could influence post-transcriptional regulation. bioRxiv doi: 10.1101/2021.06.23.449510

Version: 1.0.4

License: GPL (>= 3)

Maintainer: Yating Liu

Last Published: August 23rd, 2021

Functions in utr.annotation (1.0.4)

getCodonOneVariant

Get the codon sequence of transcripts in transcriptIdColumn for one variant.
initUTRAnnotation

Query transcripts regions and sequences from Ensembl database
getAltSequence

Check if Ref matches the actual sequence; if so, return the altered sequence after replacing Ref with Alt. The sequence can consist of multiple fragments; only the fragment where pos >= fragment_start & pos <= fragment_end is checked.
getKozakPWM

Get Kozak PWM
queryEnsemblInfo

Query transcript regions, UTR sequences, and coding sequences from Ensembl database
checkIfGainOrLoseAfterAltOneVariant

Check if the number of uAUG, polyA signal, or Kozak in UTR changes after alt for one variant
getRegionsForDiscretePos

Get codon region table used for getAltSequence
getLatestEnsemblVersion

Get the latest Ensembl version that is available by querying both biomaRt and AnnotationHub
getTranscriptIdsForOneTSSKozakVariant

Find the ensembl_transcript_id and kozak_positions for one variant in TSS Kozak region (8nt) [AG]..AUGG.
checkIfRefMatchAndAltFragment

Check if the nucleotides in Ref column match the actual nucleotides in the fragment
checkInputValid

Check if the variable table is valid for UTR annotation
concatenateAnnotationResult

Concatenate annotation result files into one file
checkIfGainOrLoseAfterAlt

Check if the number of uAUG, polyA signal, or Kozak in UTR changes after alt
getCodon

Get the codon sequence for transcripts in transcriptIdColumn.
getSeqTable

Get UTR sequences of a list of transcripts
countDNAPatternInAltOneVariant

Count the occurrences of a DNA pattern in the altered sequence of one transcript.
predictMRLOneVariant

Predict MRL for one variant
getTrasncriptsRegions

Get information on transcript regions, UTRs regions, coding regions, chromosome name, and strand from Ensembl database
get_conservation_scores

Get conservation scores for variants
countDNAPattern

Count the occurrences of a DNA pattern for each transcript, and concatenate the numbers with ";".
checkIfLostCodonAfterAltOneVariant

Check if the codon is lost after alt for one variant.
getCodonInAlt

Get the altered codon sequence for transcripts in transcriptIdColumn.
inverse_transform

Reverse the scaling of the predicted MRL
getFeatureFromInfo

Get feature information from INFO column of VCF file
getTranscriptIdsOneVariant

Search the db to find protein coding transcript ids that overlap with a variant
getTSSKozak

Get the Kozak sequence in TSS region for transcripts in transcriptIdColumn.
validateTranscripts

Check whether a transcript is valid
countDNAPatternOneVariant

Count the occurrences of a DNA pattern for one variant
getTSSKozakOneVariant

Get the Kozak sequence in TSS region for transcripts in transcriptIdColumn for one variant.
init_backend

Create parallel backend with user specified number of CPUs
readVCFData

Read variants from a VCF formatted file
createEnsDbFromAH

Create an EnsDb database file from AnnotationHub
getOrgName

Convert species name to AnnotationHub acceptable species name
getTranscriptIdsForCodonVariants

Find the ensembl_transcript_id and codon positions for variants
getTranscriptIds

Get protein coding transcript ids that overlap with each variant in the variant table
getTranscriptIdsForOneCodonVariant

Find the ensembl_transcript_id and codon positions for one variant
readVariantData

Read variant file in CSV format
predictMRLInAltOneVariant

Predict altered MRL for one variant
one_hot_encode

Apply one-hot encoding to the 100nt 5' UTR sequence
splitRowsIfMultiFeature

Split rows if the feature column(s) contains multiple items, separated by sep
getEnsembl

Get Ensembl database object for the specified species
countDNAPatternInAlt

Count the occurrences of a DNA pattern in the altered sequence of each transcript.
getTranscriptIdsForUTRVariants

Find the ensembl_transcript_id for variants in UTRs, genes, or CDS
predictMRLInAlt

Get the mutated 100nt 5' UTR sequence upstream of start codon for transcripts in transcriptIdColumn and predict their MRL
getCodonInAltOneVariant

Get the altered codon sequence for transcripts in transcriptIdColumn for one variant.
partitionVariantFile

Partition the variant file to run in parallel
checkIfLostCodonAfterAlt

Check if the codon is lost after alt.
getTranscriptIdsForOneUTRVariant

Find all ensembl_transcript_ids for which one variant falls in the specified region
getKozakScoreOneVariant

Calculate the Kozak score for each Kozak sequence, and concatenate the numbers with ";" for one variant
runUTRAnnotation

Run UTR annotation on a variant file
getKozakScore

Calculate the Kozak score for each Kozak sequence, and concatenate the numbers with ",".
getTranscriptIdsForTSSKozakVariants

Find the ensembl_transcript_id and kozak_positions for variants in TSS Kozak region (8nt) [AG]..AUGG.
predictMRL

Get the 100nt 5' UTR sequence upstream of start codon for transcripts in transcriptIdColumn and predict their MRL
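Several of the helpers listed above follow the same pattern: build the altered sequence by substituting Alt for Ref (as described for getAltSequence), then score the reference and altered versions (as described for getKozakScore). The sketch below imitates that flow for a TSS Kozak sequence. It does not call the package: every function name is hypothetical, the example sequences and coordinates are invented, the fragment is assumed to be contiguous and on the plus strand, and the log-odds scoring against a uniform background is an assumption rather than the package's actual scoring rule.

```r
# Replace Ref with Alt inside one contiguous, plus-strand fragment.
# Returns NA if pos is outside the fragment or Ref does not match it.
apply_variant <- function(fragment, fragment_start, pos, ref, alt) {
  fragment_end <- fragment_start + nchar(fragment) - 1
  if (pos < fragment_start || pos > fragment_end) return(NA_character_)
  offset <- pos - fragment_start + 1
  if (substr(fragment, offset, offset + nchar(ref) - 1) != ref) {
    return(NA_character_)
  }
  paste0(substr(fragment, 1, offset - 1), alt,
         substr(fragment, offset + nchar(ref), nchar(fragment)))
}

# Build a toy PWM (log2 odds against a uniform 0.25 background)
# from equal-length sequences, with a pseudocount per base.
build_pwm <- function(seqs, pseudocount = 1) {
  bases <- c("A", "C", "G", "T")
  chars <- do.call(rbind, strsplit(seqs, ""))
  counts <- vapply(seq_len(ncol(chars)), function(j) {
    as.numeric(table(factor(chars[, j], levels = bases))) + pseudocount
  }, numeric(4))
  rownames(counts) <- bases
  log2(sweep(counts, 2, colSums(counts), "/") / 0.25)
}

# Score a sequence as the sum of the PWM entries it selects.
score_pwm <- function(pwm, seq) {
  chars <- strsplit(seq, "")[[1]]
  stopifnot(length(chars) == ncol(pwm))
  sum(pwm[cbind(match(chars, rownames(pwm)), seq_along(chars))])
}

# Invented training sequences and an invented variant.
toy_kozaks <- c("ACCATGGC", "GCCATGGC", "ACCATGGA", "GCCATGGG")
pwm <- build_pwm(toy_kozaks)

ref_kozak <- "ACCATGGC"
alt_kozak <- apply_variant(ref_kozak, fragment_start = 5000,
                           pos = 5006, ref = "G", alt = "T")

ref_score <- score_pwm(pwm, ref_kozak)
alt_score <- score_pwm(pwm, alt_kozak)
if (alt_score > ref_score) "gained" else if (alt_score < ref_score) "lost" else "equal"
```

Returning NA on a Ref mismatch keeps a mis-specified variant from silently producing an altered sequence.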