predictCoding: Predict amino acid coding changes for variants

Description

Predict amino acid coding changes for variants a coding regions

Usage

"predictCoding"(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
"predictCoding"(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
"predictCoding"(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
"predictCoding"(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)

Arguments

query

A VCF, Ranges or GRanges instance containing the variants to be annotated. If a Ranges is provided it will be coerced to a GRanges. If a VCF is provided the GRanges from the rowData slot will be used. All elementMetadata columns are ignored.

When the query is not a VCF object a varAllele must be provided. The varAllele must be a DNAStringSet the same length as the query. If there are multiple alternate alleles per variant the query must be expanded to one row per each alternate allele. See examples.

subject

A TxDb object that serves as the annotation. GFF files can be converted to TxDb objects with makeTranscriptDbFromGFF() in the GenomicFeatures package.

seqSource

A BSgenome instance or an FaFile to be used for sequence extraction.

varAllele

A DNAStringSet containing the variant (alternate) alleles. The length of varAllele must equal the length of query. Missing values are represented by a zero width empty element. Ranges with missing varAllele values are ignored; those with an ‘N’ character are not translated.

When the query is a VCF object the varAllele argument will be missing. This value is taken internally from the VCF with alt().

...

Additional arguments passed to methods. Arguments genetic.code and if.fuzzy.codon are supported for the translate function.

ignore.strand

A logical indicating if strand should be ignored when performing overlaps.

Value

A GRanges with a row for each variant-transcript match. The result includes only variants that fell within coding regions. The strand of the output GRanges represents the strand of the subject hit.At a minimum, the elementMetadata columns include,

varAllele: Variant allele. This value is reverse complemented for an unstranded query that overlaps a subject on the negative strand.
QUERYID: Map back to the row in the original query
TXID: Internal transcript id from the annotation
CDSID: Internal coding region id from the annotation
GENEID: Internal gene id from the annotation
CDSLOC: Variant location in coding region-based coordinates. This position is relative to the start of the coding (cds) region defined in the subject annotation.
PROTEINLOC: Variant codon triplet location in coding region-based coordinates. This position is relative to the start of the coding (cds) region defined in the subject annotation.
CONSEQUENCE: Possible values are `synonymous', `nonsynonymous', `frameshift', `nonsense' and `not translated'. Variant sequences are translated only when the codon sequence is a multiple of 3. The value will be `frameshift' when a sequence is of incompatible length and it will be `not translated' when the varAllele is missing or there is an ‘N’ in the sequence. `nonsense' is used for premature stop codons.
REFCODON: The reference codon sequence. This range is typically greater than the width of the range in the GRanges because it includes all codons involved in the sequence modification. If the reference sequence is of width 2 but the alternate allele is of width 4 then at least two codons must be included in the REFSEQ.
VARCODON: This sequence is the result of inserting, deleting or replacing the position(s) in the reference sequence alternate allele. If the result of this modifiction is not a multiple of 3 no translation is performed and the VARAA value will be missing.
REEFAA: The reference amino acid column contains the translated REFSEQ.
VARAA: The variant amino acid column contains the translated VARSEQ. When translation is not possible this value is missing.

Details

This function returns the amino acid coding for variants that fall completely `within' a coding region. The reference sequences are taken from a fasta file or BSgenome. The width of the reference is determined from the start postion and width of the range in the query. For guidance on how to represent an insertion, deletion or substitution see the 1000 Genomes VCF format (references).

Variant alleles are taken from the varAllele when supplied. When the query is a VCF object the varAllele will be missing. This value is taken internally from the VCF with alt(). The variant allele is substituted into the reference sequences and transcribed. Transcription only occurs if the substitution, insertion or deletion results in a new sequence length divisible by 3.

When the query is an unstranded (*) GRanges predictCoding will attempt to find overlaps on both the positive and negative strands of the subject. When the subject hit is on the negative strand the return varAllele is reverse complemented. The strand of the returned GRanges represents the strand of the subject hit.

References

http://vcftools.sourceforge.net/specs.html

Examples

Run this code

  library(BSgenome.Hsapiens.UCSC.hg19)
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene 

  ## ----------------------------
  ## VCF object as query 
  ## ----------------------------
  ## Read variants from a VCF file 
  fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
  vcf <- readVcf(fl, "hg19")

  ## Rename seqlevels in the VCF object to match those in the TxDb.
  vcf <- renameSeqlevels(vcf, "chr22")
  ## Confirm common seqlevels
  intersect(seqlevels(vcf), seqlevels(txdb))
 
  ## When 'query' is a VCF object the varAllele argument is missing.
  coding1 <- predictCoding(vcf, txdb, Hsapiens)
  head(coding1, 3)

  ## Exon-centric or cDNA locations:
  exonsbytx <- exonsBy(txdb, "tx")
  cDNA <- mapCoords(coding1[!duplicated(ranges(coding1))], exonsbytx)
  coding1$cDNA <- ranges(cDNA)[togroup(coding1$QUERYID)]

  ## ----------------------------
  ## GRanges object as query 
  ## ----------------------------
  ## Alternatively, a GRanges can be the 'query' to predictCoding().
  ## The seqlevels were previously adjusted in the VCF object so the GRanges
  ## extracted from rowData() has the correct seqlevels.
  rd <- rowData(vcf)
 
  ## The GRanges must be expanded to have one row per alternate allele. 
  ## Variants 1, 2 and 10 have two alternate alleles.
  altallele <- alt(vcf)
  eltlen <- elementLengths(altallele)
  rd_exp <- rep(rd, eltlen)
 
  ## Call predictCoding() with the expanded GRanges as the 'query'
  ## and the unlisted alternate allele as the 'varAllele'.
  coding2 <- predictCoding(rd_exp, txdb, Hsapiens, unlist(altallele)) 

  identical(coding1, coding2)

Run the code above in your browser using DataLab