Learn R Programming

GenomicFeatures (version 1.18.7)

extractTranscriptSeqs: Extract transcript sequences from chromosomes

Description

extractTranscriptSeqs is a generic function for extracting transcript sequences from an object representing a single chromosome (e.g. a DNAString object) or a collection of chromosomes (e.g. a BSgenome object).

Usage

extractTranscriptSeqs(x, transcripts, ...)
"extractTranscriptSeqs"(x, transcripts, strand="+")
"extractTranscriptSeqs"(x, transcripts)

Arguments

x
A DNAString or BSgenome object or any object with a getSeq method.
transcripts
An object representing the exon ranges of each transcript to extract. It must be an RangesList object when x is a DNAString object.

It must be a GRangesList or TxDb object when x is a BSgenome object. If the latter, it's first turned into a GRangesList object with exonsBy(transcripts, by="tx", use.names=TRUE).

...
Additional arguments, for use in specific methods.
strand
Only supported when x is a DNAString object.

Can be an atomic vector, a factor, or an Rle object, in which case it indicates the strand of each transcript (i.e. all the exons in a transcript are considered to be on the same strand). More precisely: it's turned into a factor (or factor-Rle) that has the "standard strand levels" (this is done by calling the strand function on it). Then it's recycled to the length of RangesList object transcripts if needed. In the resulting object, the i-th element is interpreted as the strand of all the exons in the i-th transcript.

strand can also be a list-like object, in which case it indicates the strand of each exon, individually. Thus it must have the same shape as RangesList object transcripts (i.e. same length plus strand[[i]] must have the same length as transcripts[[i]] for all i).

strand can only contain "+" and/or "-" values. "*" is not allowed.

Value

A DNAStringSet object parallel to transcripts, that is, the i-th element in the returned object is the sequence of the i-th transcript in transcripts.

See Also

  • The transcriptLocs2refLocs function for converting transcript-based locations into reference-based locations.

  • The available.genomes function in the BSgenome package for checking avaibility of BSgenome data packages (and installing the desired one).
  • The GRangesList class defined and documented in the GenomicRanges package.

  • The RangesList class defined and documented in the IRanges package.

  • The exonsBy function for extracting exon ranges grouped by transcript.

  • The DNAString and DNAStringSet classes defined and documented in the Biostrings package.

  • The translate function in the Biostrings package for translating DNA or RNA sequences into amino acid sequences.

Examples

Run this code
## ---------------------------------------------------------------------
## A FIRST EXAMPLE
## ---------------------------------------------------------------------

## Load a genome:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19

## Load a TxDb object:
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
                         package="GenomicFeatures")
txdb <- loadDb(txdb_file)

## Check that 'txdb' is based on the hg19 assembly:
txdb

## Extract the exon ranges grouped by transcript from 'txdb':
transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)

## Extract the transcript sequences from the genome:
tx_seqs <- extractTranscriptSeqs(genome, transcripts)
tx_seqs

## A sanity check:
stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts)))))

## ---------------------------------------------------------------------
## USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES
## ---------------------------------------------------------------------

cds <- cdsBy(txdb, by="tx", use.names=TRUE)
cds_seqs <- extractTranscriptSeqs(genome, cds)
cds_seqs

## A sanity check:
stopifnot(identical(width(cds_seqs), unname(sum(width(cds)))))

## Note that, alternatively, the CDS sequences can be obtained from the
## transcript sequences by removing the 5' and 3' UTRs:
five_utr_width <- sum(width(fiveUTRsByTranscript(txdb, use.names=TRUE)))
five_utr_width <- five_utr_width[names(cds_seqs)]
five_utr_width[is.na(five_utr_width)] <- 0L
three_utr_width <- sum(width(threeUTRsByTranscript(txdb, use.names=TRUE)))
three_utr_width <- three_utr_width[names(cds_seqs)]
three_utr_width[is.na(three_utr_width)] <- 0L
cds_seqs2 <- narrow(tx_seqs[names(cds_seqs)],
                    start=five_utr_width+1L,
                    end=-(three_utr_width+1L))
stopifnot(identical(as.character(cds_seqs), as.character(cds_seqs2)))

## ---------------------------------------------------------------------
## TRANSLATE THE CDS SEQUENCES
## ---------------------------------------------------------------------

prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve")

## Note that, by default, translate() uses The Standard Genetic Code to
## translate codons into amino acids. However, depending on the organism,
## a different genetic code might be needed to translate CDS sequences
## located on the mitochodrial chromosome. For example, for vertebrates,
## the following code could be used to correct 'prot_seqs':
SGC1 <- getGeneticCode("SGC1")
chrM_idx <- which(all(seqnames(cds) == "chrM"))
prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1,
                                 if.fuzzy.codon="solve")

Run the code above in your browser using DataLab