rphast (version 1.0)

read.msa: Reading an MSA Object

Description

Reads an MSA from a file.

Usage

read.msa(filename, format=c(guess.format.msa(filename), "FASTA")[1],
    alphabet=NULL, features=NULL, do.4d=FALSE, ordered=ifelse(do.4d ||
    !is.null(features), FALSE, TRUE), tuple.size=(if (do.4d) 3 else
    NULL), do.cats=NULL, refseq=NULL, offset=0, seqnames=NULL,
    discard.seqnames=NULL, pointer.only=FALSE)

Arguments

filename
The name of the input file containing an alignment.
format
Input file format: one of "FASTA", "MAF", "SS", "PHYLIP", or "MPM". By default the format is guessed from the file with guess.format.msa; if given explicitly, it must be correct.
alphabet
The alphabet of non-missing-data characters in the alignment. Determined automatically from the alignment if not given.
features
An object of type feat. If provided, the return value will contain only the portions of the alignment that fall within a feature, and the alignment will not be ordered. The loaded regions can be further constrained with the do.4d or do.cats options.
do.4d
Logical. If TRUE, the return value will contain only the columns corresponding to four-fold degenerate sites. Requires features to be specified.
ordered
Logical. If FALSE, the MSA object may not retain the original column order.
tuple.size
Integer. If given, and if pointer.only is TRUE, the MSA will be stored in sufficient statistics format, where each tuple contains tuple.size consecutive columns of the alignment.
do.cats
Character vector. If given, and if features is specified, then only the types of features named here will be represented in the returned alignment.
refseq
Character string specifying a FASTA format file with a reference sequence. If given, the reference sequence will be "filled in" wherever it is missing from the alignment.
offset
An integer giving the offset of the reference sequence from the beginning of the chromosome. Not used for MAF format.
seqnames
A character vector. If provided, discard any sequence in the MSA that is not named here. This is only implemented efficiently for MAF input files, and in that case the reference sequence must be named.
discard.seqnames
A character vector. If provided, discard the sequences named here. This is only implemented efficiently for MAF input files, and in that case the reference sequence must NOT be discarded.
pointer.only
If TRUE, MSA will be stored by reference as an external pointer to an object created by C code, rather than directly in R memory. This improves performance and may be necessary for large alignments, but reduces functionality. See

Value

  • an MSA object.

See Also

msa, read.feat

Examples

exampleArchive <- system.file("extdata", "examples.zip", package="rphast")
files <- c("ENr334.maf", "ENr334.fa", "gencode.ENr334.gff")
unzip(exampleArchive, files)

# Read a FASTA file, ENr334.fa.
# This file represents a 4-way alignment of the ENCODE region
# ENr334 starting from hg18 chr6 position 41405894
idx.offset <- 41405894
m1 <- read.msa("ENr334.fa", offset=idx.offset)
m1

# Now read in only a subset represented in a feature file
f <- read.feat("gencode.ENr334.gff")
f$seqname <- "hg18"  # need to tweak source name to match name in alignment
m1 <- read.msa("ENr334.fa", features=f, offset=idx.offset)

# Can also subset on certain features
do.cats <- c("CDS", "5'flank", "3'flank")
m1 <- read.msa("ENr334.fa", features=f, offset=idx.offset,
               do.cats=do.cats)

# Can read MAFs similarly, but don't need offset because
# MAF file is annotated with coordinates
m2 <- read.msa("ENr334.maf", features=f, do.cats=do.cats)
# Also, note that when features is given and the file is
# in MAF format, the first sequence is automatically
# stripped of gaps
ncol.msa(m1)
ncol.msa(m2)
ncol.msa(m1, "hg18")
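
# A hedged sketch beyond the original example: for large alignments,
# pointer.only=TRUE stores the MSA by reference as an external pointer
# to a C object rather than directly in R memory. This assumes
# ENr334.maf is still unzipped in the working directory; the variable
# name m3 is ours, not part of the package.
m3 <- read.msa("ENr334.maf", pointer.only=TRUE)
# Accessors such as ncol.msa are expected to work on the pointer too
ncol.msa(m3)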

unlink(files) # clean up
