## Read FASTA (or FASTQ) files in an XStringSet object:
readBStringSet(filepath, format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE)
readDNAStringSet(filepath, format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE)
readRNAStringSet(filepath, format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE)
readAAStringSet(filepath, format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE)
## Extract basic information about FASTA (or FASTQ) files
## without actually loading the sequence data:
fasta.seqlengths(filepath, nrec=-1L, skip=0L, seek.first.rec=FALSE, seqtype="B", use.names=TRUE)
fasta.index(filepath, nrec=-1L, skip=0L, seek.first.rec=FALSE, seqtype="B")
fastq.geometry(filepath, nrec=-1L, skip=0L, seek.first.rec=FALSE)
## Write an XStringSet object to a FASTA (or FASTQ) file:
writeXStringSet(x, filepath, append=FALSE, compress=FALSE, compression_level=NA, format="fasta", ...)
## Serialize an XStringSet object:
saveXStringSet(x, objname, dirpath=".", save.dups=FALSE, verbose=TRUE)
Note that special values like ""
or "|cmd"
(typically
supported by other I/O functions in R) are not supported here.
Also filepath
cannot be a connection.
"fasta"
(the default) or "fastq"
.
TRUE
or FALSE
(the default). If TRUE
, then the
reading function starts by setting the file position indicator at the
beginning of the first line in the file that looks like the beginning of
a FASTA (if format
is "fasta"
) or FASTQ (if format
is "fastq"
) record. More precisely this is the first line in the
file that starts with a '>' (for FASTA) or a '@' (for FASTQ). An error
is raised if no such line is found.Normal parsing then starts from there, and everything happens like if the file actually started there. In particular it will be an error if this first record is not a valid FASTA or FASTQ record.
Using seek.first.rec=TRUE
is useful for example to parse GFF3
files with embedded FASTA data.
"B"
for anything i.e. any letter is a valid one-letter
sequence code.
"DNA"
for DNA sequences i.e. only letters in
DNA_ALPHABET
(case ignored) are valid
one-letter sequence codes.
"RNA"
for RNA sequences i.e. only letters in
RNA_ALPHABET
(case ignored) are valid
one-letter sequence codes.
"AA"
for Amino Acid sequences. Currently treated as
"B"
but this will change in the near future i.e. only
letters in AA_ALPHABET
(case ignored) will be
valid one-letter sequence codes.
Invalid one-letter sequence codes are ignored with a warning.
writeXStringSet
, the object to write to file
. For saveXStringSet
, the object to serialize.
TRUE
or FALSE
. If TRUE
output will be
appended to file
; otherwise, it will overwrite the contents
of file
. See ?cat
for the details.
save
function in base R, must be TRUE
or
FALSE
(the default), or a single string specifying whether writing
to the file is to use compression.
The only type of compression supported at the moment is "gzip"
. Passing TRUE
is equivalent to passing "gzip"
.
format="fasta"
, the width
argument (single integer)
can be used to specify the maximum number of letters per line of
sequence.
If format="fastq"
, the qualities
argument (BStringSet
object) can be used to specify the qualities. If the qualities are
omitted, then the fake quality ';' is assigned to each letter in
x
and written to the file.
TRUE
or FALSE
.
If TRUE
then the Dups
object describing
how duplicated elements in x
are related to each other is
saved too. For advanced users only.
TRUE
or FALSE
.
readDNAStringSet
and family (i.e. readBStringSet
,
readDNAStringSet
, readRNAStringSet
and readAAStringSet
)
load sequences from an input file (or multiple input files) into an
XStringSet object. When multiple input files are specified,
all must have the same format (i.e. FASTA or FASTQ) and files with
different compression types can be mixed with non-compressed files.
The files are read in the order they were specified and the sequences
are stored in the returned object in the order they were read.
Only FASTA and FASTQ files are supported for now. The read qualities
stored in FASTQ files are ignored by readDNAStringSet
and family.
When multiple input FASTQ files are specified, all must have the same
"width" (i.e. all their sequences must have the same length).
The fasta.seqlengths
utility returns an integer vector with one
element per FASTA record in the input files. Each element is the length
of the sequence found in the corresponding record, that is, the number of
valid one-letter sequence codes in the record. See description of the
seqtype
argument above for how to control the set of valid
one-letter sequence codes.
The fasta.index
utility returns a data frame with 1 row per
FASTA record in the input files and the following columns:
recno
: The rank of the record in the (virtually) concatenated
input files.
fileno
: The rank of the file where the record is located.
offset
: The offset of the record relative to the start of the
file where it's located. Measured in bytes.
desc
: The description line (a.k.a. header) of the record.
seqlength
: The length of the sequence in the record (not
counting invalid letters).
filepath
: The path to the file where the record is located.
Always a local file, so if the user specified a remote file, this
column will contain the path to the downloaded file.
A subset of this data frame can be passed to readDNAStringSet
and family for direct access to an arbitrary subset of sequences. More
precisely, if fai
is a FASTA index that was obtained with
fasta.index(filepath, ..., seqtype="DNA")
, then
readDNAStringSet(fai[i, ])
is equivalent to
readDNAStringSet(filepath, ...)[i]
for any valid subscript i
,
except that the former only loads the requested sequences in memory
and thus will be more memory efficient if only a small subset of sequences
is requested.
The fastq.geometry
utility returns an integer vector describing
the "geometry" of the FASTQ files i.e. a vector of length 2 where the
first element is the total number of FASTQ records in the files and
the second element the common "width" of these files (this width is
NA
if the files contain no FASTQ records or records with
different widths).
writeXStringSet
writes an XStringSet object to a file.
Like with readDNAStringSet
and family, only FASTA and FASTQ
files are supported for now.
WARNING: Please be aware that using writeXStringSet
on a
BStringSet object that contains the '\n' (LF) or '\r' (CR)
characters or the FASTA markup characters '>' or ';' is almost
guaranteed to produce a broken FASTA file!
Serializing an XStringSet object with saveXStringSet
is equivalent to using the standard save
mechanism. But it will
try to reduce the size of x
in memory first before calling
save
. Most of the times this leads to a much reduced size on disk.
## ---------------------------------------------------------------------
## A. READ/WRITE FASTA FILES
## ---------------------------------------------------------------------
## Read a non-compressed FASTA files:
filepath1 <- system.file("extdata", "someORF.fa", package="Biostrings")
fasta.seqlengths(filepath1, seqtype="DNA")
x1 <- readDNAStringSet(filepath1)
x1
## Read a gzip-compressed FASTA file:
filepath2 <- system.file("extdata", "someORF.fa.gz", package="Biostrings")
fasta.seqlengths(filepath2, seqtype="DNA")
x2 <- readDNAStringSet(filepath2)
x2
## Sanity check:
stopifnot(identical(as.character(x1), as.character(x2)))
## Read 2 FASTA files at once:
filepath3 <- system.file("extdata", "fastaEx.fa", package="Biostrings")
fasta.seqlengths(c(filepath2, filepath3), seqtype="DNA")
x23 <- readDNAStringSet(c(filepath2, filepath3))
x23
## Sanity check:
x3 <- readDNAStringSet(filepath3)
stopifnot(identical(as.character(x23), as.character(c(x2, x3))))
## Use a FASTA index to load only an arbitrary subset of sequences:
filepath4 <- system.file("extdata", "dm3_upstream2000.fa.gz",
package="Biostrings")
fai <- fasta.index(filepath4, seqtype="DNA")
head(fai)
head(fai$desc)
i <- sample(nrow(fai), 10) # randomly pick up 10 sequences
x4 <- readDNAStringSet(fai[i, ])
## Sanity check:
stopifnot(identical(as.character(readDNAStringSet(filepath4)[i]),
as.character(x4)))
## Write FASTA files:
out23a <- tempfile()
writeXStringSet(x23, out23a)
out23b <- tempfile()
writeXStringSet(x23, out23b, compress=TRUE)
file.info(c(out23a, out23b))$size
## Sanity checks:
stopifnot(identical(as.character(readDNAStringSet(out23a)),
as.character(x23)))
stopifnot(identical(readLines(out23a), readLines(out23b)))
## ---------------------------------------------------------------------
## B. READ/WRITE FASTQ FILES
## ---------------------------------------------------------------------
filepath <- system.file("extdata", "s_1_sequence.txt",
package="Biostrings")
fastq.geometry(filepath)
readDNAStringSet(filepath, format="fastq")
library(BSgenome.Celegans.UCSC.ce2)
## Create a "sliding window" on chr I:
sw_start <- seq.int(1, length(Celegans$chrI)-50, by=50)
sw <- Views(Celegans$chrI, start=sw_start, width=10)
my_fake_shortreads <- as(sw, "XStringSet")
my_fake_ids <- sprintf("ID%06d", seq_len(length(my_fake_shortreads)))
names(my_fake_shortreads) <- my_fake_ids
my_fake_shortreads
## Fake quality ';' will be assigned to each base in 'my_fake_shortreads':
out2 <- tempfile()
writeXStringSet(my_fake_shortreads, out2, format="fastq")
## Passing qualities thru the 'qualities' argument:
my_fake_quals <- rep.int(BStringSet("DCBA@?>=<;"),
length(my_fake_shortreads))
my_fake_quals
out3 <- tempfile()
writeXStringSet(my_fake_shortreads, out3, format="fastq",
qualities=my_fake_quals)
## ---------------------------------------------------------------------
## C. SERIALIZATION
## ---------------------------------------------------------------------
saveXStringSet(my_fake_shortreads, "my_fake_shortreads", dirpath=tempdir())
Run the code above in your browser using DataLab