Read DNA Sequences in a File
This function reads DNA sequences from a file and returns a matrix or a
list of DNA sequences, with the names of the taxa read from the file as
rownames or names, respectively. By default, the sequences are stored
in binary format; otherwise (if as.character = TRUE), they are stored
in lower case.
read.dna(file, format = "interleaved", skip = 0, nlines = 0, comment.char = "#", seq.names = NULL, as.character = FALSE)
- file: a file name specified by either a variable of mode character or a double-quoted string.
- format: a character string specifying the format of the DNA sequences. Three choices are possible: "interleaved", "sequential", or "fasta", or any unambiguous abbreviation of these.
- skip: the number of lines of the input file to skip before beginning to read data.
- nlines: the number of lines to be read (by default the file is read until its end).
- comment.char: a single character; the remainder of the line after this character is ignored.
- seq.names: the names to give to each sequence; by default the names read in the file are used.
- as.character: a logical controlling whether to return the sequences in character mode (TRUE) instead of the default binary format (FALSE).
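As an illustration of the skip and as.character arguments, the following sketch writes a small sequential-format file preceded by two free-text lines and reads it back. It assumes that read.dna is provided by the ape package and that this package is installed; the temporary file and its two preamble lines are made up for the example.

```r
# Hypothetical input: two free-text lines, then a sequential-format block.
f <- tempfile(fileext = ".txt")
cat("exported by an imaginary pipeline",
    "these two lines are not part of the alignment",
    "3 40",
    "No305 NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
    "No304 ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
    "No306 ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
    file = f, sep = "\n")

# Assumption: read.dna() comes from the ape package.
if (requireNamespace("ape", quietly = TRUE)) {
  # skip = 2 discards the two free-text lines before the dimensions line.
  x <- ape::read.dna(f, format = "sequential", skip = 2)
  print(dim(x))
  # as.character = TRUE asks for character mode instead of binary storage.
  y <- ape::read.dna(f, format = "sequential", skip = 2, as.character = TRUE)
  print(mode(y))
}

unlink(f)  # clean-up
```

The call is guarded by requireNamespace() so that the sketch does nothing, rather than failing, when the package is not available.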
This function follows the interleaved and sequential formats defined in PHYLIP (Felsenstein, 1993), with the original feature that there is no restriction on the lengths of the taxa names (though a data file with 10-character-long taxa names is fine as well). For these two formats, the first line of the file must contain the dimensions of the data (the number of taxa and the number of nucleotides); the sequences are considered aligned and thus must be of the same length for all taxa. For the FASTA format, the conventions defined in the URL below (see References) are followed; the sequences are taken as non-aligned. For all formats, the nucleotides can be arranged in any way, with blanks and line breaks inside (with the restriction that the first ten nucleotides must be contiguous for the interleaved and sequential formats, see below). The names of the sequences are read from the file unless the `seq.names' option is used. Particularities for each format are detailed below.
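The header rule for the interleaved and sequential formats (the first line gives the number of taxa and the number of nucleotides, and all sequences share one length) can be checked by hand with a few lines of base R. This is only an illustrative sketch, not how read.dna parses files: it assumes a sequential file in which each taxon occupies a single line and the taxon name contains no blanks. The helper name check_sequential_dims is hypothetical.

```r
# Hypothetical helper: compare the declared dimensions with the file contents.
check_sequential_dims <- function(path) {
  lines <- readLines(path)
  # First line: number of taxa, then number of nucleotides.
  dims <- scan(text = lines[1], what = integer(), quiet = TRUE)
  body <- lines[-1]
  # Split each record at the first run of blanks: name, then sequence.
  parts <- regmatches(body, regexpr("[[:space:]]+", body), invert = TRUE)
  taxa <- vapply(parts, function(p) p[1], character(1))
  # Blanks inside a sequence are allowed, so drop them before counting.
  seqs <- vapply(parts, function(p) gsub("[[:space:]]", "", p[2]),
                 character(1))
  list(taxa = taxa,
       ok = isTRUE(length(taxa) == dims[1] && all(nchar(seqs) == dims[2])))
}

f <- tempfile(fileext = ".txt")
writeLines(c("3 40",
             "No305 NTTCGAAAAA CACACCCACT ACTAAAANTT ATCAGTCACT",
             "No304 ATTCGAAAAA CACACCCACT ACTAAAAATT ATCAACCACT",
             "No306 ATTCGAAAAA CACACCCACT ACTAAAAATT ATCAATCACT"),
           f)
res <- check_sequential_dims(f)
unlink(f)  # clean-up
```

Here res$ok reports whether the declared dimensions match the contents, and res$taxa holds the names found at the start of each line.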
- a matrix or a list (if format = "fasta") of DNA sequences stored in binary format, or of mode character (if as.character = TRUE).
Anonymous. FASTA format description.
Anonymous. IUPAC ambiguity codes.
Felsenstein, J. (1993) Phylip (Phylogeny Inference Package) version
3.5c. Department of Genetics, University of Washington.
### a small extract from `data(woodmouse)'
cat("3 40",
    "No305 NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
    "No304 ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
    "No306 ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
    file = "exdna.txt", sep = "\n")
ex.dna <- read.dna("exdna.txt", format = "sequential")
str(ex.dna)
ex.dna
### the same data in interleaved format...
cat("3 40",
    "No305 NTTCGAAAAA CACACCCACT",
    "No304 ATTCGAAAAA CACACCCACT",
    "No306 ATTCGAAAAA CACACCCACT",
    "ACTAAAANTT ATCAGTCACT",
    "ACTAAAAATT ATCAACCACT",
    "ACTAAAAATT ATCAATCACT",
    file = "exdna.txt", sep = "\n")
ex.dna2 <- read.dna("exdna.txt")
### ... and in FASTA format
cat("> No305",
    "NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
    "> No304",
    "ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
    "> No306",
    "ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
    file = "exdna.txt", sep = "\n")
ex.dna3 <- read.dna("exdna.txt", format = "fasta")
### These are the same!
identical(ex.dna, ex.dna2)
identical(ex.dna, ex.dna3)
unlink("exdna.txt") # clean-up