util.fasta: Functions for Accessing FASTA Files

Description

Search the header lines of a FASTA file, read protein sequences from a file, and count the number of each amino acid in each sequence.

Usage

is.fasta(file)
grep.file(file, pattern = "", y = NULL, ignore.case = TRUE, 
  startswith = ">", lines = NULL, grep = "grep")
read.fasta(file, i = NULL, ret = "count", lines = NULL, 
  ihead = NULL, pnff = FALSE)
splitline(line, length)
trimfas(file, start, stop)

Arguments

file

character, path to FASTA file.

pattern

character, pattern to search for in header lines.

y

character, term to exclude when searching sequence headers.

ignore.case

logical, ignore differences between upper- and lower-case?

startswith

character, only lines starting with this expression are matched.

lines

list of character, supply the lines here instead of reading them from file.

grep

character, name of system grep command.

i

numeric, line numbers of sequence headers to read.

ret

character, type of value to return: count (amino acid counts), seq (sequences), or fas (lines in FASTA format).

ihead

numeric, which lines are headers.

pnff

logical, get the protein name from the filename?

line

character, a line to be split into multiple lines.

length

numeric, the maximum length of any line.

start

numeric, starting position to extract from sequences.

stop

numeric, last position to extract from sequences.

Value

grep.file returns a numeric vector holding the line numbers of the matching headers. read.fasta returns a list of sequences (for ret equal to seq), a list of lines (for ret equal to fas), or a data frame with amino acid compositions of proteins (for ret equal to count) with columns corresponding to those in thermo$protein.

Side Effects

None

Details

is.fasta checks if a file is in FASTA format. A very simple test is performed: if either of the first two lines of the file starts with >, then the function returns TRUE, otherwise it returns FALSE.
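The test described above can be sketched in base R. This is only an illustration of the stated rule, not the package's source; the helper name looks_like_fasta and the temporary files are made up for the example.

```r
# Hypothetical helper (not part of CHNOSZ): TRUE if either of the
# first two lines of the file starts with ">"
looks_like_fasta <- function(file) {
  first_two <- readLines(file, n = 2)
  any(startsWith(first_two, ">"))
}

# A small FASTA file and a plain text file, written to temporary locations
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">example protein", "ACDEFGHIKL"), fasta_file)
text_file <- tempfile(fileext = ".txt")
writeLines(c("not a header", "still not a header", ">too late"), text_file)

looks_like_fasta(fasta_file)
looks_like_fasta(text_file)
```

Because only the first two lines are inspected, a header that first appears on the third line or later is not detected, as the second file shows.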

grep.file is used to search for entries in a FASTA file. It returns the line numbers of the matching FASTA headers. It takes a search term in pattern and optionally a term to exclude in y. The ignore.case option is passed to grep, which does the work of finding lines that match. Only lines that start with the expression in startswith are searched; the default setting reflects the format of the header line for each sequence in a FASTA file.

If y is NULL and a supported operating system is identified, the operating system's grep command (or another command specified in the grep argument) is applied directly to the file instead of R's grep. This avoids having to read the file into R using readLines. If the lines of the file were obtained in a preceding operation, they can be supplied to this function in the lines argument.
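The search logic described above (restrict to header lines, match pattern, then drop headers that match y) can be sketched with R's own grep and grepl. The function find_headers and the sample lines are hypothetical; the sketch covers only the case where the lines are already in memory and does not call the operating system's grep.

```r
# Hypothetical helper (not the CHNOSZ implementation): return the line
# numbers of headers that match `pattern` and do not match `y`
find_headers <- function(lines, pattern = "", y = NULL,
                         ignore.case = TRUE, startswith = ">") {
  # line numbers of all lines that begin with the header marker
  ihead <- grep(paste0("^", startswith), lines)
  # headers that contain the search term
  hit <- grepl(pattern, lines[ihead], ignore.case = ignore.case)
  # optionally drop headers that contain the excluded term
  if (!is.null(y)) {
    hit <- hit & !grepl(y, lines[ihead], ignore.case = ignore.case)
  }
  ihead[hit]
}

# Made-up FASTA lines for illustration
lines <- c(">P1 elongation factor Tu [Escherichia coli]", "MSKEK",
           ">P2 elongation factor Tu [Thermus]", "MAKGE",
           ">P3 ribosomal protein", "MKV")

find_headers(lines, "elongation")
find_headers(lines, "elongation", y = "thermus")
```

With the default empty pattern, every header line matches, so the function returns the line numbers of all headers.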

read.fasta is used to retrieve entries from a FASTA file. The line numbers of the headers of the desired sequences are passed to the function in i (they can be generated using grep.file). The format of the return value depends on ret: the default, count, returns a data frame of amino acid counts (the data frame can be given to add.protein in order to add the proteins to thermo$protein); seq returns a list of sequences; and fas returns a list of lines extracted from the FASTA file, including the headers (this can be used, e.g., to generate a new FASTA file with only the selected sequences).

Like grep.file, this function uses the operating system's grep on supported operating systems to identify the header lines, and cat to read the file; otherwise, readLines and R's substr are used to read the file and locate the header lines. If lines is given, the file is not read and the operating system's tools are not used. If the line numbers of the header lines were previously determined, they can be supplied in ihead.
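As a rough illustration of what retrieving entries by header line number involves, the following base R sketch joins the lines between one header and the next, then counts the 20 standard amino acids. The helpers extract_sequences and count_aa and the sample lines are hypothetical; the data frame returned by read.fasta has additional columns that are not reproduced here.

```r
# Hypothetical helper (not the CHNOSZ implementation): return the
# sequences whose header lines are at line numbers `i`
extract_sequences <- function(lines, i = NULL) {
  ihead <- which(startsWith(lines, ">"))
  if (is.null(i)) i <- ihead          # default: all entries
  stopifnot(all(i %in% ihead))        # `i` must point at header lines
  # last line of each entry: the line before the next header, or end of file
  last <- c(ihead[-1] - 1, length(lines))
  vapply(i, function(h) {
    end <- last[match(h, ihead)]
    body <- if (end > h) lines[(h + 1):end] else character(0)
    paste(body, collapse = "")
  }, character(1))
}

# Hypothetical helper: count the 20 standard amino acids in one sequence;
# other characters (such as alignment gaps) are ignored
count_aa <- function(sequence) {
  aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
          "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  chars <- strsplit(toupper(sequence), "")[[1]]
  vapply(aa, function(a) sum(chars == a), integer(1))
}

# Made-up FASTA lines for illustration
lines <- c(">seq1 first", "ACDE", "FGAA", ">seq2 second", "MK--LV")
seqs <- extract_sequences(lines)          # both entries
one <- extract_sequences(lines, i = 4)    # only the entry headed by line 4
counts <- count_aa(seqs[1])
```

Passing the lines directly, rather than a file path, mirrors the role of the lines argument: the file does not need to be read again if its contents are already available.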

splitline takes a single character object (the line) and splits it into multiple lines of the given length (the last line can be shorter than this). It returns a character object that contains the lines. This function is utilized by trimfas, which extracts the specified positions from a (usually) aligned FASTA file. The length of the lines output by trimfas is equal to the length of the first sequence line in the given file.
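The two operations described above, extracting a range of positions and re-wrapping the result to a fixed line length, can be sketched in base R with substr and substring. The helper split_fixed and the sequence fragment are made up for illustration and are not the package's code.

```r
# Hypothetical helper (not the CHNOSZ implementation): split one string
# into pieces of at most `width` characters; the last piece may be shorter
split_fixed <- function(line, width) {
  n <- nchar(line)
  if (n == 0) return(line)
  starts <- seq(1, n, by = width)
  substring(line, starts, pmin(starts + width - 1, n))
}

pieces <- split_fixed("ACDEFGHIKLMNPQRSTVWYX", 10)

# Extract positions 3 to 11 of a made-up sequence, then re-wrap to width 5
trimmed <- substr("MSKEKFERTKPHVNVGTIGH", 3, 11)
wrapped <- split_fixed(trimmed, 5)
```

For the 21-character input, this yields two pieces of 10 characters and a final piece of 1 character, which matches the three lines expected from splitline in the Examples.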

Examples


# load the package that provides these functions
library(CHNOSZ)

## basic use of splitline
(AA21 <- splitline("ACDEFGHIKLMNPQRSTVWYX", 10))
stopifnot(length(AA21)==3)

## reading a protein FASTA file
# the path to the file
file <- system.file("extdata/fasta/EF-Tu.aln", package="CHNOSZ")
# read the sequences, and print the first one
(seq <- read.fasta(file, ret="seq"))[[1]]
# count the amino acids in the sequences
(aa <- read.fasta(file))[1,]
stopifnot(protein.length(aa[1,])==nchar(seq[[1]]))
# extract characters 3-11 in the sequences
seqtrim <- trimfas(file, 3, 11)
# trimfas keeps all lines including the headers
# so first sequence is the second element of the vector
stopifnot(seqtrim[2]==substr(seq[[1]], 3, 11))
