util.fasta: Functions for Reading FASTA Files and Downloading from UniProt

Description

Search the header lines of a FASTA file, read protein sequences from a file, count numbers of amino acids in each sequence, and download sequences from UniProt.

Usage

grep.file(file, pattern = "", y = NULL, ignore.case = TRUE,
    startswith = ">", lines = NULL, grep = "grep")
read.fasta(file, i = NULL, ret = "count", lines = NULL,
    ihead = NULL, pnff = FALSE, start = NULL, stop = NULL)
count.aa(seq, start = NULL, stop = NULL)
uniprot.aa(protein, start = NULL, stop = NULL)

Arguments

file

character, path to FASTA file

pattern

character, pattern to search for in header lines

y

character, term to exclude in searching sequence headers

ignore.case

logical, ignore differences between upper- and lower-case?

startswith

character, only lines starting with this expression are matched

lines

list of character, supply the lines here instead of reading them from file

grep

character, name of system grep command

i

numeric, line numbers of sequence headers to read

ret

character, specification for type of return: count (amino acid counts), seq (sequences), or fas (lines in FASTA format)

ihead

numeric, which lines are headers

pnff

logical, get the protein name from the filename?

start

numeric, position in sequence to start counting

stop

numeric, position in sequence to stop counting

seq

character, amino acid sequence of a protein

protein

character, entry name for protein in UniProt

Value

grep.file returns a numeric vector of line numbers. read.fasta returns a list of sequences or lines (for ret equal to seq or fas, respectively), or a data frame with amino acid compositions of proteins (for ret equal to count) with columns corresponding to those in thermo$protein. count.aa and uniprot.aa each return a data frame of amino acid composition.

Details

grep.file returns the line numbers of header lines in a FASTA file. Matching header lines are those that contain the search term pattern and, optionally, do not contain the term to exclude given in y. The ignore.case option is passed to grep, which does the work of finding lines that match. Only lines that start with the expression in startswith are searched; the default setting reflects the format of the header lines in a FASTA file. If y is NULL and a supported operating system is identified, the operating system's grep command (or another command specified in the grep argument) is applied directly to the file instead of R's grep. This avoids having to read the file into R using readLines. If the lines from the file were obtained in a preceding operation, they can be supplied to this function in the lines argument.
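The header search described above can be sketched in base R. This is a minimal illustration for lines already held in memory, not the code used by grep.file; the helper name find_headers and the toy FASTA lines are hypothetical.

```r
# Hypothetical helper (not part of CHNOSZ): locate header lines in memory.
find_headers <- function(lines, pattern = "", y = NULL,
                         ignore.case = TRUE, startswith = ">") {
  # line numbers of all lines that start with the header marker
  ihead <- grep(paste0("^", startswith), lines)
  # keep the headers that contain the search term
  keep <- grepl(pattern, lines[ihead], ignore.case = ignore.case)
  # optionally drop the headers that contain the excluded term
  if (!is.null(y)) {
    keep <- keep & !grepl(y, lines[ihead], ignore.case = ignore.case)
  }
  ihead[keep]
}

# made-up FASTA content, for illustration only
lines <- c(">seq1 elongation factor, organism A",
           "MSKEKFERTK",
           ">seq2 Elongation Factor, organism B",
           "MAKEKFVRTK",
           ">seq3 aminotransferase, organism A",
           "MASSTGDRSQ")
find_headers(lines, "elongation")                   # headers on lines 1 and 3
find_headers(lines, "elongation", y = "organism A") # header on line 3 only
```

Because the search is restricted to lines matching startswith, a search term that also occurs inside a sequence line does not produce a false match.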

read.fasta is used to retrieve entries from a FASTA file. To read only selected sequences, pass the line numbers of the header lines to the function in i (they can be identified using e.g. grep.file). The return format depends on the value of ret: the default, count, returns a data frame of amino acid counts (which can be given to add.protein in order to add the proteins to thermo$protein); seq returns a list of sequences; and fas returns a list of lines extracted from the FASTA file, including the headers (this can be used e.g. to generate a new FASTA file with only the selected sequences). Like grep.file, this function uses the operating system's grep on supported operating systems to identify the header lines, and cat to read the file; otherwise, readLines and R's substr are used to read the file and locate the header lines. If the line numbers of the header lines were previously determined, they can be supplied in ihead. Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed, so file should be set to "").
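To illustrate how header line numbers select entries, here is a rough base-R sketch of the sequence-extraction step (comparable in spirit to ret = "seq"). It is not the implementation of read.fasta; the helper name extract_seqs and the toy lines are hypothetical.

```r
# Hypothetical helper (not part of CHNOSZ): join the sequence lines that
# follow each selected header line.
extract_seqs <- function(lines, i = NULL, startswith = ">") {
  # line numbers of all header lines
  ihead <- grep(paste0("^", startswith), lines)
  # by default, read every entry
  if (is.null(i)) i <- ihead
  stopifnot(all(i %in% ihead))
  # each entry ends just before the next header, or at the last line
  iend <- c(ihead[-1] - 1L, length(lines))
  lapply(i, function(h) {
    k <- match(h, ihead)
    # seq_len() yields no lines for a header without a sequence
    paste(lines[h + seq_len(iend[k] - h)], collapse = "")
  })
}

# made-up FASTA content with a sequence split over two lines
fasta_lines <- c(">seq1 first entry",
                 "MSKEK",
                 "FERTK",
                 ">seq2 second entry",
                 "MAKEKFVRTK")
extract_seqs(fasta_lines)         # both sequences
extract_seqs(fasta_lines, i = 4)  # only the entry whose header is on line 4
```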

count.aa counts the occurrences of each amino acid in a sequence (seq), returning a data frame with amino acids in the same order as thermo$protein. It is not case-sensitive. A warning is generated if any character in seq, excluding spaces, is not one of the single-letter amino acid abbreviations. start and/or stop can be provided to count amino acids in a fragment of the sequence (extracted using substr). If only one of start or stop is present, the other defaults to 1 (start) or the length of the sequence (stop).
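The counting rules above, including the defaults for start and stop, can be sketched as follows. This is an illustration under stated assumptions, not the code of count.aa: the helper name count_residues is hypothetical, the columns are simply the 20 one-letter codes in alphabetical order, and spaces are removed before the fragment is extracted.

```r
# Hypothetical helper (not part of CHNOSZ): count amino acids in a sequence.
count_residues <- function(seq, start = NULL, stop = NULL) {
  # one-letter codes of the 20 standard amino acids (assumed column order)
  aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
          "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  # not case-sensitive; spaces are dropped (an assumption of this sketch)
  seq <- toupper(gsub(" ", "", seq))
  # a missing start or stop defaults to the beginning or end of the sequence
  if (is.null(start)) start <- 1
  if (is.null(stop)) stop <- nchar(seq)
  chars <- strsplit(substr(seq, start, stop), "")[[1]]
  # warn about characters that are not amino acid abbreviations
  unknown <- setdiff(chars, aa)
  if (length(unknown) > 0) {
    warning("unrecognized characters: ", paste(unknown, collapse = " "))
  }
  counts <- vapply(aa, function(a) sum(chars == a), integer(1))
  # one-row data frame with one column per amino acid
  as.data.frame(t(counts))
}

count_residues("GGSGG")             # 4 in column G, 1 in column S
count_residues("GGSGG", start = 2)  # counts for the fragment "GSGG"
```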

uniprot.aa returns a data frame of amino acid composition, in the format of thermo$protein, retrieved from the protein sequence if it is available from UniProt (http://uniprot.org; The UniProt Consortium, 2012). The protein argument corresponds to the Entry name on the UniProt search pages.

References

The UniProt Consortium (2012) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 40, D71-D75. http://dx.doi.org/10.1093/nar/gkr981

Examples

## reading a protein FASTA file
# the path to the file
file <- system.file("extdata/fasta/EF-Tu.aln", package="CHNOSZ")
# read the sequences, and print the first one
read.fasta(file, ret="seq")[[1]]
# count the amino acids in the sequences
aa <- read.fasta(file)
# compute lengths (number of amino acids)
protein.length(aa)

# download amino acid composition of a protein
# start at position 2 to remove the initiator methionine
aa <- uniprot.aa("ALAT1_HUMAN", start=2)
# add it to thermo$protein
ip <- add.protein(aa)
# now it's possible to calculate some properties
protein.length(ip)
protein.formula(ip)
subcrt("ALAT1_HUMAN", c("cr", "aq"), c(-1, 1))
# the amino acid composition can be saved for future use
write.csv(aa, "saved.aa.csv", row.names=FALSE)
# in another R session, the protein can be loaded without using uniprot.aa()
aa <- read.aa("saved.aa.csv")
add.protein(aa)

## count amino acids in a sequence
count.aa("GGSGG")
# warnings are issued for unrecognized characters
atest <- count.aa("WhatAmIMadeOf?")
# there are 3 "A" (alanine)
stopifnot(atest[, "A"]==3)
