Search the header lines of a FASTA file, read protein sequences from a file, count numbers of amino acids in each sequence, and download sequences from UniProt.
read.fasta(file, iseq = NULL, ret = "count", lines = NULL,
ihead = NULL, start=NULL, stop=NULL, type="protein", id = NULL)
count.aa(seq, start=NULL, stop=NULL, type="protein")
uniprot.aa(protein, start=NULL, stop=NULL)
file: character, path to FASTA file
iseq: numeric, which sequences to read from the file
ret: character, specification for type of return (count, sequence, or FASTA format)
lines: list of character, supply the lines here instead of reading them from file
ihead: numeric, which lines are headers
start: numeric, position in sequence to start counting
stop: numeric, position in sequence to stop counting
type: character, sequence type (protein or DNA)
id: character, value to be used for protein in output table
seq: character, amino acid sequence of a protein
protein: character, entry name for protein in UniProt
read.fasta returns a list of sequences or lines (for ret equal to seq or fas, respectively), or a data frame with amino acid compositions of proteins (for ret equal to count) with columns corresponding to those in thermo$protein.
read.fasta is used to retrieve entries from a FASTA file.
Use iseq to select the sequences to read (the default is all sequences).
The function returns various formats depending on the value of ret.
The default, count, returns a data frame of amino acid counts (the data frame can be given to add.protein in order to add the proteins to thermo$protein), seq returns a list of sequences, and fas returns a list of lines extracted from the FASTA file, including the headers (this can be used e.g. to generate a new FASTA file with only the selected sequences).
On supported operating systems, this function uses the OS's grep to identify the header lines and cat to read the file; otherwise, readLines and R's substr are used to read the file and locate the header lines.
If the line numbers of the header lines were previously determined, they can be supplied in ihead.
Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed, so file should be set to "").
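The fallback path described above (locating header lines with substr) can be sketched in base R. This is only an illustration of the idea, not the code used inside read.fasta; the FASTA lines are made up, and the variable names first, last and seqs are hypothetical.

```r
# Lines of a small, made-up FASTA file (hypothetical data).
lines <- c(">seq1 first example", "MKV", "LLA", ">seq2 second example", "GGSGG")

# Header lines start with ">"; substr() picks out the first character of each line.
ihead <- which(substr(lines, 1, 1) == ">")

# Each sequence runs from the line after its header to the line before the
# next header (or to the end of the file for the last sequence).
# This sketch assumes every header is followed by at least one sequence line.
first <- ihead + 1
last <- c(ihead[-1] - 1, length(lines))
seqs <- mapply(function(s, e) paste(lines[s:e], collapse = ""), first, last)

# Select a subset of sequences by index, analogous to the iseq argument.
iseq <- 2
seqs[iseq]
```

Computing ihead once and reusing it is the same idea as passing previously determined header positions through the ihead argument.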
When ret is count, the names of the proteins in the resulting data frame are parsed from the header lines of the file, unless id is provided.
If id is not given, and a UniProt FASTA header is detected (regular expression "\|......\|.*_"), information there (accession, name, organism) is split into the protein, abbrv, and organism columns of the resulting data frame.
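As a rough illustration of the header detection described above, the following base R sketch applies the same regular expression to a header written in UniProt style and splits out the three pieces of information. The function name parse_uniprot_header and the example header are hypothetical, and how read.fasta itself assigns the pieces to data frame columns is not shown here.

```r
# Hypothetical helper: extract accession, name and organism from a
# UniProt-style FASTA header, or return NULL if the pattern is not found.
parse_uniprot_header <- function(header) {
  # Same pattern as in the text; backslashes are doubled inside an R string.
  if (!grepl("\\|......\\|.*_", header)) return(NULL)
  # Fields are separated by "|": database, accession, entry name + description.
  fields <- strsplit(header, "|", fixed = TRUE)[[1]]
  accession <- fields[2]
  # The entry name is the first word of the third field, e.g. "ALAT1_HUMAN".
  entry <- strsplit(fields[3], " ", fixed = TRUE)[[1]][1]
  parts <- strsplit(entry, "_", fixed = TRUE)[[1]]
  c(accession = accession, name = parts[1], organism = parts[2])
}

# Illustrative header (assumed format, trimmed for brevity).
header <- ">sp|P24298|ALAT1_HUMAN Alanine aminotransferase 1 OS=Homo sapiens"
parse_uniprot_header(header)

# A header without the UniProt pattern is left for other handling.
parse_uniprot_header(">my_protein some description")
```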
count.aa counts the occurrences of each amino acid or nucleic-acid base in a sequence (seq).
For amino acids, the columns in the returned data frame are in the same order as thermo$protein.
Letters are matched without regard for case.
A warning is generated if any character in seq, excluding spaces, is not one of the single-letter amino acid or nucleobase abbreviations.
start and/or stop can be provided to count a fragment of the sequence (extracted using substr).
If only one of start or stop is present, the other defaults to 1 (start) or the length of the sequence (stop).
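The counting rules above (case-insensitive matching, a warning for unrecognized characters, and the start/stop defaults) can be sketched in base R for the protein case. This is a simplified stand-in, not the count.aa implementation: the name count_letters is hypothetical, and the column order used here (alphabetical by one-letter code) is an assumption rather than something taken from thermo$protein.

```r
# Hypothetical, simplified counter for the 20 standard amino acids.
count_letters <- function(seq, start = NULL, stop = NULL) {
  # A missing start defaults to 1; a missing stop defaults to the sequence length.
  if (is.null(start)) start <- 1
  if (is.null(stop)) stop <- nchar(seq)
  # Extract the fragment and ignore case.
  frag <- toupper(substr(seq, start, stop))
  chars <- strsplit(frag, "")[[1]]
  # Spaces are skipped without a warning.
  chars <- chars[chars != " "]
  # One-letter codes, alphabetical (assumed order for this sketch).
  aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
          "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  unknown <- setdiff(chars, aa)
  if (length(unknown) > 0) {
    warning("unrecognized characters: ", paste(unknown, collapse = " "))
  }
  counts <- vapply(aa, function(a) sum(chars == a), integer(1))
  # One row, one column per amino acid.
  as.data.frame(as.list(counts))
}

# Count a whole sequence, then a fragment that skips the first residue.
count_letters("GGSGG")
count_letters("MGGSGG", start = 2)
```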
uniprot.aa returns a data frame of amino acid composition, in the format of thermo$protein, retrieved from the protein sequence if it is available from UniProt (http://uniprot.org).
The protein argument corresponds to the Entry name on the UniProt search pages.
seq2aa, like count.aa, counts amino acids in a user-input sequence, but returns a data frame in the format of thermo$protein.
See nucleic.formula for an example of counting nucleobases in a DNA sequence.
# NOT RUN {
## reading a protein FASTA file
# the path to the file
file <- system.file("extdata/fasta/EF-Tu.aln", package="CHNOSZ")
# read the sequences, and print the first one
read.fasta(file, ret="seq")[[1]]
# count the amino acids in the sequences
aa <- read.fasta(file)
# compute lengths (number of amino acids)
protein.length(aa)
# }
# NOT RUN {
# download amino acid composition of a protein
# start at position 2 to remove the initiator methionine
aa <- uniprot.aa("ALAT1_HUMAN", start=2)
# add it to thermo$protein
ip <- add.protein(aa)
# now it's possible to calculate some properties
protein.length(ip)
protein.formula(ip)
subcrt("ALAT1_HUMAN", c("cr", "aq"), c(-1, 1))
# the amino acid composition can be saved for future use
write.csv(aa, "saved.aa.csv", row.names=FALSE)
# in another R session, the protein can be loaded without using uniprot.aa()
aa <- read.csv("saved.aa.csv", as.is=TRUE)
add.protein(aa)
## count amino acids in a sequence
count.aa("GGSGG")
# warnings are issued for unrecognized characters
atest <- count.aa("WhatAmIMadeOf?")
# there are 3 "A" (alanine)
stopifnot(atest[, "A"]==3)
# }