grep.file returns the line numbers of header lines in a FASTA file. Matching header lines are those that contain the search term in pattern and, optionally, do not contain the term to exclude in y.
The ignore.case option is passed to grep, which does the work of finding lines that match. Only lines that start with the expression in startswith are searched; the default setting reflects the format of the header lines in a FASTA file.
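
The matching rule described above can be sketched in base R. This is only an illustration, not the package's implementation: find_headers is a hypothetical helper, the toy headers are made up, and treating startswith as a literal prefix (rather than a regular expression) is an assumption.

```r
# Hypothetical helper (not the package's code): return the line numbers of
# header lines that contain `pattern` and do not contain `y`.
find_headers <- function(lines, pattern = "", y = NULL,
                         ignore.case = TRUE, startswith = ">") {
  # Restrict the search to lines that begin with `startswith`
  # (assumed here to be a literal prefix)
  ihead <- which(substr(lines, 1, nchar(startswith)) == startswith)
  # Keep headers that match the search term
  keep <- grepl(pattern, lines[ihead], ignore.case = ignore.case)
  # Optionally drop headers that also match the exclusion term
  if (!is.null(y)) {
    keep <- keep & !grepl(y, lines[ihead], ignore.case = ignore.case)
  }
  ihead[keep]
}

# Toy FASTA content, invented for this example
fasta <- c(
  ">prot1 Lysozyme chicken",
  "KVFGRCELAA",
  ">prot2 Lysozyme human",
  "KVFERCELAR",
  ">prot3 Ribonuclease bovine",
  "KETAAAKFER"
)

find_headers(fasta, pattern = "lysozyme")               # 1 3
find_headers(fasta, pattern = "lysozyme", y = "human")  # 1
```

Because the sketch works on lines already held in memory, it corresponds to the case where R's grep is used rather than the operating system's.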
If y is NULL and a supported operating system is identified, the operating system's grep function (or another one specified in the grep argument) is applied directly to the file instead of R's grep. This avoids having to read the file into R using readLines.
If the lines from the file were obtained in a preceding operation, they can be supplied to this function in the lines argument.

read.fasta is used to retrieve entries from a FASTA file.
To read only selected sequences, pass the line numbers of the header lines to the function in i (they can be identified using e.g. grep.file).
The format of the returned value depends on ret. The default, count, returns a data frame of amino acid counts (the data frame can be given to add.protein in order to add the proteins to thermo$protein); seq returns a list of sequences; and fas returns a list of lines extracted from the FASTA file, including the headers (this can be used e.g. to generate a new FASTA file with only the selected sequences).
Similarly to grep.file, this function uses the OS's grep on supported operating systems to identify the header lines, and cat to read the file; otherwise, readLines and R's substr are used to read the file and locate the header lines.
If the line numbers of the header lines were previously determined, they can be supplied in ihead. Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed, so file should be set to "").
When ret is count, the names of the proteins in the resulting data frame are parsed from the header lines of the file, unless id is provided.
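
To illustrate how selected entries can be pulled out by header line number, as described above for i and the seq and fas return formats, here is a rough base R sketch. It is not the package's code: extract_entries is a hypothetical helper, the toy lines are made up, and returning one list element per selected entry is an assumption about the layout.

```r
# Hypothetical helper (not the package's code): extract selected entries from
# FASTA lines that are already in memory.
extract_entries <- function(lines, i, ret = c("seq", "fas")) {
  ret <- match.arg(ret)
  # All header lines, used to find where each selected entry ends
  ihead <- which(startsWith(lines, ">"))
  lapply(i, function(h) {
    # An entry runs to the line before the next header, or to the last line
    nxt <- ihead[ihead > h]
    last <- if (length(nxt) > 0) nxt[1] - 1 else length(lines)
    if (ret == "fas") {
      lines[h:last]                              # header plus sequence lines
    } else if (last > h) {
      paste(lines[(h + 1):last], collapse = "")  # sequence as one string
    } else {
      ""                                         # header with no sequence
    }
  })
}

# Toy FASTA content, invented for this example
fasta <- c(">prot1 toy entry", "ACDE", "FGHI", ">prot2 toy entry", "KLMN")

extract_entries(fasta, i = 4, ret = "seq")  # list("KLMN")
extract_entries(fasta, i = 1, ret = "fas")  # header line and its two sequence lines
```

Writing the fas result for the selected entries back out with writeLines(unlist(...)) would give a smaller FASTA file containing only those entries.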

count.aa counts the occurrences of each amino acid or nucleic-acid base in a sequence (seq).
For amino acids, the columns in the returned data frame are in the same order as thermo$protein. Letters are matched without regard for case. A warning is generated if any character in seq, excluding spaces, is not one of the single-letter amino acid or nucleobase abbreviations.
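
The counting behavior described above (case-insensitive matching, spaces ignored, a warning for unrecognized characters) might look roughly like this in base R. count_letters is a hypothetical helper, it handles amino acids only, and the alphabetical column order used here is an assumption rather than a statement about thermo$protein.

```r
# Hypothetical helper (not the package's code): count amino acid letters in a
# single sequence string.
count_letters <- function(seq) {
  # The 20 standard single-letter amino acid abbreviations
  aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
          "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  # Ignore case and spaces, then split into single characters
  chars <- strsplit(gsub(" ", "", toupper(seq), fixed = TRUE), "")[[1]]
  # Warn about anything that is not a recognized abbreviation
  unknown <- setdiff(chars, aa)
  if (length(unknown) > 0) {
    warning("unrecognized characters: ", paste(unknown, collapse = " "))
  }
  # One count per letter, including zeros for letters that are absent
  counts <- vapply(aa, function(a) sum(chars == a), integer(1))
  as.data.frame(as.list(counts))
}

count_letters("acd AA")  # A = 3, C = 1, D = 1, all other columns 0
```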
start and/or stop can be provided to count a fragment of the sequence (extracted using substr). If only one of start or stop is present, the other defaults to 1 (start) or the length of the sequence (stop).
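
The defaults for start and stop can be shown with a short base R sketch. fragment is a hypothetical helper written only for this illustration; the source states only that the fragment is extracted using substr.

```r
# Hypothetical helper (not the package's code): extract a fragment of a
# sequence, filling in whichever bound is missing.
fragment <- function(seq, start = NULL, stop = NULL) {
  if (is.null(start)) start <- 1           # missing start: first position
  if (is.null(stop)) stop <- nchar(seq)    # missing stop: end of the sequence
  substr(seq, start, stop)
}

fragment("ACDEFGHIK", start = 4)  # "EFGHIK"
fragment("ACDEFGHIK", stop = 3)   # "ACD"
fragment("ACDEFGHIK", 2, 5)       # "CDEF"
```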

uniprot.aa returns a data frame of amino acid composition, in the format of thermo$protein, computed from the protein sequence if it is available from UniProt (http://uniprot.org; The UniProt Consortium, 2012). The protein argument corresponds to the Entry name on the UniProt search pages.