is.fasta checks whether a file is in FASTA format. The test is very simple: the function returns TRUE if either of the first two lines of the file starts with >, and FALSE otherwise.
grep.file searches for entries in a FASTA file and returns the line numbers of the matching FASTA headers. It takes a search term in pattern and, optionally, a term to exclude in y. The ignore.case option is passed to grep, which does the work of finding the matching lines. Only lines that start with the expression in startswith are searched; the default setting reflects the format of the header line for each sequence in a FASTA file.
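The format check and the header search described above can be sketched in base R as follows. This is only an illustration of the logic, not the package's code: the helper names looks_like_fasta and find_headers, the exclude argument (standing in for y), the fixed-prefix treatment of startswith, and the sample entries are all assumptions made for the example.

```r
# Hypothetical helpers; names and arguments are illustrative only.

# TRUE if either of the first two lines starts with ">"
looks_like_fasta <- function(lines) {
  any(startsWith(head(lines, 2), ">"))
}

# Line numbers of header lines that match `pattern` but not `exclude`
find_headers <- function(lines, pattern, exclude = NULL,
                         ignore.case = TRUE, startswith = ">") {
  # Restrict the search to lines beginning with `startswith`
  # (treated here as a fixed prefix, which is an assumption)
  ihead <- which(startsWith(lines, startswith))
  hit <- grepl(pattern, lines[ihead], ignore.case = ignore.case)
  if (!is.null(exclude)) {
    hit <- hit & !grepl(exclude, lines[ihead], ignore.case = ignore.case)
  }
  ihead[hit]
}

# Made-up entries, for illustration only
fasta <- c(
  ">prot1 lysozyme chicken",
  "KVFGRCELAA",
  ">prot2 cytochrome bovine",
  "GDVEKGKKIF",
  ">prot3 Lysozyme human",
  "KVFERCELAR"
)

looks_like_fasta(fasta)
find_headers(fasta, "lysozyme")
find_headers(fasta, "lysozyme", exclude = "human")
```

The second call matches both lysozyme headers because the search ignores case by default; the third drops the entry whose header also matches the excluded term.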
If y is NULL and a supported operating system is identified, the operating system's grep command (or another program specified in the grep argument) is applied directly to the file instead of R's grep. This avoids having to read the file into R using readLines. If the lines of the file were already read in a preceding operation, they can be supplied to this function in the lines argument.
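A minimal sketch of this fallback strategy, assuming a grep program that accepts the common -n, -i and -e options: the external command is used when it can be found on the search path, and R's own grep on the output of readLines otherwise. The helper name header_lines, its grep_cmd argument, and the way the pattern is assembled are assumptions for illustration, not the package's implementation.

```r
# Hypothetical helper: find matching header line numbers with an external
# grep when one is available, otherwise with R's grep after readLines.
header_lines <- function(file, pattern, grep_cmd = "grep") {
  regex <- paste0("^>.*", pattern)
  if (nzchar(Sys.which(grep_cmd))) {
    # -n prefixes each matching line with its line number; -i ignores case.
    # A non-zero exit status (no match) only produces a warning here.
    out <- suppressWarnings(system2(
      grep_cmd,
      c("-n", "-i", "-e", shQuote(regex), shQuote(file)),
      stdout = TRUE
    ))
    # Keep the line number that precedes the first colon
    as.integer(sub(":.*", "", out))
  } else {
    lines <- readLines(file)
    grep(regex, lines, ignore.case = TRUE)
  }
}

# Write a small made-up FASTA file and search it
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">a lysozyme", "AAA", ">b other", "CCC", ">c Lysozyme", "GGG"), tmp)
hits <- header_lines(tmp, "lysozyme")
hits
```

For large files the external command avoids holding every line in R's memory, which is the motivation given above.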
read.fasta retrieves entries from a FASTA file. The line numbers of the headers of the desired sequences are passed to the function in i (they can be generated using grep.file). The format of the return value depends on ret: the default, count, returns a data frame of amino acid counts (which can be given to add.protein in order to add the proteins to thermo$protein); seq returns a list of sequences; and fas returns a list of lines extracted from the FASTA file, including the headers (this can be used, for example, to generate a new FASTA file containing only the selected sequences). Like grep.file, this function uses the OS's grep on supported operating systems to identify the header lines, and cat to read the file; otherwise, readLines and R's substr are used to read the file and locate the header lines. If lines is given, it bypasses the reading of the file and also overrides the use of the OS's tools. If the line numbers of the header lines were determined previously, they can be supplied in ihead.
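The three return formats can be illustrated with a base R sketch that works on lines already held in memory. The helper name extract_entries, the layout of the count data frame (a header column followed by one column per one-letter amino acid code), and the sample lines are assumptions for this example; the actual columns expected by add.protein may differ.

```r
# Hypothetical helper illustrating the "count", "seq" and "fas" formats.
# Assumes every value in `i` is the line number of a header line.
extract_entries <- function(lines, i, ret = "count") {
  ihead <- which(startsWith(lines, ">"))
  # Each entry runs from its header to the line before the next header
  last <- c(ihead[-1] - 1L, length(lines))[match(i, ihead)]
  fas <- Map(function(from, to) lines[from:to], i, last)
  if (ret == "fas") return(fas)
  # Join the sequence lines (everything after the header) of each entry
  seqs <- lapply(fas, function(x) paste(x[-1], collapse = ""))
  if (ret == "seq") return(seqs)
  # Count the 20 standard amino acids by one-letter code
  aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
          "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  count_aa <- function(s) {
    chars <- strsplit(s, "")[[1]]
    vapply(aa, function(a) sum(chars == a), integer(1))
  }
  counts <- do.call(rbind, lapply(seqs, count_aa))
  data.frame(header = lines[i], counts, check.names = FALSE,
             stringsAsFactors = FALSE)
}

# Made-up input, for illustration only
lines <- c(">p1 first", "ACDA", "CC", ">p2 second", "GGH")
seq_out <- extract_entries(lines, 4, ret = "seq")
fas_out <- extract_entries(lines, c(1, 4), ret = "fas")
count_out <- extract_entries(lines, c(1, 4))
count_out[, c("header", "A", "C", "D", "G", "H")]
```

Writing unlist(fas_out) with writeLines would produce a new FASTA file that contains only the selected entries.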
splitline takes a single character object (the line) and splits it into multiple lines of the given length (the last line may be shorter). It returns a character object containing the lines. This function is used by trimfas, which extracts the specified positions from a (usually) aligned FASTA file. The length of the lines output by trimfas is equal to the length of the first sequence line in the given file.
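The splitting and trimming steps can be sketched in base R as shown here. The helper names split_line and trim_alignment, their arguments, and the sample alignment are assumptions made for illustration; the sketch also assumes that every header is followed by at least one sequence line.

```r
# Hypothetical helper: split one string into pieces of at most `width`
# characters; the last piece may be shorter.
split_line <- function(line, width) {
  if (nchar(line) == 0) return("")
  starts <- seq(1, nchar(line), by = width)
  substring(line, starts, starts + width - 1)
}

# Hypothetical helper: keep positions start..stop of every sequence and
# re-wrap the result to the width of the first sequence line in the input.
trim_alignment <- function(lines, start, stop) {
  ihead <- which(startsWith(lines, ">"))
  width <- nchar(lines[ihead[1] + 1])
  last <- c(ihead[-1] - 1L, length(lines))
  out <- character(0)
  for (k in seq_along(ihead)) {
    # Join the sequence lines of this entry, then cut out the positions
    s <- paste(lines[(ihead[k] + 1):last[k]], collapse = "")
    out <- c(out, lines[ihead[k]], split_line(substr(s, start, stop), width))
  }
  out
}

pieces <- split_line("ABCDEFGHIJ", 4)

# Made-up alignment, for illustration only
aln <- c(">a", "ABCD", "EFGH", "IJ", ">b", "KLMN", "OPQR", "ST")
trimmed <- trim_alignment(aln, 3, 8)
trimmed
```

Joining the sequence lines before extracting positions means that start and stop refer to positions in the full sequence rather than in any single line of the file.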