readFasta2: Read File Of Protein Sequences In Fasta Format

Description

Read a fasta-formatted file (e.g. from UniProt) and extract the (protein) sequences together with their names.

Usage

readFasta2(
  filename,
  delim = "|",
  databaseSign = c("sp", "tr", "generic", "conta", "synt", "gi"),
  removeEntries = NULL,
  tableOut = FALSE,
  UniprSep = c("OS=", "OX=", "GN=", "PE=", "SV="),
  strictSpecPattern = TRUE,
  cleanCols = TRUE,
  silent = FALSE,
  callFrom = NULL,
  debug = FALSE
)

Value

Depending on argument tableOut, this function returns either a) a simple character vector (of sequences) with the (entire) UniProt annotation as names, or b) a matrix with columns 'database', 'uniqueIdentifier', 'entryName', 'proteinName', 'sequence' and further columns depending on argument UniprSep.

Arguments

filename: (character) name of the fasta-file to be read; .gz compressed files can be read, too (see examples)
delim: (character) delimiter used in the header-line
databaseSign: (character) characters at the beginning, right after the '>' (typically specifying the database of origin); they will be excluded from the sequence-header
removeEntries: (character) set to removeEntries='empty' to remove entries without any sequence; set to removeEntries='duplicated' to remove duplicate entries (same sequence and same header); set to removeEntries='allNA' to remove columns where all entries are NA (if tableOut=TRUE)
tableOut: (logical) toggle to return either a named character-vector or a matrix with enhanced parsing of the fasta-header. The resulting matrix will contain the columns 'database', 'uniqueIdentifier', 'entryName', 'proteinName', 'sequence' and further columns depending on argument UniprSep
UniprSep: (character) separators for further separating entry-fields if tableOut=TRUE, see also UniProt-FASTA-headers
strictSpecPattern: (logical or character) deprecated, this argument is not used any more
cleanCols: (logical) deprecated, please use argument removeEntries="allNA"
silent: (logical) suppress messages
callFrom: (character) allows easier tracking of messages produced
debug: (logical) supplemental messages for debugging
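
The meaning of removeEntries='empty' and removeEntries='duplicated' can be illustrated with a small base-R sketch. This is not the code used inside readFasta2; the vector seqs and its headers are made up for illustration, and the sketch only assumes that 'duplicated' means same header and same sequence, as stated above.

```r
## Hypothetical named character vector, shaped like the tableOut=FALSE output
seqs <- c("sp|A|A_X first"="MKT", "sp|B|B_X second"="",
  "sp|A|A_X first"="MKT", "sp|C|C_X third"="MKT")

## 'empty': entries without any sequence
isEmpty <- is.na(seqs) | nchar(seqs) == 0
## 'duplicated': same header AND same sequence (2nd and later occurrences)
isDupl <- duplicated(paste(names(seqs), seqs, sep="\t"))

## Keep what is neither empty nor duplicated
kept <- seqs[!isEmpty & !isDupl]
names(kept)
```

Note that the last entry is kept although it shares its sequence with the first one, since its header differs.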

Details

To read the fasta-headers as is (i.e. without any parsing), set argument tableOut=FALSE. With tableOut=TRUE the output is organized as a matrix that separates the meta-annotation (e.g. uniqueIdentifier, entryName, proteinName, GN) into separate columns. Please keep in mind that the parsers were primarily designed for the UniProt format.
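
To give an idea of what separating a UniProt-style header into columns involves, here is a minimal base-R sketch. It is an illustration only and not the parser used by readFasta2; the header line hdr is invented, and the sketch assumes each separator from UniprSep occurs at most once in the header.

```r
## Hypothetical UniProt-style header line (made up for illustration)
hdr <- ">sp|P00001|TEST_HUMAN Example protein OS=Homo sapiens OX=9606 GN=TST1 PE=1 SV=2"
UniprSep <- c("OS=", "OX=", "GN=", "PE=", "SV=")

## Drop the leading '>' and split at the delimiter ("|" taken literally)
parts <- strsplit(sub("^>", "", hdr), "|", fixed=TRUE)[[1]]
database <- parts[1]
uniqueIdentifier <- parts[2]

## Remainder: entryName, then proteinName, then the 'OS=', 'OX=' ... fields
rest <- parts[3]
entryName <- sub(" .*", "", rest)

## Start position of each separator (-1 if absent), then sorted by position
pos <- vapply(UniprSep, function(s) as.integer(regexpr(s, rest, fixed=TRUE)), integer(1))
ord <- sort(pos[pos > 0])

## proteinName sits between entryName and the first separator
proteinName <- trimws(substr(rest, nchar(entryName) + 2, ord[1] - 1))

## Each field runs from the end of its separator to just before the next one
ends <- c(ord[-1] - 1, nchar(rest))
fields <- trimws(substring(rest, ord + nchar(names(ord)), ends))
names(fields) <- sub("=$", "", names(ord))

c(database=database, uniqueIdentifier=uniqueIdentifier, entryName=entryName,
  proteinName=proteinName, fields)
```

Headers from other sources may not follow this layout, in which case such a split yields missing or misplaced fields.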

Examples

## Tiny example with common contaminants
path1 <- system.file('extdata', package='wrProteo')
fiNa <-  "conta1.fasta.gz"
fasta1 <- readFasta2(file.path(path1, fiNa))
## now let's read and further separate annotation-fields
fasta2 <- readFasta2(file.path(path1, fiNa), tableOut=TRUE)
str(fasta1)