microseq (version 2.1.5)

findGenes: Finding coding genes

Description

Finding coding genes in genomic DNA using the Prodigal software.

Usage

findGenes(
  genome,
  faa.file = "",
  ffn.file = "",
  proc = "single",
  trans.tab = 11,
  mask.N = FALSE,
  bypass.SD = FALSE
)

Value

A GFF-table (see readGFF for details) with one row for each detected coding gene.

Arguments

genome

A table with columns Header and Sequence, containing the genome sequence(s).

faa.file

If provided, prodigal will output all proteins to this fasta-file (text).

ffn.file

If provided, prodigal will output all DNA sequences to this fasta-file (text).

proc

Either "single" or "meta", see below.

trans.tab

Either 11 or 4 (see below).

mask.N

If TRUE, runs of N are masked so that no gene is predicted across them (logical).

bypass.SD

If TRUE, bypass the Shine-Dalgarno filter (logical).
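
The argument descriptions above can be turned into a quick input check before the external program is invoked. The following is only a minimal sketch: check_findGenes_args is a hypothetical helper, not part of microseq, and the toy sequences are made up for illustration.

```r
# Hypothetical helper (not part of microseq): checks inputs against the
# argument descriptions above before findGenes() is called.
check_findGenes_args <- function(genome, proc = "single", trans.tab = 11) {
  stopifnot(
    is.data.frame(genome),
    all(c("Header", "Sequence") %in% colnames(genome)),
    is.character(genome$Sequence),
    proc %in% c("single", "meta"),
    trans.tab %in% c(11, 4)
  )
  invisible(TRUE)
}

# Toy genome table with made-up sequences, for illustration only
genome <- data.frame(
  Header = c("contig_1", "contig_2"),
  Sequence = c("ATGAAACGCATTAGCACCTAA", "ATGGCTNNNNGGTTAA"),
  stringsAsFactors = FALSE
)

# Passes: valid table, proc and trans.tab
ok <- check_findGenes_args(genome, proc = "meta", trans.tab = 11)

# Fails: trans.tab = 1 is not one of the documented values (11 or 4)
bad <- tryCatch(
  check_findGenes_args(genome, trans.tab = 1),
  error = function(e) FALSE
)
```

A table returned by readFasta already has the Header and Sequence columns, so a check like this is mostly useful for tables assembled by hand.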

Author

Lars Snipen and Kristian Hovde Liland.

Details

The external software Prodigal is used to scan through a prokaryotic genome to detect the protein coding genes. This free software can be installed from https://github.com/hyattpd/Prodigal.

In addition to the standard output from this function, FASTA files with protein and/or DNA sequences may be produced directly by providing filenames in faa.file and ffn.file.
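
As a hedged sketch of the file output described above, the code below writes both FASTA files to a temporary directory and reads the protein file back with readFasta. The file names are arbitrary, and the availability check assumes that the Prodigal executable is on the PATH under the name "prodigal"; adjust this for your installation.

```r
# Output paths in a temporary directory (file names are arbitrary choices)
faa.file <- file.path(tempdir(), "proteins.faa")
ffn.file <- file.path(tempdir(), "genes.ffn")

# Assumption: Prodigal is installed and found on the PATH as "prodigal"
have.tools <- requireNamespace("microseq", quietly = TRUE) &&
  nzchar(Sys.which("prodigal"))

if (have.tools) {
  # Example genome shipped with microseq (a Mycoplasma, hence trans.tab = 4)
  genome.file <- system.file("extdata", "small.fna", package = "microseq")
  genome <- microseq::readFasta(genome.file)

  # The GFF-table is returned as usual; the FASTA files are written as a side effect
  gff.tbl <- microseq::findGenes(genome,
                                 faa.file = faa.file,
                                 ffn.file = ffn.file,
                                 trans.tab = 4)

  # Read the predicted proteins back in as a Header/Sequence table
  prot.tbl <- microseq::readFasta(faa.file)
  print(nrow(gff.tbl))
  print(nrow(prot.tbl))
}
```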

The proc argument specifies whether the input data should be treated as a single genome (the default) or as a metagenome. In the latter case the genome table contains (un-binned) contigs.

The translation table is 11 by default (the bacterial, archaeal and plant plastid code), but table 4 should be used for Mycoplasma and other organisms that use that code.

Setting mask.N = TRUE prevents genes from containing runs of N. Setting bypass.SD = TRUE turns off the search for a Shine-Dalgarno motif.
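
The options described in this section can be combined in one call. The sketch below collects them in a list and passes them with do.call; this is only one possible way to organize the call. The small example genome from the package stands in for real metagenome contigs, and the check again assumes that the Prodigal executable is on the PATH as "prodigal".

```r
# Options for a metagenome-style run with masking of N's turned on
opts <- list(
  proc      = "meta",   # treat the sequences as un-binned contigs
  trans.tab = 11,       # default translation table
  mask.N    = TRUE,     # do not predict genes across runs of N
  bypass.SD = FALSE     # keep the Shine-Dalgarno search
)

# Assumption: Prodigal is installed and found on the PATH as "prodigal"
have.tools <- requireNamespace("microseq", quietly = TRUE) &&
  nzchar(Sys.which("prodigal"))

if (have.tools) {
  # Stand-in data: the package's small example genome used as if it were contigs
  contigs <- microseq::readFasta(
    system.file("extdata", "small.fna", package = "microseq")
  )
  gff.tbl <- do.call(microseq::findGenes, c(list(genome = contigs), opts))
  print(nrow(gff.tbl))
}
```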

See Also

readGFF, gff2fasta.

Examples

if (FALSE) {
# This example requires the external prodigal software
# Using a genome file in this package.
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")

# Searching for coding sequences, this is Mycoplasma (trans.tab = 4)
genome <- readFasta(genome.file)
gff.tbl <- findGenes(genome, trans.tab = 4)

# Retrieving the sequences
cds.tbl <- gff2fasta(gff.tbl, genome)

# You may use the pipe operator
library(dplyr)    # provides %>% and filter()
library(ggplot2)
readFasta(genome.file) %>% 
  findGenes(trans.tab = 4) %>% 
  filter(Score >= 50) %>% 
  ggplot() +
  geom_histogram(aes(x = Score), bins = 25)
}