oriloc: Prediction of origin and terminus of replication in bacteria.

Description

This program finds the putative origin and terminus of replication in prokaryotic genomes. The program discriminates between codon positions.

Usage

oriloc(seq.fasta = system.file("sequences/ct.fasta.gz", package = "seqinr"),
  g2.coord = system.file("sequences/ct.predict", package = "seqinr"),
  glimmer.version = 3,
  oldoriloc = FALSE, gbk = NULL, clean.tmp.files = TRUE, rot = 0)

Arguments

seq.fasta

Character: the name of a file which contains the DNA sequence of a bacterial chromosome in fasta format. The default value, system.file("sequences/ct.fasta.gz", package = "seqinr"), is the fasta file ct.fasta.gz, the complete genome sequence of Chlamydia trachomatis that was used in Frank and Lobry (2000). To work with your own data, use something like seq.fasta = "myseq.fasta" if the file myseq.fasta is present in the current working directory (see getwd), or give the full path to the sequence file (see file.choose).

g2.coord

Character: the name of a file which contains the output of the glimmer program (*.predict in glimmer version 3).

glimmer.version

Numeric: the glimmer version used, either 2 or 3.

oldoriloc

Logical: set to TRUE to reproduce the (deprecated) outputs of the previous version (publication date: 2000) of the oriloc program.

gbk

Character: the URL of a file in GenBank format. When provided, oriloc uses a single GenBank file as input instead of seq.fasta and g2.coord. A local temporary copy of the GenBank file is made with download.file if gbk starts with http:// or ftp:// or file://, and with file.copy otherwise. The local copy is then used as input for gb2fasta and gbk2g2 to produce a fasta file and a glimmer-like (version 2) file, respectively, to be used by oriloc instead of seq.fasta and g2.coord.

clean.tmp.files

Logical: if TRUE temporary files generated when working with a GenBank file are removed.

rot

Integer, with a default value of zero, used to circularly permute the genome.
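As an illustration of what a circular permutation of a genome means, here is a minimal base R sketch. The helper rotate_circular is hypothetical and not part of seqinr, and the direction of the shift is an assumption: the convention actually used by oriloc for rot is not stated in this page.

```r
# Hypothetical helper (not part of seqinr): circularly permute a vector
# so that element rot + 1 becomes the first element.
# The shift direction is an assumption, not oriloc's documented convention.
rotate_circular <- function(x, rot = 0) {
  n <- length(x)
  if (n == 0) return(x)
  k <- rot %% n                 # wrap rot into 0 .. n - 1
  if (k == 0) return(x)
  c(x[(k + 1):n], x[1:k])
}

# Toy genome, one nucleotide per element
genome <- c("a", "c", "g", "t", "a", "a")
rotated <- rotate_circular(genome, rot = 2)
print(rotated)
```

A rotation by the genome length, or by zero, leaves the sequence unchanged, which is consistent with rot = 0 being the default.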

Value

A data.frame with seven columns: g2num for the CDS number in the g2.coord file, start.kb for the start position of the CDS expressed in Kb (this is the position of the first occurrence of a nucleotide in a CDS regardless of its orientation), end.kb for the last position of the CDS, CDS.excess for the DNA walk for gene orientation (+1 for a CDS in the direct strand, -1 for a CDS in the reverse strand) cumulated over genes, skew for the cumulated composite skew in third codon positions, x for the cumulated T - A skew in third codon positions, y for the cumulated C - G skew in third codon positions.
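The returned data.frame can be explored with ordinary base R tools. The sketch below uses a toy stand-in with the same column names; all values are made up and not internally consistent. It simply locates the two extremes of the cumulated composite skew. Which extreme corresponds to the origin and which to the terminus depends on the sign convention of the skew and is deliberately not asserted here.

```r
# Toy stand-in for the data.frame returned by oriloc();
# every value below is invented for illustration only.
out <- data.frame(
  g2num      = 1:6,
  start.kb   = c(0.1, 1.2, 2.5, 3.9, 5.0, 6.3),
  end.kb     = c(1.0, 2.3, 3.7, 4.8, 6.1, 7.2),
  CDS.excess = c(1, 2, 3, 2, 1, 0),
  skew       = c(2, 5, 9, 6, 1, -3),
  x          = c(1, 3, 5, 4, 1, -1),
  y          = c(1, 2, 4, 2, 0, -2)
)

# Rows where the cumulated composite skew is smallest and largest:
# these are the candidate replication boundaries.
i_min <- which.min(out$skew)
i_max <- which.max(out$skew)
extremes <- out[c(i_min, i_max), c("g2num", "start.kb", "skew")]
print(extremes)
```

With real output, the same two lines using which.min and which.max can be applied directly to the result of oriloc().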

Details

The method builds on the fact that there are compositional asymmetries between the leading and the lagging strand for replication. The program works only with third codon positions so as to increase the signal/noise ratio. To discriminate between codon positions, the program uses as input either an annotated GenBank file, or a fasta file and a glimmer2.0 (or glimmer3.0) output file.
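The idea of cumulating skews over third codon positions only can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the code used by oriloc: the helper third_pos_skews, the toy genome, and the CDS table are hypothetical; bases are counted on the published strand for both orientations, which is an assumption; and the composite skew that oriloc derives from x and y is not reproduced.

```r
# Hypothetical helper (not part of seqinr): cumulated T - A and C - G
# skews restricted to third codon positions, plus the gene orientation walk.
# Assumption: bases are counted on the published strand for both orientations.
third_pos_skews <- function(genome, cds) {
  per_cds <- t(vapply(seq_len(nrow(cds)), function(i) {
    if (cds$strand[i] > 0) {
      # Direct strand: third positions are start + 2, start + 5, ...
      pos <- seq(cds$start[i] + 2, cds$end[i], by = 3)
    } else {
      # Reverse strand: codons are read from the end backwards
      pos <- seq(cds$end[i] - 2, cds$start[i], by = -3)
    }
    b <- genome[pos]
    c(ta = sum(b == "t") - sum(b == "a"),
      cg = sum(b == "c") - sum(b == "g"))
  }, FUN.VALUE = c(ta = 0, cg = 0)))
  data.frame(
    CDS.excess = cumsum(cds$strand),    # +1 direct, -1 reverse, cumulated
    x = cumsum(per_cds[, "ta"]),        # cumulated T - A skew
    y = cumsum(per_cds[, "cg"])         # cumulated C - G skew
  )
}

# Toy genome: one direct CDS (1..12) followed by one reverse CDS (13..24)
genome <- strsplit("atgaaattttagctaccctttcat", "")[[1]]
cds <- data.frame(start = c(1, 13), end = c(12, 24), strand = c(1, -1))
res <- third_pos_skews(genome, cds)
print(res)
```

Restricting the counts to third codon positions is what requires CDS coordinates as input, hence the need for either a GenBank annotation or a glimmer output file.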

References

More illustrated explanations to help understand oriloc outputs are available at https://pbil.univ-lyon1.fr/software/Oriloc/howto.html.

Examples of oriloc outputs on real sequence data are available at https://pbil.univ-lyon1.fr/software/Oriloc/index.html.

The original paper for oriloc: Frank, A.C., Lobry, J.R. (2000) Oriloc: prediction of replication boundaries in unannotated bacterial chromosomes. Bioinformatics, 16:560-561. https://doi.org/10.1093/bioinformatics/16.6.560

A simple informal introduction to DNA-walks: Lobry, J.R. (1999) Genomic landscapes. Microbiology Today, 26:164-165. https://seqinr.r-forge.r-project.org/MicrTod_1999_26_164.pdf

An early and somewhat historical application of DNA-walks: Lobry, J.R. (1996) A simple vectorial representation of DNA sequences for the detection of replication origins in bacteria. Biochimie, 78:323-326.

Glimmer, a very efficient open source software for the prediction of CDS from scratch in prokaryotic genomes, is described at http://ccb.jhu.edu/software/glimmer/index.shtml. For a description of Glimmer 1.0 and 2.0 see:

Delcher, A.L., Harmon, D., Kasif, S., White, O., Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER, Nucleic Acids Research, 27:4636-4641.

Salzberg, S., Delcher, A., Kasif, S., White, O. (1998) Microbial gene identification using interpolated Markov models, Nucleic Acids Research, 26:544-548.

citation("seqinr")

Examples


# NOT RUN {
#
# A little bit too long for routine checks because oriloc() is already
# called in draw.oriloc.Rd documentation file. Try example(draw.oriloc)
# instead, or copy/paste the following code:
#
out <- oriloc()
plot(out$start.kb, out$skew, type = "l", xlab = "Map position in Kb",
    ylab = "Cumulated composite skew", 
    main = expression(italic(Chlamydia~~trachomatis)~~complete~~genome))
#
# Example with a single GenBank file:
#
out2 <- oriloc(gbk="ftp://pbil.univ-lyon1.fr/pub/seqinr/data/ct.gbk")
draw.oriloc(out2)
#
# (some warnings are generated because of join in features and a gene that
# wraps around the genome)
#
# }
