pre.proc: Prepare data inputs for the main function `run.CONDOP()`.

Description

Load the annotation files and a list of count tables (or coverage vectors). Each count table corresponds to a specific experimental condition and must contain two columns: fwd (coverage depth on the forward strand) and rev (coverage depth on the reverse strand). The annotation files are:

- GFF-like file: can be downloaded from the NCBI genomes FTP directory, ftp://ftp.ncbi.nih.gov/genomes.
- DOOR-like file: can be downloaded from http://csbl.bmb.uga.edu/DOOR/displayspecies.php.
- FASTA-like file: can be downloaded from www.ncbi.nlm.nih.gov.
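
As a minimal sketch of the count-table layout described above, the following builds one table with the required fwd and rev columns and wraps it in a named list. The depth values, the object names, and the choice of a data frame with one row per genome position are illustrative assumptions, not taken from the package.

```r
# Hypothetical per-base coverage depth for a 10-bp toy genome (made-up values).
fwd_depth <- c(0, 0, 3, 5, 5, 4, 2, 0, 0, 0)
rev_depth <- c(1, 1, 0, 0, 0, 0, 2, 6, 6, 3)

# A count table with the two required columns, `fwd` and `rev`.
# Assumption: one row per genome position.
ct_example <- data.frame(fwd = fwd_depth, rev = rev_depth)

# One list entry per experimental condition (the name `cond1` is hypothetical).
list_cov <- list(cond1 = ct_example)

# Sanity check: every table carries both expected columns.
has_cols <- vapply(list_cov,
                   function(ct) all(c("fwd", "rev") %in% names(ct)),
                   logical(1))
stopifnot(all(has_cols))
```

A list built this way has the shape expected by the list.cov.dat argument, with one element per condition.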

Usage

pre.proc(gff.file, door.op.file, fasta.file, list.cov.dat, remove.cov = list("rRNA"), log2.expr = TRUE, sw = 100, save.data.file = NULL, verbose = TRUE)

Arguments

gff.file

A full local path indicating the GFF-like file to load.

door.op.file

A full local path indicating the DOOR-like file to load (DOOR-operon annotations).

fasta.file

A full local path indicating the FASTA-like file to load or a character string representing the accession number of the genome sequence to download.

list.cov.dat

List of count tables.

remove.cov

List of character values. Each character value corresponds to a specific type of annotated feature. The coverage depth over features of those types will be removed. The default list contains "rRNA", so the coverage depth of "rRNA" features is removed.
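
The effect of remove.cov can be pictured with a small base-R sketch. This is not CONDOP's implementation: the annotation column names (type, start, end, strand), the helper mask_features, and the choice to zero the depth only on the feature's own strand are all assumptions made for illustration.

```r
# Toy annotation table; the column names are assumptions for this sketch.
features <- data.frame(
  type   = c("CDS", "rRNA", "CDS"),
  start  = c(1, 4, 8),
  end    = c(3, 6, 10),
  strand = c("+", "+", "-"),
  stringsAsFactors = FALSE
)

# Toy count table with constant depth on both strands.
ct <- data.frame(fwd = rep(5, 10), rev = rep(2, 10))

# Hypothetical helper: set the depth to zero over every feature whose type
# is listed in remove.cov, on the strand where the feature is annotated.
mask_features <- function(ct, features, remove.cov = list("rRNA")) {
  drop <- features[features$type %in% unlist(remove.cov), , drop = FALSE]
  for (i in seq_len(nrow(drop))) {
    pos <- drop$start[i]:drop$end[i]
    col <- if (drop$strand[i] == "+") "fwd" else "rev"
    ct[pos, col] <- 0
  }
  ct
}

masked <- mask_features(ct, features)
```

In this toy case only positions 4 to 6 of the forward strand are zeroed, since the single rRNA feature lies there.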

log2.expr

Logical value indicating whether CONDOP will use log2-transformed expression values. The expression values are computed as RPKM. Default logical value is TRUE.
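
For orientation, the standard RPKM definition and a log2 transform can be sketched as follows. How CONDOP derives read counts from coverage depth is not described here, so the rpkm helper, the gene names, the counts, and the pseudocount of 1 added before taking the logarithm are assumptions of this example.

```r
# Standard RPKM: reads per kilobase of feature per million mapped reads.
rpkm <- function(reads, length_bp, total_mapped) {
  reads * 1e9 / (length_bp * total_mapped)
}

# Hypothetical read counts and feature lengths for two genes.
reads        <- c(geneA = 500, geneB = 20)
length_bp    <- c(geneA = 1000, geneB = 2000)
total_mapped <- 1e6

expr <- rpkm(reads, length_bp, total_mapped)

# log2 transform; the pseudocount of 1 (an assumption) keeps features with
# zero expression finite.
log2_expr <- log2(expr + 1)
```

With these numbers geneA has an RPKM of 500 and geneB an RPKM of 10.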

sw

Numeric value specifying the sliding window size. Default value is 100.
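
To give a feel for what a sliding window over coverage depth does, here is a sketch using a centered moving average. The documentation does not state how CONDOP applies the window, so the averaging scheme, the window size of 3, and the depth values are assumptions for illustration only.

```r
# Hypothetical coverage depth along one strand.
depth <- c(0, 0, 3, 6, 9, 6, 3, 0, 0)

# Window size for this toy example (the function's default is 100).
sw <- 3

# Centered moving average with equal weights; positions where the window
# does not fit entirely inside the vector are returned as NA.
smoothed <- as.numeric(stats::filter(depth, rep(1 / sw, sw), sides = 2))
```

A larger window smooths out local fluctuations in depth at the cost of blurring sharp transitions.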

save.data.file

Character string naming a file in which to save the input for the CONDOP main process. Defaults to NULL.

verbose

Indicate whether information about the process should be reported. Defaults to TRUE.

Value

genes.and.ops: A dataframe merging information about genes/features and operons.
gseq: A character vector representing the genome sequence of the target organism.
igr.pos: A dataframe containing information about intergenic regions (IGRs) - forward (+) strand.
igr.neg: A dataframe containing information about intergenic regions (IGRs) - reverse (-) strand.
tl.cds: A list of dataframes containing the expression levels of annotated coding sequences (CDS regions). One dataframe for each count table.
tl.igr.pos: A list of dataframes containing the expression levels of intergenic sequences (IGR regions) - forward (+) strand. One dataframe for each count table.
tl.igr.neg: A list of dataframes containing the expression levels of intergenic sequences (IGR regions) - reverse (-) strand. One dataframe for each count table.
sid.points: A list of dataframes containing information about boundaries of transcriptionally active regions.
cut.lhe: A list of numeric vectors giving the cut-off values that separate low-expression from high-expression RNA-seq data on the forward and reverse strands. One vector for each count table.

Examples

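## Not run: 
#     # file_genome_annot must point to a GFF-like annotation file; the path
#     # below is a placeholder, not a file known to ship with the package.
#     file_genome_annot <- "path/to/genome_annotation.gff"
#     file_operon_annot <- system.file("extdata", "1944.opr", package="CONDOP")
#     file_genome_seq   <- system.file("extdata", "EC-k12-MG1655.fasta", package="CONDOP")
#     data(ct1)
#     # The genome sequence is given here by accession number; file_genome_seq
#     # could be passed instead to use the local FASTA file.
#     data.in <- pre.proc(file_genome_annot, file_operon_annot, "NC_000913",
#                         list.cov.dat = list(ct1 = ct1))
# ## End(Not run)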