optimal_index: optimal codon index for the four- and six-codon boxes

Description

The function optimal_index estimates the relative amount of GC-ending optimal codons in the four- and six-codon boxes under a given mutational background. It uses the same mathematical formula as sscu and also takes the background mutation rate into account, so it is comparable with the S index. However, because the set of GC-ending optimal codons is likely to differ among species, the index cannot be compared across species.

Usage

optimal_index(high_cds_file = NULL, genomic_cds_file = NULL)

Arguments

high_cds_file

a character vector giving the file path of the multifasta file of highly expressed genes

genomic_cds_file

a character vector giving the file path of the whole-genome CDS multifasta file

Value

A numeric vector, optimal_index, is returned.

Details

The argument high_cds_file must be set to the file path of the highly expressed genes. The file should be a multifasta file containing 40 highly expressed genes, including elongation factors Tu, Ts and G, 50S ribosomal proteins L1 to L6 and L9 to L20, and 30S ribosomal proteins S2 to S20. This file can be generated either by extracting these DNA sequences directly from a GenBank file or by identifying them with a BLAST search. For the four amino acids Phe, Tyr, Ile and Asn, the C-ending codons are always preferred over the U-ending codons. Thus, only these four codons were taken into account in the analyses.
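Before calling optimal_index, it can help to confirm that the highly expressed gene file looks as described above. The following is a minimal sketch in base R, not part of the sscu package: the helper name summarise_fasta is hypothetical, and it assumes a plain multifasta file in which every record starts with a ">" header line.

```r
# Hypothetical helper (not part of sscu): summarise a multifasta file
# using base R only.
summarise_fasta <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]              # drop empty lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)             # header index for every line
  # Join the sequence lines that belong to the same record.
  seqs <- tapply(lines[!is_header], record_id[!is_header],
                 paste, collapse = "")
  seq_lengths <- nchar(seqs)
  list(
    n_records = sum(is_header),
    n_not_multiple_of_3 = sum(seq_lengths %% 3 != 0)
  )
}

# Small made-up file: gene1 is 12 nt long, gene2 is 5 nt long.
demo <- tempfile(fileext = ".ffn")
writeLines(c(">gene1", "ATGAAA", "TTTTAA", ">gene2", "ATGCC"), demo)
info <- summarise_fasta(demo)
print(info)
```

For a real high_cds_file, n_records would be expected to be 40, and a non-zero n_not_multiple_of_3 suggests that some sequences are not complete coding sequences.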

The argument genomic_cds_file is used to calculate the genomic mutation rate (GC3). It should be a multifasta file containing all the coding sequences in the genome; the function uses it to calculate the genomic GC3 and mutation rate.
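To illustrate what a genomic GC3 value is, the sketch below computes the fraction of G or C at third codon positions for a multifasta file using base R. It is not the sscu implementation: the helper name gc3_fraction is hypothetical, and it assumes that all codons (including start and stop codons) are pooled across genes and that ambiguous bases are ignored. The value used internally by optimal_index may be computed differently.

```r
# Hypothetical helper (not part of sscu): GC fraction at third codon
# positions, pooled over all sequences in a multifasta file.
gc3_fraction <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)
  seqs <- tapply(toupper(lines[!is_header]), record_id[!is_header],
                 paste, collapse = "")
  third <- unlist(lapply(seqs, function(s) {
    n <- nchar(s) - nchar(s) %% 3            # drop a trailing partial codon
    if (n < 3) return(character(0))
    pos <- seq(3, n, by = 3)                 # positions 3, 6, 9, ...
    substring(s, pos, pos)
  }))
  third <- third[third %in% c("A", "C", "G", "T")]  # skip ambiguous bases
  mean(third %in% c("G", "C"))
}

# Made-up data: third positions are G, C (gene a) and A, T (gene b).
demo_cds <- tempfile(fileext = ".ffn")
writeLines(c(">a", "ATGGCC", ">b", "AAATTT"), demo_cds)
gc3_demo <- gc3_fraction(demo_cds)
print(gc3_demo)
```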

Currently, the function only calculates the usage of GC-ending optimal codons. In addition, most AT-biased genomes do not have any GC-ending optimal codons for the four- and six-codon boxes, so the function reports NA for them. An index of 0 means that optimal codon usage follows the mutation pattern, whereas higher values mean that more GC-ending optimal codons are used in the highly expressed genes.
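Because the result can be NA, downstream code should check for it before interpreting the value. The sketch below shows one possible way to do this. The helper describe_optimal_index, its labels and its tolerance argument are hypothetical and only encode the interpretation given above; values below 0 are not covered by this documentation and are labelled accordingly.

```r
# Hypothetical helper (not part of sscu): label a single optimal_index value.
describe_optimal_index <- function(x, tolerance = 1e-8) {
  if (length(x) != 1 || is.na(x)) {
    return("none")          # no GC-ending optimal codons reported (NA)
  }
  if (abs(x) <= tolerance) {
    return("neutral")       # usage follows the mutation pattern
  }
  if (x > 0) {
    return("enriched")      # more GC-ending optimal codons in highly expressed genes
  }
  "undocumented"            # negative values are not described in this help page
}

labels <- vapply(list(NA_real_, 0, 0.8), describe_optimal_index, character(1))
print(labels)
```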

References

unpublished paper from Yu Sun

Examples

# ----------------------------------------------- #
#     Lactobacillus kunkeei example               #
# ----------------------------------------------- #

  # Load the example data included in the sscu package and
  # pass the two multifasta files to optimal_index
  library(sscu)
  optimal_index(
    high_cds_file = system.file("sequences/L_kunkeei_highly.ffn", package = "sscu"),
    genomic_cds_file = system.file("sequences/L_kunkeei_genome_cds.ffn", package = "sscu")
  )

  # To use your own data, specify the file paths of your input files, for example:
  # optimal_index(
  #   high_cds_file = "/home/yu/Data/codon_usage/bee_endosymbionts/sharp_40_highly_dataset/Bin2.ffn",
  #   genomic_cds_file = "/home/yu/Data/codon_usage/bee_endosymbionts/cds_filtered/Bin2.ffn"
  # )