POS.Feature: Transformation of nucleic acid sequences into numeric vectors using position-wise frequency of nucleotides.

Description

This encoding scheme was devised by Li et al. (2012). The frequencies of the four nucleotides are first computed at each position, separately for the positive and negative datasets, giving two \(4*L\) probability tables (one per class) for sequences of length \(L\). A \(4*L\) statistical difference table is then obtained by elementwise subtraction of the two probability tables and is used to encode the sequences. As in sparse encoding, the nucleotides A, T, G and C are represented as (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1) respectively. The value 1 in the sparse encoding is then replaced with the corresponding value from the difference table for the nucleotide at each position. POS feature encoding can thus be seen as a blend of the MN-FDTF (Huang et al., 2006) and sparse encoding (Meher et al., 2016) techniques.
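
The steps described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper names freq_table and pos_encode_sketch are hypothetical, plain character vectors stand in for DNAStringSet objects, and two details are assumptions rather than facts from this page (the difference is taken as positive minus negative, and the four values for each position are placed next to each other in the output row). All sequences are assumed to have the same length and to contain only A, T, G and C.

```r
# Nucleotide order follows the sparse encoding in the text: A, T, G, C
nuc <- c("A", "T", "G", "C")

# Hypothetical helper: 4 x L table of position-wise nucleotide frequencies
freq_table <- function(seqs) {
  chars <- do.call(rbind, strsplit(seqs, ""))  # N x L character matrix
  tab <- matrix(0, nrow = 4, ncol = ncol(chars), dimnames = list(nuc, NULL))
  for (b in nuc) tab[b, ] <- colMeans(chars == b)  # share of sequences with b at each position
  tab
}

# Hypothetical sketch of the encoding (assumption: positive minus negative)
pos_encode_sketch <- function(positive, negative, test) {
  diff_tab <- freq_table(positive) - freq_table(negative)  # 4 x L difference table
  L <- ncol(diff_tab)
  enc <- matrix(0, nrow = length(test), ncol = 4 * L)
  for (i in seq_along(test)) {
    ch <- strsplit(test[i], "")[[1]]
    idx <- cbind(match(ch, nuc), seq_len(L))  # (nucleotide row, position) pairs
    sparse <- matrix(0, nrow = 4, ncol = L)   # sparse encoding, all zeros to start
    sparse[idx] <- diff_tab[idx]              # the "1" entries become difference values
    enc[i, ] <- as.vector(sparse)             # 4 values for position 1, then position 2, ...
  }
  enc
}

# Toy data, made up for illustration only
pos <- c("ATGC", "ATGG")
neg <- c("TTGC", "CTGA")
tst <- c("ATGC", "TTGG")
enc <- pos_encode_sketch(pos, neg, tst)
dim(enc)
```

Each row of the result has 4 entries per position, of which at most one is non-zero, so a test sequence of length \(L\) becomes a vector of length \(4L\).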

Usage

POS.Feature(positive_class, negative_class, test_seq)

Arguments

positive_class

Sequence dataset of the positive class, must be an object of class DNAStringSet.

negative_class

Sequence dataset of the negative class, must be an object of class DNAStringSet.

test_seq

Sequences to be encoded into numeric vectors, must be an object of class DNAStringSet.

Value

A numeric matrix of order \(m*4n\), where \(m\) is the number of sequences in test_seq and \(n\) is the length of each sequence.

Details

A DNAStringSet object can be obtained by reading sequences in FASTA format with the function readDNAStringSet, available in the Biostrings package of Bioconductor.
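
As a hedged illustration of this step, the sketch below writes two made-up sequences to a temporary FASTA file and reads them back with readDNAStringSet. The file and the sequences are hypothetical and exist only to keep the example self-contained; the Biostrings package must be installed. The width check reflects the assumption that position-wise encoding needs sequences of equal length.

```r
library(Biostrings)

# Write a small, made-up FASTA file to a temporary location
fa <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ATGCATGC", ">seq2", "TTGCATGA"), fa)

# Read the file into a DNAStringSet object
seqs <- readDNAStringSet(fa, format = "fasta")
class(seqs)

# Check that all sequences have the same length before encoding
stopifnot(length(unique(width(seqs))) == 1)
```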

References

Huang, J., Li, T., Chen, K. and Wu, J. (2006). An approach of encoding for prediction of splice sites using SVM. Biochimie, 88(7): 923-929.
Li, J.L., Wang, L.F., Wang, H.Y., Bai, L.Y. and Yuan, Z.M. (2012). High-accuracy splice sites prediction based on sequence component and position features. Genetics and Molecular Research, 11(3): 3432-3451.
Meher, P.K., Sahu, T.K., Rao, A.R. and Wahi, S.D. (2016). A computational approach for prediction of donor splice sites with improved accuracy. Journal of Theoretical Biology, 404: 285-294.

Examples

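# NOT RUN {
# Load the droso example dataset
data(droso)
positive <- droso$positive
negative <- droso$negative
test <- droso$test
# Use the first 200 sequences of each class to build the difference table
pos <- positive[1:200]
neg <- negative[1:200]
tst <- test
# Encode the test sequences; the result has one row per sequence in tst
enc <- POS.Feature(positive_class=pos, negative_class=neg, test_seq=tst)
enc
# }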
