DNAshapeR (version 1.0.2)

encodeSeqShape: Encode k-mer DNA sequence and n-th order DNA Shape features

Description

DNAshapeR can generate feature vectors for a user-defined model. Such models can be based on DNA sequence features (1-mer, 2-mer, 3-mer), DNA shape features (MGW, Roll, ProT, HelT), or any combination thereof. For 1-mers, sequence is encoded as four binary features at each nucleotide position: 0001 for adenine, 0010 for cytosine, 0100 for guanine, and 1000 for thymine (Zhou et al., 2015). Encoding of 2-mers and 3-mers (16 and 64 binary features at each position, respectively) is also supported. Shape features include first-order, second-order, and higher-order values for the four structural parameters MGW, Roll, ProT, and HelT. Second-order shape features are product terms of values of the same shape category at adjacent positions. The function can generate any subset of these features, e.g. a single shape category or only first-order shape features, and any desired combination of shape and sequence features. Feature encoding returns a feature matrix for a dataset of multiple sequences, in which each sequence contributes one concatenated feature vector. The output can be used directly as input to statistical machine learning methods.
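The two encodings described above can be illustrated with a minimal base-R sketch. It is not the DNAshapeR implementation: the helper names encode_1mer and second_order are hypothetical, the shape values are made-up numbers, and the sketch assumes all sequences have equal length and contain only A, C, G, and T.

```r
# Hypothetical helper: encode a DNA string as four binary features per
# position, using the bit patterns quoted in the description
# (A = 0001, C = 0010, G = 0100, T = 1000).
encode_1mer <- function(seq) {
  codes <- list(
    A = c(0, 0, 0, 1),
    C = c(0, 0, 1, 0),
    G = c(0, 1, 0, 0),
    T = c(1, 0, 0, 0)
  )
  bases <- strsplit(toupper(seq), "")[[1]]
  stopifnot(all(bases %in% names(codes)))
  # Concatenate the per-position codes into one feature vector
  unlist(codes[bases], use.names = FALSE)
}

# Hypothetical helper: second-order terms as products of values of one
# shape category at adjacent positions.
second_order <- function(x) {
  stopifnot(length(x) >= 2)
  x[-length(x)] * x[-1]
}

# One row per sequence, one concatenated feature vector per row
seqs <- c("ACGT", "TTGA")
features <- do.call(rbind, lapply(seqs, encode_1mer))

# Made-up values standing in for one shape category at three positions
mgw_products <- second_order(c(5, 4, 6))
```

Here features has one row per sequence and four columns per nucleotide position, and mgw_products holds one product per pair of adjacent positions.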

Usage

encodeSeqShape(fastaFileName, shapeMatrix, featureNames, normalize)

Arguments

fastaFileName
A character string giving the name of the input FASTA file, including the full path if the file is located outside the current working directory.
shapeMatrix
A matrix containing the DNAshape prediction results.
featureNames
A character vector containing a combination of user-defined sequence and shape parameters. The parameters can be any combination of "k-mer", "n-shape", "n-MGW", "n-ProT", "n-Roll", and "n-HelT", where k and n are integers.
normalize
A logical indicating whether to perform normalization. Defaults to TRUE.
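The naming pattern for featureNames can be checked with a short sketch like the one below. The helper is_valid_feature_name is hypothetical and not part of DNAshapeR. It only tests the "integer, hyphen, feature type" form listed above, and it does not know which values of k or n the package actually accepts.

```r
# Hypothetical helper: TRUE for names of the form "<integer>-<type>",
# where <type> is one of the feature types listed for featureNames.
is_valid_feature_name <- function(x) {
  grepl("^[1-9][0-9]*-(mer|shape|MGW|ProT|Roll|HelT)$", x)
}

# Combine sequence and shape features in one request
featureNames <- c("1-mer", "1-shape", "2-MGW")
ok <- is_valid_feature_name(featureNames)

# Names without the integer prefix do not match the pattern
bad <- is_valid_feature_name(c("shape", "mer-1"))
```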

Value

featureVector A matrix containing encoded features. Sequence features are represented as binary numbers, while shape features are represented as real numbers.

Examples

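library(DNAshapeR)
# Locate the sample FASTA file shipped with the package
fn <- system.file("extdata", "CGRsample_short.fa", package = "DNAshapeR")
# Predict DNA shape features for the sequences in the file
pred <- getShape(fn)
# Encode first-order shape features for every sequence
featureNames <- c("1-shape")
featureVector <- encodeSeqShape(fn, pred, featureNames)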