partition: Partition data by most inter-dependent positions

Description

Partitions data by the nucleotides at the most inter-dependent positions, as measured by pairwise mutual information. Partitioning is performed recursively on the resulting subsets until i) the number of sequences in a partition is less than minElements, ii) the average pairwise dependency between the current position and the numBestForSorting other positions with the largest mutual information values drops below threshold, or iii) maxNum recursive splits have already been performed. If splitting results in partitions with fewer than minElements sequences, these are added to the smallest partition with more than minElements sequences.
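The dependency measure described above can be illustrated with a short base R sketch. It is not DepLogo's implementation: the helper `mutual_information`, the toy sequences, and the variable names are assumptions made for illustration, and the package may scale or normalize mutual information differently.

```r
# Illustrative sketch only; not the code used inside partition().
# Mutual information (in bits) between two equally long symbol vectors.
mutual_information <- function(x, y) {
  joint <- table(x, y) / length(x)   # joint relative frequencies
  px <- rowSums(joint)               # marginal distribution of x
  py <- colSums(joint)               # marginal distribution of y
  expected <- outer(px, py)          # joint distribution under independence
  nz <- joint > 0                    # skip empty cells (0 * log 0 is taken as 0)
  sum(joint[nz] * log2(joint[nz] / expected[nz]))
}

# Toy aligned sequences (made up); positions 1 and 2 always co-vary.
seqs <- c("ACGT", "ACGA", "TGGT", "TGGA", "ACCT", "TGCA")
mat <- do.call(rbind, strsplit(seqs, ""))

# Pairwise mutual information between all positions.
n_pos <- ncol(mat)
mi <- matrix(0, n_pos, n_pos)
for (i in seq_len(n_pos)) {
  for (j in seq_len(n_pos)) {
    if (i != j) mi[i, j] <- mutual_information(mat[, i], mat[, j])
  }
}

# For each position, average the num_best largest values to other positions,
# mirroring the role of numBestForSorting, and pick the most dependent one.
num_best <- 2
avg_dep <- apply(mi, 1, function(row) {
  mean(sort(row, decreasing = TRUE)[seq_len(num_best)])
})
best_pos <- which.max(avg_dep)
```

In this sketch, `best_pos` is the position that would be used for splitting, and `avg_dep[best_pos]` is the value that would be compared against threshold.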

Usage

partition(data, minElements = 10, threshold = 0.1, numBestForSorting = 3,
  maxNum = 6, sortByWeights = NULL)

Arguments

data

the data as a DLData object

minElements

the minimum number of elements (sequences) a partition must contain for a further split to be performed

threshold

the threshold on the average mutual information value; a partition is not split further once the average drops below this value

numBestForSorting

the number of other positions, chosen by largest mutual information, whose dependencies are averaged for each position

maxNum

the maximum number of recursive splits

sortByWeights

if TRUE, partitions are ordered by their average weight value; if FALSE, by the frequency of symbols at the partitioning position. If NULL, the $sortByWeights value of the DLData object is used.
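How minElements, threshold and maxNum interact can be sketched in base R as follows. The helpers `should_stop` and `merge_small` are hypothetical and not part of DepLogo; in particular, how the package treats partitions with exactly minElements sequences is an assumption here.

```r
# Hypothetical helpers for illustration; not functions of the DepLogo package.

# TRUE if a subset should not be split any further.
should_stop <- function(n_seqs, avg_mi, depth,
                        minElements = 10, threshold = 0.1, maxNum = 6) {
  n_seqs < minElements ||   # i) too few sequences
    avg_mi < threshold ||   # ii) average dependency too weak
    depth >= maxNum         # iii) recursion limit reached
}

# Merge partitions smaller than minElements into the smallest remaining one.
# `parts` is a plain list of character vectors (one vector per partition).
merge_small <- function(parts, minElements = 10) {
  sizes <- lengths(parts)
  small <- sizes < minElements
  if (!any(small) || all(small)) return(parts)   # nothing to merge, or no target
  big_idx <- which(!small)
  target <- big_idx[which.min(sizes[big_idx])]   # smallest sufficiently large partition
  parts[[target]] <- c(parts[[target]], unlist(parts[small], use.names = FALSE))
  parts[!small]
}

# Toy usage with made-up sequence labels.
parts <- list(A = paste0("A", 1:12), C = paste0("C", 1:15), G = c("G1", "G2"))
merged <- merge_small(parts)
```

Here the two-element partition `G` is too small, so its members are appended to `A`, the smallest partition that meets the size requirement.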

Value

the partitions as a list of DLData objects

Examples

# NOT RUN {
# create DLData object
seqs <- read.table(system.file("extdata", "cjun.txt", package = "DepLogo"),
    stringsAsFactors = FALSE)
data <- DLData(sequences = seqs[, 1], weights = log1p(seqs[, 2]))

# partition data using default parameters
partitions <- partition(data)

# partition data using a threshold of 0.3 on the mutual
# information value to the most dependent position,
# sorting the resulting partitions by weight
partitions2 <- partition(data = data, threshold = 0.3,
    numBestForSorting = 1, sortByWeights = TRUE)
# }
