main function for Snowball analysis

This is the main function for performing Snowball analysis. Only y and X are required; all other operating parameters have default values.

snowball(y, X, ncore = 1, d = 300, B = 10000, B.i = 2000,
  sample.n = 100, resample.method = c("sample", "none", "combn"),
  mode.resample = c("count.class", "flat", "percent.class"), k.resample = 1)
y: a factor variable for mutation status
X: a data.frame containing gene expression data. The columns of X should be aligned with y on samples
ncore: number of processors to use for parallel computation. Set ncore = 1 or NULL for non-parallel computation mode
d: the size of the gene subset for gene level resampling. See references on $d$ in $X_d^x$
B: bootstrap size, which is $B$ in $J_n(x)$, defining the total number of gene subsets used to estimate $J_n$, $$J_n(x)=\frac{1}{B}\sum_{i=1}^{B}\left(\frac{1}{K}\sum_{j=1}^{K}\phi_n(g(X_{i,j}),\kappa)\right)$$
B.i: bootstrap size deployed on each child job in parallel mode
sample.n: number of samples drawn from the subject level resampling, denoted as $K$ in $J_n(x)$. It is ignored if resample.method = "none" or "combn"
resample.method: defines how the subject level resampling is performed. The possible values are "sample", "none" and "combn". Let resample.method = "sample" for random sampling with replacement, "none"
mode.resample: specifies how the subjects are counted for subject level leave-k-out random sampling, and whether stratification by group is applied. The possible input values are "count.class", "flat" or "percent.class"
k.resample: a numerical value specifying the number of subjects left out during the subject level resampling. It is an integer if mode.resample = "count.class" and a number between 0 and 1 if mode.resample = "percent.

  • A data.frame containing two variables: weights and positives. weights are the $J_n(x)$ values for all genes, and positives are indicators of whether a specific $J_n(x)$ is above or below the median of all $J_n(x)$ values.
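The relationship between the nested average in $J_n(x)$ and the two returned columns can be illustrated with a toy sketch. Everything here is assumed for illustration: phi_vals holds random numbers standing in for the scores $\phi_n(g(X_{i,j}),\kappa)$, the sizes B, K and n_genes are arbitrary, and the direction of the indicator (TRUE for a weight above the median) is a guess, since the text only says "above or below". It is not the package's implementation.

```r
# Toy illustration only: phi_vals stands in for the per-resample scores
# phi_n(g(X_{i,j}), kappa); the values are random, not real Snowball output.
set.seed(42)
B <- 5; K <- 3; n_genes <- 6
phi_vals <- array(runif(n_genes * B * K), dim = c(n_genes, B, K))

# J_n(x) = (1/B) * sum_i [ (1/K) * sum_j phi ], computed for each gene
inner <- apply(phi_vals, c(1, 2), mean)   # average over the K subject resamples
weights <- rowMeans(inner)                # average over the B gene subsets

# Assumed direction: TRUE marks a weight above the median of all weights
res <- data.frame(weights = weights, positives = weights > median(weights))
res
```

Because every gene subset contributes the same number of subject resamples, the nested average equals the plain mean over all B * K scores for that gene.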


The resampling is applied on two dimensions (see references): gene level resampling and subject level resampling. The gene level resampling is straightforward: each time, it randomly takes d genes from all the genes in X. The subject level resampling is specified by the combination of values given in sample.n, resample.method, mode.resample and k.resample.

The flat resampling on all subjects regardless of grouping, specified by letting resample.method = "none", is simply a leave-k-out random sampling, where k is given by k.resample. In more complex cases, the subject level resampling can be stratified by the groups defined in y, in which case resample.method takes the value of either "sample" or "combn". When resample.method = "sample", a leave-k-out random sampling is applied within each group and only sample.n samples are generated from the resampling. When resample.method = "combn", all possible combinations that satisfy the restrictions given by mode.resample and k.resample are included. In this case, the total number of resampled samples varies with the sample size of the study.

mode.resample = "count.class" or "percent.class" defines two ways to calculate the number of subjects to be left out in the random sampling. The value "count.class" indicates the exact number to be left out, and "percent.class" indicates the percentage of total subjects to be left out. In all cases, k.resample specifies the number of subjects left out in the leave-k-out sampling. If k.resample is a single integer, the subjects are sampled with exactly k.resample subjects left out, either across all subjects in the case of flat sampling, or within each group in the case of stratified resampling by group. If k.resample is instead a vector of two integers, the sampling leaves out the given number of subjects from each of the two groups.

The first number is assigned to the first factor level of factor(y) and the second number to the second factor level. Check factor(y) to see how the two numbers in k.resample would be assigned to the two groups. A vector with two values for k.resample produces an error if mode.resample = "flat". This flexible way of defining the sampling scheme makes it easy to specify balanced sample sizes between groups. See references for more details.
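The leave-k-out schemes described above can be sketched in base R. The helper leave_k_out and the toy factor y are hypothetical and are not part of DESnowball; the sketch only covers exact counts (the "count.class" case) and does not reproduce how the package itself draws or rounds samples.

```r
# Hypothetical helper, not part of DESnowball: draw one leave-k-out sample
# of subject indices, optionally stratified by the levels of y.
leave_k_out <- function(y, k, stratified = TRUE) {
  y <- factor(y)
  idx <- seq_along(y)
  if (!stratified) {
    stopifnot(length(k) == 1)            # flat sampling takes a single k
    drop <- idx[sample.int(length(idx), k)]
    return(setdiff(idx, drop))
  }
  k <- rep_len(k, nlevels(y))            # a scalar k applies to every group
  groups <- split(idx, y)                # list ordered by levels(y)
  drop <- unlist(Map(function(g, kk) g[sample.int(length(g), kk)], groups, k))
  setdiff(idx, drop)
}

set.seed(1)
y <- factor(c("mut", "mut", "mut", "wt", "wt", "wt", "wt", "wt"))
levels(y)                                # "mut" "wt": k[1] -> mut, k[2] -> wt
kept <- leave_k_out(y, k = c(1, 2))      # drop 1 mut and 2 wt
table(y[kept])                           # 2 mut and 3 wt remain

# Exhaustive enumeration in the spirit of "combn": every way to drop
# 1 mut and 2 wt subjects
drop_mut <- combn(which(y == "mut"), 1, simplify = FALSE)
drop_wt  <- combn(which(y == "wt"), 2, simplify = FALSE)
n_schemes <- length(drop_mut) * length(drop_wt)   # choose(3, 1) * choose(5, 2)
```

With three subjects in one group and five in the other, leaving out c(1, 2) keeps two and three subjects respectively, and the exhaustive enumeration yields 30 distinct schemes, which shows why the number of resampled samples grows with the study size.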


Xu, Y., Guo, X., Sun, J. and Zhao, Z. Snowball: resampling combined with distance-based regression to discover transcriptional consequences of driver mutation. Manuscript.

  • snowball
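library(DESnowball)
# check the demo dataset (sb.mutation and sb.expression are assumed to be loaded)
table(sb.mutation)
dim(sb.expression)
## A test run
Bn <- 10000
ncore <- 4
# call Snowball; the original call is cut off after X, so passing ncore and B
# here is an assumption and all other arguments keep their defaults
sb <- snowball(y = sb.mutation, X = sb.expression,
               ncore = ncore, B = Bn)
# process the gene ranking and selection
sb.sel <- select.features(sb)
# plot the Jn values
plotJn(sb, sb.sel)
# get the significant gene list
top.genes <- toplist(sb.sel)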
Documentation reproduced from package DESnowball, version 1.0, License: GPL-3
