The input data for the snowball analysis are whole-genome gene expression profiles and the mutation status of a recurrent genetic event for a group of samples. The analysis has been tested on human primary tumor samples, and the minimum sample size required per group is three. Snowball does not require a balanced design between groups (see references).
The main function of the package is snowball, which requires two inputs, named y and X, where y is a binary vector indicating the mutation status of the samples, and X holds the gene expression profiles, with rows corresponding to genes and columns to samples. y can be a numerical, character or logical vector. It can also be a factor. The typical format is a character vector with two values indicating the mutation status of each subject.
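The accepted codings of y described above can be illustrated with a small sketch. The sample labels ("mut", "wt") and the eight-sample design are made up for illustration and are not taken from the package; the check at the end simply mirrors the stated minimum of three samples per group.

```r
# Hypothetical mutation calls for eight tumor samples (illustrative values only).
y <- c("mut", "wt", "mut", "wt", "wt", "mut", "wt", "mut")

# The same information coded as a logical vector, a numeric vector, or a factor.
y_logical <- y == "mut"
y_numeric <- as.integer(y_logical)
y_factor  <- factor(y, levels = c("wt", "mut"))

# Check that exactly two groups are present and that each has at least
# three samples; the groups do not need to be the same size.
group_sizes <- table(y)
stopifnot(length(group_sizes) == 2, all(group_sizes >= 3))
print(group_sizes)
```

Any of these four vectors carries the same two-group information; the character form is the one described above as typical.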
X is expected to be a data.frame with gene names as its row names; typically it has already undergone initial filtering and is in log scale. A reasonable choice for the initial filtering could be based on the variation of gene expression across all the samples in the study, e.g., computing the coefficient of variation of each gene and keeping the genes whose value exceeds a given cutoff.
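As a minimal sketch of such a filter, the following base R code computes the coefficient of variation (standard deviation divided by mean) for each gene and keeps the genes above a cutoff. The data are simulated, the gene and sample names are hypothetical, and both the median cutoff and the choice to compute the coefficient of variation on the raw scale before taking logs are assumptions made for illustration, not recommendations from the package.

```r
set.seed(1)

# Simulated expression values: 200 hypothetical genes by 8 samples,
# on the raw (non-log) intensity scale.
n_genes   <- 200
n_samples <- 8
raw <- matrix(rlnorm(n_genes * n_samples, meanlog = 6, sdlog = 0.5),
              nrow = n_genes, ncol = n_samples)
X_raw <- as.data.frame(raw)
rownames(X_raw) <- paste0("gene", seq_len(n_genes))
colnames(X_raw) <- paste0("sample", seq_len(n_samples))

# Coefficient of variation per gene: standard deviation divided by mean.
gene_cv <- apply(X_raw, 1, sd) / rowMeans(X_raw)

# Keep genes whose CV exceeds the cutoff (here the median CV, an arbitrary
# choice), then move the retained genes to log2 scale.
cv_cutoff <- median(gene_cv)
X <- log2(X_raw[gene_cv > cv_cutoff, , drop = FALSE])

dim(X)
```

The resulting X is a data.frame with gene names as row names and one column per sample, which matches the layout described above.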
The other functions include plotJn for visualizing gene selection, select.features for gene ranking and statistical significance assessment, and toplist to report the top genes based on the user-provided cutoff.
McArdle, B.H. and Anderson, M.J. (2001) Fitting multivariate models to community data: A comment on distance-based redundancy analysis. Ecology 82(1): 290-297.
Xu, Y., Guo, X., Sun, J. and Zhao, Z. Snowball: resampling combined with distance-based regression to discover transcriptional consequences of driver mutation, manuscript.
Guo, X., Xu, Y. and Zhao, Z. Driver mutation BRAF regulates cell proliferation and apoptosis via MITF in the pathogenesis of melanoma, manuscript.