# snowball

##### main function for Snowball analysis

This is the main function for performing Snowball analysis. It requires only minimal input (`y` and `X`); the remaining operating parameters have default values.

##### Usage

```
snowball(y, X, ncore = 1, d = 300, B = 10000, B.i = 2000,
sample.n = 100, resample.method = c("sample", "none", "combn"),
mode.resample = c("count.class", "flat", "percent.class"), k.resample = 1)
```

##### Arguments

- y
- a factor variable for mutation status
- X
- data.frame containing gene expression data. The columns of `X` should be aligned with `y` on samples
- ncore
- number of processors to use for parallel computation. Set `ncore = 1` or `NULL` for non-parallel computation mode
- d
- the size of gene subset for gene level resampling. See references on $d$ in $X_d^x$
- B
- bootstrap size, which is $B$ in $J_n(x)$, defining the total number of gene subsets used to estimate $J_n$, $$J_n(x)=\frac{1}{B}\sum_{i=1}^{B}(\frac{1}{K}\sum_{j=1}^{K}\phi_n(g(X_{i,j}),\kappa))$$
- B.i
- bootstrap size deployed on each child job in parallel mode
- sample.n
- number of samples drawn from the subject level resampling, denoted as $K$ in $J_n(x)$. It is ignored if `resample.method = "none"` or `"combn"`
- resample.method
- this defines how the subject level resampling is performed. The possible values are `"sample"`, `"none"` and `"combn"`. Use `resample.method = "sample"` for random sampling with replacement; see the Note section for `"none"` and `"combn"`
- mode.resample
- this specifies how the subjects are counted for subject level leave-k-out random sampling, and whether stratification by group is applied. The possible input values are `"count.class"`, `"percent.class"` or `"flat"`
- k.resample
- a numerical value that specifies the number of subjects left out during the subject level resampling. It is an integer if `mode.resample = "count.class"` and a number between 0 and 1 if `mode.resample = "percent.class"`
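
Before calling `snowball`, it can help to confirm that the columns of `X` line up with `y`. The following is a minimal base-R sketch, assuming that samples are identified by `colnames(X)` and, when present, `names(y)`. The helper `check_alignment` and the toy data are hypothetical and are not part of DESnowball.

```r
# Hypothetical helper (not part of DESnowball): verify that the columns of X
# correspond one-to-one, in order, with the entries of y.
check_alignment <- function(y, X) {
  stopifnot(is.factor(y), is.data.frame(X))
  if (ncol(X) != length(y)) {
    stop("X has ", ncol(X), " columns but y has ", length(y), " entries")
  }
  # Compare sample names only when y carries them.
  if (!is.null(names(y)) && !identical(names(y), colnames(X))) {
    stop("sample names of y and column names of X differ")
  }
  invisible(TRUE)
}

# Toy data (made up): 3 genes (rows) by 4 samples (columns).
y <- factor(c(s1 = "mut", s2 = "wt", s3 = "mut", s4 = "wt"))
X <- data.frame(s1 = c(1.2, 0.4, 3.1), s2 = c(0.9, 0.5, 2.8),
                s3 = c(1.5, 0.3, 3.3), s4 = c(1.0, 0.6, 2.7),
                row.names = c("geneA", "geneB", "geneC"))
aligned <- check_alignment(y, X)
```

If the check fails, reordering the columns of `X` to match the order of `y` is usually the simplest fix.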

##### Value

- A data.frame containing two variables: `weights` and `positives`. `weights` are the $J_n(x)$ values for all genes and `positives` are indicators of whether a specific $J_n(x)$ is above or below the median of all $J_n(x)$ values.
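
To illustrate how the two columns relate, the sketch below derives a `positives` indicator from a made-up `weights` vector. It assumes that "above the median" means strictly greater and that the indicator is logical; the package may encode it differently (for example as 0/1) or treat ties in another way.

```r
# Toy J_n(x) values for five hypothetical genes (made-up numbers).
weights <- c(geneA = 0.91, geneB = 0.12, geneC = 0.55,
             geneD = 0.30, geneE = 0.78)

# Flag genes whose weight lies above the median of all weights.
# Assumption: strictly greater than the median counts as "above".
positives <- weights > median(weights)

# Same two-column layout as described for the return value.
result <- data.frame(weights = weights, positives = positives)
print(result)
```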

##### Note

The resampling is applied on two dimensions (see references): gene level resampling and subject level resampling. The gene level resampling is straightforward: each time, it draws `d` genes at random from all the genes in `X`. The subject level resampling is specified by the combination of values given in `sample.n`, `resample.method`, `mode.resample` and `k.resample`.

The flat resampling on all subjects regardless of grouping, specified by letting `resample.method="none"`, is simply a leave-k-out random sampling, where k is given by `k.resample`. In more complex cases, the subject level resampling can be stratified by the groups defined in `y`, in which case `resample.method` takes the value of either `"sample"` or `"combn"`. When `resample.method = "sample"`, leave-k-out random sampling is applied within each group and only `sample.n` samples are generated from the resampling. When `resample.method = "combn"`, all possible combinations that satisfy the restrictions given by `mode.resample` and `k.resample` are included. In this case, the total number of resampled samples varies with the sample size of the study.

`mode.resample="count.class"` or `"percent.class"` defines two ways to calculate the number of subjects to be left out in the random sampling. `"count.class"` indicates the exact number to be left out and `"percent.class"` indicates the percentage of total subjects to be left out. In all cases, `k.resample` specifies the number of subjects left out in the leave-k-out sampling. If `k.resample` is a single integer, the subjects will be sampled with exactly `k.resample` subjects left out, either across all the subjects in the case of flat sampling, or within each group in the case of stratified resampling by group. If instead `k.resample` is a vector of two integers, the sampling leaves out that number of subjects from each of the two groups: the first number is assigned to the first factor level and the second number to the second factor level of `factor(y)`. Check `factor(y)` to see how the two numbers in `k.resample` would be assigned to the two groups. A vector with two values for `k.resample` produces an error if `mode.resample = "flat"`. This flexible way of defining the sampling scheme allows easy specification of balanced sample sizes between groups. See references for more details.
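
The stratified leave-k-out scheme described in this note can be sketched in base R. This is an illustration only, not the package's implementation: the helpers `n_left_out`, `leave_k_out_once` and `leave_k_out_all` are hypothetical names, and rounding a percentage to the nearest whole subject is an assumption.

```r
# Hypothetical helpers (not part of DESnowball) illustrating the scheme.

# Number of subjects to leave out of a group of size n.
# Assumption: a percentage is rounded to the nearest whole subject.
n_left_out <- function(n, k, mode = c("count.class", "percent.class")) {
  mode <- match.arg(mode)
  if (mode == "count.class") k else round(n * k)
}

# One stratified leave-k-out draw: within each level of y, drop k subjects
# at random and return the indices of the subjects that are kept.
leave_k_out_once <- function(y, k, mode = "count.class") {
  y <- factor(y)
  k <- rep_len(k, nlevels(y))  # a scalar k applies to every group
  kept <- lapply(seq_len(nlevels(y)), function(g) {
    idx <- which(y == levels(y)[g])
    drop_n <- n_left_out(length(idx), k[g], mode)
    if (drop_n == 0) return(idx)
    idx[-sample.int(length(idx), drop_n)]
  })
  sort(unlist(kept))
}

# All stratified leave-k-out subsets (a "combn"-style enumeration).
leave_k_out_all <- function(y, k) {
  y <- factor(y)
  k <- rep_len(k, nlevels(y))
  per_group <- lapply(seq_len(nlevels(y)), function(g) {
    idx <- which(y == levels(y)[g])
    # Each combination is one way to keep length(idx) - k[g] subjects.
    combn(length(idx), length(idx) - k[g],
          FUN = function(i) idx[i], simplify = FALSE)
  })
  # Pair every kept-set of one group with every kept-set of the others.
  grid <- expand.grid(lapply(per_group, seq_along))
  lapply(seq_len(nrow(grid)), function(r) {
    picks <- unlist(grid[r, ], use.names = FALSE)
    sort(unlist(Map(function(sets, j) sets[[j]], per_group, picks)))
  })
}

set.seed(1)
y <- factor(c("mut", "mut", "mut", "wt", "wt", "wt", "wt"))
one_draw <- leave_k_out_once(y, k = c(1, 2))  # drop 1 "mut" and 2 "wt"
all_sets <- leave_k_out_all(y, k = 1)         # 3 * 4 = 12 subsets
```

Because `levels(factor(y))` is sorted alphabetically here, the first element of `k` applies to `"mut"` and the second to `"wt"`, which mirrors the assignment rule described above. The enumeration grows combinatorially with group size, which is why the number of resampled sets depends on the sample size of the study.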

##### References

Xu, Y., Guo, X., Sun, J. and Zhao, Z. Snowball: resampling combined with distance-based regression to discover transcriptional consequences of driver mutation, manuscript.

##### Examples

```
library(DESnowball)
data(snowball.demoData)
# check the demo dataset
print(sb.mutation)
head(sb.expression)
## A test run
Bn <- 10000   # total number of gene subsets (B)
ncore <- 4    # number of processors for parallel computation
# call Snowball
sb <- snowball(y = sb.mutation, X = sb.expression,
               ncore = ncore, d = 100, B = Bn,
               sample.n = 1)
# process the gene ranking and selection
sb.sel <- select.features(sb)
# plot the Jn values
plotJn(sb, sb.sel)
# get the significant gene list
top.genes <- toplist(sb.sel)
```

*Documentation reproduced from package DESnowball, version 1.0, License: GPL-3*