Bicluster data is generated in block format, which makes it suitable
for visualization: the observations and samples that belong to a
bicluster are consecutive. This allows visual inspection because the
user can identify the blocks and check whether they have been found
or reconstructed. Essentially, the data generation model is the sum of
outer products of sparse vectors:
$$X = \sum_{i=1}^{p} \lambda_i z_i^T + U$$
where the number of summands $p$
is the number of biclusters.
In matrix form, the factorization is
$$X = L Z + U$$
and the noise-free data is
$$Y = L Z$$
Here $\lambda_i$ are from $R^n$, $z_i$ from
$R^l$, $L$ from $R^{n \times p}$,
$Z$ from $R^{p \times l}$, and $X$, $U$, $Y$
from $R^{n \times l}$.
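
The equivalence between the sum of outer products and the matrix product can be checked numerically. The following is a minimal R sketch, not the package's generator: the sizes n, l, p, the noise level 0.1, and the dense random vectors are illustrative assumptions.

```r
set.seed(1)
n <- 6; l <- 5; p <- 2            # illustrative sizes (assumed)

L <- matrix(rnorm(n * p), n, p)   # column i plays the role of lambda_i
Z <- matrix(rnorm(p * l), p, l)   # row i plays the role of z_i^T

# Sum of the p outer products lambda_i z_i^T
Y_sum <- matrix(0, n, l)
for (i in seq_len(p)) {
  Y_sum <- Y_sum + outer(L[, i], Z[i, ])
}

Y <- L %*% Z                      # noise-free data, Y = L Z
U <- matrix(rnorm(n * l, mean = 0, sd = 0.1), n, l)  # assumed noise level
X <- Y + U                        # observed data, X = L Z + U

# Both formulations give the same noise-free matrix
stopifnot(isTRUE(all.equal(Y_sum, Y)))
```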
Sequentially, the $L_i$ are generated using n, f2, of2, sd_l_noise, mean_l, and sd_l.
of2 gives the minimal number of observations participating in a
bicluster, to which between 0 and $n/f2$ observations are added,
where the number is chosen uniformly. sd_l_noise gives the
noise of observations not participating in the bicluster.
mean_l and sd_l determine the Gaussian from
which the values are drawn for the observations that participate in
the bicluster. The sign of the mean is randomly chosen for each
component.
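
The description above can be sketched in R for a single component. This is a hypothetical helper, not the package's code: the name make_block_column, the start argument, truncating a block at the end of the vector, rounding $n/f2$ down, and using zero-mean background noise are all assumptions made for illustration.

```r
# Hypothetical helper: draws one vector lambda_i whose participating
# observations form one consecutive block beginning at `start`.
make_block_column <- function(n, f2, of2, sd_l_noise, mean_l, sd_l, start = 1) {
  # Block size: of2 plus a uniformly chosen integer in 0..floor(n / f2)
  extra <- sample.int(floor(n / f2) + 1, 1) - 1
  size <- min(of2 + extra, n - start + 1)   # assumption: truncate at the end
  idx <- seq(start, length.out = size)

  # Observations outside the bicluster: Gaussian noise (zero mean assumed)
  lambda <- rnorm(n, mean = 0, sd = sd_l_noise)

  # Observations inside the bicluster: Gaussian with a randomly signed mean
  sgn <- sample(c(-1, 1), 1)
  lambda[idx] <- rnorm(size, mean = sgn * mean_l, sd = sd_l)

  list(lambda = lambda, idx = idx)
}

set.seed(42)
col <- make_block_column(n = 100, f2 = 5, of2 = 10,
                         sd_l_noise = 0.2, mean_l = 3, sd_l = 1)
range(col$idx)   # first and last observation of the block
```

The same construction, with l, f1, of1, sd_z_noise, mean_z, and sd_z in place of the parameters above, would apply to the samples.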
Sequentially, the $Z_i$ are generated using l, f1, of1, sd_z_noise, mean_z, and sd_z.
of1 gives the minimal number of samples participating in a
bicluster, to which between 0 and $l/f1$ samples are added,
where the number is chosen uniformly. sd_z_noise gives the
noise of samples not participating in the bicluster.
mean_z and sd_z determine the Gaussian from
which the values are drawn for the samples that participate in
the bicluster.
$U$ is the overall zero-mean Gaussian noise, whose standard deviation
is set by sd_noise.
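
To illustrate how the block structure and the overall noise combine, the following R sketch assembles a small data set. It is only an illustration under stated assumptions: the block sizes are fixed rather than drawn at random, entries outside a bicluster are set exactly to zero (as if sd_l_noise and sd_z_noise were 0), and all numeric values are made up.

```r
set.seed(7)
n <- 60; l <- 40; p <- 3; sd_noise <- 0.5   # illustrative values (assumed)

# Assumed layout: bicluster i occupies consecutive rows and columns
row_blocks <- split(seq_len(30), rep(seq_len(p), each = 10))
col_blocks <- split(seq_len(24), rep(seq_len(p), each = 8))

L <- matrix(0, n, p)
Z <- matrix(0, p, l)
for (i in seq_len(p)) {
  L[row_blocks[[i]], i] <- rnorm(10, mean = 3, sd = 1)
  Z[i, col_blocks[[i]]] <- rnorm(8, mean = 2, sd = 1)
}

Y <- L %*% Z                                        # noise-free data
U <- matrix(rnorm(n * l, mean = 0, sd = sd_noise), n, l)
X <- Y + U                                          # noisy data
```

Plotting Y or X, for example with image(), should show the biclusters as rectangular blocks, which is what makes visual inspection possible.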
Implementation in R.