Create a data object suitable for ptycho.all
.
createData(X, y, omega = NULL, beta = NULL)
createDataBayesModel(mode = c("exchange","pleiotropy","gene"), n, p, q,
nreps, tau.min, tau.max, G)
createPubData(mode = c("tinysim","ptychoIn",
"exchange","pleiotropy","gene",
"actualGeno","actualPheno","corTest",
"fixedOmega","uniformEffects"),
X=NULL, y=NULL, var.detail=NULL, variants=NULL)
List containing:
Design matrix
Number of columns in each response
Standard deviation of the simulated noise; NULL
if
input y
is numeric
Input omega
Input beta
List of length y$nreps
(length 1 if y
is
numeric), each entry of which is a list with the following components:
omega
Matrix containing probabilities of association between
covariates and responses; row names are colnames(X)
and column
names are colnames(y)
; NULL
if input y
is numeric
indic.var
Matrix containing ones for associations and zeros
otherwise; row and column names are same as for omega
;
NULL
if input y
is numeric
beta
Matrix of effect sizes; row and column names are same
as for omega
; NULL
if input y
is numeric
y
Response matrix
eta2
Matrix with row names equal to colnames(X)
and
column names equal to colnames(y)
For createDataBayesModel
with mode
that uses a second level of
indicator variables, each entry in the replicate
list also has
components omega.grp
and indic.grp
containing the intermediate
steps of drawing the second-level indicator variable before drawing
omega
. If the argument beta
to createData
is
“createBetaNormal” (which it is when called by
createDataBayesModel
), then each replicate will also have a component
tau
giving the value drawn by a call to
runif(1, tau.min, tau.max)
.
Design matrix or alist specifying how to generate such a matrix. If
a list, the first entry is a function name and the second is a list of
arguments to the function. In createPubData
, X
is ignored
unless mode
is “actualGeno”, “actualPheno”, or
corTest
.
Numeric vector or matrix or list with the following components:
nreps
Number of replicates to simulate
q
Number of responses to generate for each replicate
sd
Standard deviation of the simulated noise
In createPubData
, y
is ignored unless mode
is
“actualPheno”.
Numeric vector or matrix or list specifying how to generate a
list with the component omega
; see Details for its meaning. If this
is a list, the first entry is a function name and the second is a list of
arguments to the function, which will be prepended by the number of rows in
output X
and the number of columns in output y
. Only used if
y
is a list.
List specifying how to generate a matrix of effect sizes. The
first entry of the list is a function name and the second is a list of
arguments to the function, which will be prepended by a matrix specifying
the variables selected and y$sd
. Only used if y
is a list.
Number of observations to simulate
Number of covariates to simulate
Number of responses to simulate for each replicate
Number of replicates to simulate
String specifying type of dataset to create:
tinysim
Simulated data included with this package;
equivalent to mode pleiotropy
except that the dataset is tiny,
with n=100
, p=10
, q=5
, and nreps=10
ptychoIn
Simulated data included with this package;
equivalent to mode gene
except that the dataset is tiny,
with n=3000
, p=10
, q=1
, and nreps=1
exchange
Create orthogonal X
and exchangeable
variants; n=5000
, p=50
, q=5
, and nreps=100
pleiotropy
Create orthogonal X
, and several variants
have nonzero effects on multiple responses; n=5000
, p=50
,
q=5
, and nreps=100
gene
Create orthogonal X
, and each group of variants
typically has either several or no variants that effect a response;
n=5000
, p=50
, q=5
, and nreps=100
actualGeno
Simulate responses for input X
corTest
Simulate q=2
responses for input X
.
There will be 10 replicates with the first variant in argument
variants
causal for both responses, 10 with the second variant
causal, and 20 with variant i
causal for response i
. No
other variant will be causal.
actualPheno
Put input X
and y
into data
object
fixedOmega
Create orthogonal X
, and each variant has
a certain probability of a nonzero effect size
uniformEffects
Same as mode fixedOmega
except that
effect sizes are uniformly rather than normally distributed
For createDataBayesModel
, mode
must be one of
“exchange”, “pleiotropy”, or “gene”.
Endpoints of uniform distribution from which to draw
tau
Number of groups of covariates; unused if mode
is not
“gene”
Data frame with row names same as column names of X
;
must have columns “MAF” and “GENE”. Ignored unless
mode
is “actualGeno”.
Character vector containing names of two columns of X
;
ignored unless mode
is “corTest”.
Laurel Stell and Chiara Sabatti
Maintainer: Laurel Stell <lstell@stanford.edu>
We describe createData
and then describe its wrappers
createDataBayesModel
and createPubData
.
Although createData
can form the data object required by
ptycho.all
when X
and y
are input, it primarily exists to
simplify simulating data from \(Y=X\beta+\epsilon\), where
\(\epsilon\) is normal with mean zero and specified standard deviation and
\(\beta\) is sparse with entries simulated as specified.
The function generates a specified number of replicates, all of which use the same design matrix \(X\). If this matrix is not input, then its argument must specify a function call to generate it. In either case, suppose \(X\) has \(n\) rows and \(p\) columns.
If the input y
is numeric, then it will be used for the lone replicate.
If it is a matrix, it must have \(n\) rows; let \(q\) be its number of
columns. If input y
is a numeric vector, it must have \(n\) entries
and will be cast as a matrix with \(q=1\) column. Otherwise, input y
is a list specifying, along with the arguments omega
and beta
,
how to simulate the response(s). Because it is useful in analysis of the
estimation of the marginal posterior distribution, the returned object always
contains, regardless of how X
and y
are specified, a matrix
eta2
with \((j,k)\) entry equal to
\(\mathbf{x}_j^T \mathbf{y}_k / (n \mathbf{y}_k^T \mathbf{y}_k)\)
If y
is to be simulated, the first step is to choose the probability
that each covariate is associated with each reponse as specified by the input
argument omega
. If this argument is a matrix, it must have size
\(p\)-by-\(q\). If it is not a matrix but is numeric, it will be passed
to matrix
to create a matrix of the correct size. Otherwise,
the matrix for each replicate will be generated by calling the function whose
name is given by omega[[1]]
with argument list
(p, q, omega[[2]])
. This function must return a list with component
omega
set to a \(p\)-by-\(q\) matrix; the list may also contain
additional components. The package contains several functions whose names
start with “createOmega” that might guide users in writing their own
functions.
The next step is to draw a \(p\)-by-\(q\) matrix indic.var
whose
\((j,k)\) entry is equal to one with probability omega[j,k]
and zero
otherwise. This matrix will be drawn until all column sums are positive.
For each entry in indic.var
that is equal to one, the effect size must
be drawn. This is done by calling the function whose name is given by
beta[[1]]
with argument list (indic.var, y$sd, beta[[2]])
. This
function must return a list with component beta
set to a
\(p\)-by-\(q\) matrix; the list may also contain additional components.
If indic.var[j,k]
is zero, then beta[j,k]
should be zero. The
package contains functions whose names start with “createBeta” that
might guide users in writing their own functions.
Finally, an \(n\)-by-\(q\) matrix of noise is drawn from
\(N(0,\sigma^2)\), where \(\sigma\) is the input noise.sd
, and
added to \(X\beta\) to obtain y
. The column names of each
response matrix generated will be y1
, y2
, and so forth.
The function createPubData
generates the data sets used in Stell and
Sabatti (2015). For mode
equal to “exchange”,
“pleiotropy”, or “geno”, it calls createData
via
createDataBayesModel
; otherwise, it calls createData
directly.
These functions also serve as additional examples of the use of
createData
. For reproducibility, createPubData
first sets the
random seed to 1234, except that it is set to 4 when mode
equals
“ptychoIn” and it does not set it when mode
equals
“corTest”.
In createDataBayesModel
, if mode
is “exchange”, then
one \(\omega \sim \mbox{Beta}(12,48)\) is drawn
independently for each trait. If mode
is “pleiotropy”, then one
probability of association for a trait is drawn from Beta(16,55) for each data
set, that probability is used to draw indic.grp
for each variant, and
then the probability of nonzero indic.var[j,k]
is drawn from
Beta(48,12) for each nonzero indic.grp[j]
. Finally, if mode
is
“gene”, the process is analogous to pleiotropy except that each trait
is simulated independently.
Stell, L. and Sabatti, C. (2015) Genetic variant selection: learning across traits and sites, arXiv:1504.00946.
createOrthogonalX
, createGroupsSim
;
also Data describes tinysim
in example below as well as another
object output by createData
### EXAMPLE 1
data(tinysim)
# Data generated with mode equal to pleiotropy, so indic.grp exists and
# has an entry for each column in X.
colnames(tinysim$X)
tinysim$replicates[[5]]$indic.grp
# X4, X6, and X9 are associated with some responses.
tinysim$replicates[[5]]$indic.var
### EXAMPLE 2
# Generate miniature data set with information shared across covariates.
set.seed(1234)
tiny1 <- createDataBayesModel(mode="gene", n=100, p=10, q=5, nreps=10,
tau.min=0.045, tau.max=0.063, G=2)
# A covariate can only have indic.var=1 if the group it belongs to has
# indic.grp=1. For example,indic.grp[1,4]=0 implies
# indic.var[groups$group2var[1],4]=0.
tiny1$replicates[[1]]$indic.grp
tiny1$omega[[2]]$groups$group2var[1]
tiny1$replicates[[1]]$indic.var
### EXAMPLE 3
# Alternatively, call createData directly
groups <- createGroupsSim(G=2, p=10)
omegaargs <- list(indic.grp.shape1=16, indic.grp.shape2=55,
shape1=48, shape2=12, groups=groups)
betaargs <- list(tau.min=0.045, tau.max=0.063)
set.seed(1234)
tiny2 <- createData(X=list("createOrthogonalX", list(n=100, p=10)),
y=list(nreps=10, q=5, sd=1),
omega=list("createOmegaCrossVars", omegaargs),
beta=list("createBetaNormal", betaargs))
identical(tiny1, tiny2)
### SEE THE CODE FOR createPubData FOR MORE EXAMPLES.
Run the code above in your browser using DataLab