y that is related to the true, unobserved trait yTRUE as follows: yTRUE = y + noise, where noise is assumed to have mean zero and a constant variance. Assume you have 1 or more surrogate markers for yTRUE corresponding to the columns of datX. The function implements several approaches for estimating yTRUE based on the inputs y and/or datX.

TrueTrait(datX, y, datXtest=NULL,
corFnc = "bicor", corOptions = "use = 'pairwise.complete.obs'",
LeaveOneOut.CV=FALSE, skipMissingVariables=TRUE,
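addLinearModel=FALSE)

datX: the number of rows of datX equals the number of observations, i.e. it should equal the length of y.
datXtest: its columns should correspond to those of datX, i.e. the two data sets should have the same number of columns, but the number of rows (test set observations) can be different.
corFnc: the correlation function, either "cor" or the biweight mid-correlation "bicor".
corOptions: additional arguments to the correlation function specified in corFnc.
LeaveOneOut.CV: if TRUE, leave-one-out cross validation estimates are calculated for y.true1 and y.true2 based on datX.
addLinearModel: if TRUE, the output also includes the predictions of the linear model lm(y~., data=datX).

datEstimates: a data frame of estimates of the true trait; the number of rows equals the length of y.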
The first column y.true1 is the average value of standardized columns of datX where standardization subtracts out the intercept term and divides by the slope of the linear regression model lm(marker~y). Since this estimate ignores the fact that the surrogate markers have different correlations with y, it is typically inferior to y.true2.
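As a minimal sketch of the standardization just described (not the package's own implementation), the base R code below regresses each marker on y, subtracts the intercept, divides by the slope, and averages the standardized columns. The helper name sketchTrue1 and the simulated data are hypothetical and serve only as an illustration.

```r
# Hypothetical helper (not part of WGCNA): unweighted average of markers
# standardized by the intercept and slope of lm(marker ~ y).
sketchTrue1 <- function(datX, y) {
  standardized <- sapply(as.data.frame(datX), function(marker) {
    fit <- lm(marker ~ y)
    intercept <- coef(fit)[[1]]
    slope <- coef(fit)[[2]]
    (marker - intercept) / slope   # puts the marker on the scale of y
  })
  rowMeans(standardized)
}

# Simulated data (assumed, for illustration only)
set.seed(1)
n <- 200
y <- rnorm(n, mean = 50, sd = 20)          # observed trait
yTRUE <- y + rnorm(n, sd = 10)             # unobserved, true trait
datX <- data.frame(m1 = 2 + 0.5 * yTRUE + rnorm(n, sd = 5),
                   m2 = -3 + 1.5 * yTRUE + rnorm(n, sd = 10))

est1 <- sketchTrue1(datX, y)
corWithTruth <- cor(est1, yTRUE)
```

Because every marker receives the same weight here, a noisy marker degrades the average as much as a precise one improves it, which is the weakness the text attributes to y.true1.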
The second column y.true2 equals the weighted average value of standardized columns of datX. The standardization is described in section 2.4 of Klemera et al. The weights are proportional to r^2/(1+r^2) where r denotes the correlation between the surrogate marker and y. Since this estimate does not include y as an additional surrogate marker, it may be slightly inferior to y.true3. That said, the difference between y.true2 and y.true3 is often negligible.
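The weighting can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the weight formula r^2/(1+r^2) is taken verbatim from the text above, the standardization is assumed to use the intercept and slope of lm(marker ~ y), and the helper name sketchTrue2 and the simulated data are hypothetical.

```r
# Hypothetical helper (not part of WGCNA): weighted average of standardized
# markers, with weights proportional to r^2 / (1 + r^2) as stated in the text.
sketchTrue2 <- function(datX, y) {
  datX <- as.data.frame(datX)
  info <- do.call(rbind, lapply(names(datX), function(v) {
    marker <- datX[[v]]
    fit <- lm(marker ~ y)
    r <- cor(marker, y, use = "pairwise.complete.obs")
    data.frame(variable = v,
               center = coef(fit)[[1]],   # intercept that is subtracted
               scale = coef(fit)[[2]],    # slope used in the denominator
               weight = r^2 / (1 + r^2))
  }))
  info$weight <- info$weight / sum(info$weight)  # normalize to sum to 1
  centered <- sweep(as.matrix(datX), 2, info$center, "-")
  standardized <- sweep(centered, 2, info$scale, "/")
  list(estimate = as.vector(standardized %*% info$weight), info = info)
}

# Simulated data (assumed): m1 is a noisy marker, m2 a precise one
set.seed(2)
n <- 200
y <- rnorm(n, mean = 50, sd = 20)
yTRUE <- y + rnorm(n, sd = 10)
datX <- data.frame(m1 = 2 + 0.5 * yTRUE + rnorm(n, sd = 15),
                   m2 = -3 + 1.5 * yTRUE + rnorm(n, sd = 10))

res2 <- sketchTrue2(datX, y)
res2$info   # variable name, center, scale, and normalized weight
```

In this simulation the marker that correlates more strongly with y (m2) receives the larger weight.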
An additional column called y.lm is added if addLinearModel=TRUE. In this case, y.lm reports the linear model predictions.
Finally, the column y.true3 is very similar to y.true2 but it includes y as an additional surrogate marker. It is expected to be the best estimate of the underlying true trait (see Klemera et al 2006).

Test set estimates are output only if a test data set has been specified via datXtest. In this case, the output contains a data frame with columns ytrue1 and ytrue2. The number of rows equals the number of test set observations, i.e. the number of rows of datXtest. Since the value of y is not known in case of a test data set, one cannot calculate y.true3. An additional column with linear model predictions y.lm is added if addLinearModel=TRUE.

Leave-one-out estimates are output only if LeaveOneOut.CV has been set to TRUE. In this case, the output contains a data frame with leave-one-out cross validation estimates of ytrue1 and ytrue2. The number of rows equals the length of y. Since the value of y is not known in case of a test data set, one cannot calculate y.true3.

SD.ytrue2: an estimate of the standard deviation between the estimate y.true2 and the true (unobserved) yTRUE. It corresponds to formula 33.

The corresponding quantity for y.true3 is an estimate of the standard deviation between the estimate y.true3 and the true (unobserved) yTRUE. It corresponds to formula 42.

A further data frame reports information for each variable (column of datX) when it comes to the definition of y.true2. The rows correspond to the variables. Columns report the variable name, the center (intercept that is subtracted to scale each variable), the scale (i.e. the slope that is used in the denominator), and finally the weights used in the weighted sum of the scaled variables.

Stratum-specific estimates are output only if Strata is different from NULL. In this case, the output has the same dimensions as datEstimates but the estimates were calculated separately for each level of Strata.

A final output has one component per level of Strata. Each component reports the estimate of SD.ytrue2 for observations in the stratum specified by unique(Strata).

1) There is a true underlying trait related to y and to a list of surrogate markers corresponding to the columns of datX.
2) There is a linear relationship between the true underlying trait and y and the surrogate markers.
3) yTRUE = y + Noise, where the Noise term has a mean of zero and a fixed variance.
4) Weighted least squares estimation is used to relate the surrogate markers to the underlying trait, where the weights are proportional to 1/ssq.j and ssq.j is the noise variance of the j-th marker.

Specifically, output y.true1 corresponds to formula 31, y.true2 corresponds to formula 25, and y.true3 corresponds to formula 34.
Although the true underlying trait yTRUE is not known, one can estimate the standard deviation between the
estimate y.true2 and yTRUE using formula 33. Similarly, one can estimate the SD for the estimate
y.true3 using formula 42. These estimated SDs correspond to output components 2 and 3, respectively.
These SDs are valuable since they provide a sense of how accurate the measure is.
To estimate the correlations between y and the surrogate markers, one can specify different
correlation measures via corFnc: the Pearson correlation ("cor") or the biweight midcorrelation
("bicor"). The usage above lists "bicor" as the default value; see help(bicor) to learn more.
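To illustrate what the corOptions string "use = 'pairwise.complete.obs'" requests from the correlation function, the sketch below applies base R's cor to simulated data with missing values. The data are made up, and bicor is deliberately not called so that the snippet runs without WGCNA installed.

```r
# Simulated trait and marker (assumed data, for illustration only)
set.seed(3)
n <- 100
y <- rnorm(n, mean = 50, sd = 20)
marker <- 0.5 * y + rnorm(n, sd = 5)
marker[c(3, 17, 42)] <- NA      # introduce missing values

# With the default use = "everything", any NA makes the result NA
rDefault <- cor(marker, y)

# Pairwise-complete observations: pairs with an NA in either vector are dropped
rPairwise <- cor(marker, y, use = "pairwise.complete.obs")

# The same value computed by hand on the complete pairs
keep <- !is.na(marker) & !is.na(y)
rManual <- cor(marker[keep], y[keep])
```

With this option, a marker that is missing for a few observations still contributes a usable correlation.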
When datX comprises observations measured in different strata (e.g. different batches or
independent data sets), one can obtain stratum-specific estimates by specifying the strata using the
argument Strata. In this case, the estimation focuses on one stratum at a time.
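The idea of estimating one stratum at a time can be sketched in base R as follows. This is an illustration only: the helper estimateOne is hypothetical and uses a simple unweighted average of standardized markers, and the simulated batch shift is an assumption chosen to make the effect visible.

```r
# Hypothetical helper (not part of WGCNA): average of markers standardized
# by the intercept and slope of lm(marker ~ y).
estimateOne <- function(datX, y) {
  standardized <- sapply(as.data.frame(datX), function(marker) {
    fit <- lm(marker ~ y)
    (marker - coef(fit)[[1]]) / coef(fit)[[2]]
  })
  rowMeans(standardized)
}

# Simulated data with two batches; batch "B" shifts every marker by +20
set.seed(4)
n <- 400
Strata <- rep(c("A", "B"), each = n / 2)
y <- rnorm(n, mean = 50, sd = 20)
yTRUE <- y + rnorm(n, sd = 10)
shift <- ifelse(Strata == "B", 20, 0)
datX <- data.frame(m1 = 2 + 0.5 * yTRUE + shift + rnorm(n, sd = 5),
                   m2 = -3 + 1.5 * yTRUE + shift + rnorm(n, sd = 10))

# Pooled estimate ignores the batches
estPooled <- estimateOne(datX, y)

# Stratum-specific estimate: fit and standardize within each stratum
estStrata <- numeric(n)
for (s in unique(Strata)) {
  idx <- which(Strata == s)
  estStrata[idx] <- estimateOne(datX[idx, , drop = FALSE], y[idx])
}

# Mean absolute deviation from the simulated true trait
madPooled <- mean(abs(estPooled - yTRUE))
madStrata <- mean(abs(estStrata - yTRUE))
```

Fitting within each stratum absorbs the batch shift into the stratum's own intercept, so the stratum-specific estimate tracks yTRUE more closely than the pooled one in this simulation.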
Cho IH, Park KS, Lim CJ (2010) An Empirical Comparative Study on Validation of Biological Age Estimation Algorithms with an Application of Work Ability Index. Mechanisms of Ageing and Development, Volume 131, Issue 2, February 2010, Pages 69-78
library(WGCNA)
# observed trait
y=rnorm(1000,mean=50,sd=20)
# unobserved, true trait (the noise vector must have the same length as y)
yTRUE =y +rnorm(1000,sd=10)
# now we simulate surrogate markers around the true trait
datX=simulateModule(yTRUE,nGenes=20, minCor=.4,maxCor=.9,geneMeans=rnorm(20,50,30) )
True1=TrueTrait(datX=datX,y=y)
datTrue=True1$datEstimates
par(mfrow=c(2,2))
for (i in seq_len(ncol(datTrue))) {
  meanAbsDev = mean(abs(yTRUE - datTrue[, i]))
  verboseScatterplot(datTrue[, i], yTRUE, xlab = names(datTrue)[i],
                     main = paste(i, "MeanAbsDev=", signif(meanAbsDev, 3)))
  abline(0, 1)
}
#compare the estimated standard deviation of y.true2
True1[[2]]
# with the true SD
sqrt(var(yTRUE-datTrue$y.true2))
#compare the estimated standard deviation of y.true3
True1[[3]]
# with the true SD
sqrt(var(yTRUE-datTrue$y.true3))