univariateRankVariables: Univariate analysis of features

Description

This function reports the mean and standard deviation of each feature in a model and ranks the features according to a user-specified score. It also performs a Kolmogorov-Smirnov (KS) test on the raw and z-standardized data, and reports the raw and z-standardized t-test scores, the p-value of the Wilcoxon rank-sum test, the integrated discrimination improvement (IDI), the net reclassification improvement (NRI), the net residual improvement (NeRI), and the area under the ROC curve (AUC). Finally, it reports the z-value of each variable's significance in the fitted model.

Usage

univariateRankVariables(variableList,
	                        formula,
	                        Outcome,
	                        data, 
	                        categorizationType = c("Raw",
	                                               "Categorical",
	                                               "ZCategorical",
	                                               "RawZCategorical",
	                                               "RawTail",
	                                               "RawZTail"), 
	                        type = c("LOGIT", "LM", "COX"), 
	                        rankingTest = c("zIDI",
	                                        "zNRI",
	                                        "IDI",
	                                        "NRI",
	                                        "NeRI",
	                                        "Ztest",
	                                        "AUC",
	                                        "CStat",
	                                        "Kendall"), 
	                        cateGroups = c(0.1, 0.9),
	                        raw.dataFrame = NULL,
	                        description = ".",
	                        uniType = c("Binary","Regression"),
	                        FullAnalysis=TRUE)

Arguments

variableList

A data frame with the candidate variables to be ranked

formula

An object of class formula with the formula to be fitted

Outcome

The name of the column in data that stores the variable to be predicted by the model

data

A data frame where all variables are stored in different columns

categorizationType

How variables will be analyzed: as given in data ("Raw"); broken into the p-value categories given by cateGroups ("Categorical"); broken into the p-value categories given by cateGroups, and weighted by the z-score ("ZCategorical"); broken into the p-value categories given by cateGroups, weighted by the z-score, plus the raw values ("RawZCategorical"); raw values, plus the tails ("RawTail"); or raw values, weighted by the z-score, plus the tails ("RawZTail")

type

Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")

rankingTest

Variables will be ranked based on: the z-score of the IDI ("zIDI"), the z-score of the NRI ("zNRI"), the IDI ("IDI"), the NRI ("NRI"), the NeRI ("NeRI"), the z-score of the model fit ("Ztest"), the AUC ("AUC"), the Somers' rank correlation ("CStat"), or the Kendall rank correlation ("Kendall")

cateGroups

A vector of percentiles to be used for the categorization procedure

raw.dataFrame

A data frame similar to data, but with unadjusted data, used to get the means and variances of the unadjusted data

description

The name of the column in variableList that stores the variable description

uniType

Type of univariate analysis: Binary classification ("Binary") or regression ("Regression")

FullAnalysis

If FALSE, the function only orders the features according to their z-statistics in the linear model
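
The ordering by z-statistic can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data frame, the feature names, and the helper zOf are hypothetical, and a univariate logistic fit via glm is assumed as the model behind the z-statistic.

```r
# Hypothetical sketch: order features by the absolute z-statistic of a
# univariate logistic fit. All names below are invented for illustration.
set.seed(42)
n <- 200
toy <- data.frame(outcome = rbinom(n, 1, 0.5))
toy$strong <- 1.5 * toy$outcome + rnorm(n)  # clearly associated feature
toy$weak   <- 0.3 * toy$outcome + rnorm(n)  # mildly associated feature
toy$noise  <- rnorm(n)                      # unrelated feature

features <- c("strong", "weak", "noise")

# z-standardize every candidate feature before fitting
toy[features] <- lapply(toy[features], function(x) as.numeric(scale(x)))

# z-value of the feature coefficient in the model outcome ~ feature
zOf <- function(feature) {
  fit <- glm(reformulate(feature, response = "outcome"),
             data = toy, family = binomial)
  coef(summary(fit))[feature, "z value"]
}

zValues <- sapply(features, zOf)
# Largest absolute z-statistic first
ranking <- zValues[order(abs(zValues), decreasing = TRUE)]
print(ranking)
```

Here ranking is a named numeric vector, so names(ranking) gives the feature order.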

Value

For binary classification analysis (cases and controls), the columns are:
Name: Name of the raw variable or of the dummy variable if the data has been categorized
parent: Name of the raw variable from which the dummy variable was created
descrip: Description of the parent variable, as defined in description
cohortMean: Mean value of the variable
cohortStd: Standard deviation of the variable
cohortKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the variable
cohortKSP: Associated p-value to the cohortKSD
caseMean: Mean value of cases (subjects with Outcome equal to 1)
caseStd: Standard deviation of cases
caseKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the variable only for cases
caseKSP: Associated p-value to the caseKSD
caseZKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable only for cases
caseZKSP: Associated p-value to the caseZKSD
controlMean: Mean value of controls (subjects with Outcome equal to 0)
controlStd: Standard deviation of controls
controlKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the variable only for controls
controlKSP: Associated p-value to the controlKSD
controlZKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable only for controls
controlZKSP: Associated p-value to the controlZKSD
t.Rawvalue: Normal inverse p-value (z-value) of the t-test performed on raw.dataFrame
t.Zvalue: z-value of the t-test performed on data
wilcox.Zvalue: z-value of the Wilcoxon rank-sum test performed on data
ZGLM: z-value returned by the lm, glm, or coxph functions for the z-standardized variable
zNRI: z-value returned by the improveProb function (Hmisc package) when evaluating the NRI
zIDI: z-value returned by the improveProb function (Hmisc package) when evaluating the IDI
zNeRI: z-value returned by the improvedResiduals function when evaluating the NeRI
ROCAUC: Area under the ROC curve returned by the roc function (pROC package)
cStatCorr: c index of Somers' rank correlation returned by the rcorr.cens function (Hmisc package)
NRI: NRI returned by the improveProb function (Hmisc package)
IDI: IDI returned by the improveProb function (Hmisc package)
NeRI: NeRI returned by the improvedResiduals function
kendall.r: Kendall $\tau$ rank correlation coefficient between the variable and the binary outcome
kendall.p: Associated p-value to the kendall.r
TstudentRes.p: p-value of the improvement in residuals, as evaluated by the paired t-test
WilcoxRes.p: p-value of the improvement in residuals, as evaluated by the paired Wilcoxon signed-rank test
FRes.p: p-value of the improvement in residual variance, as evaluated by the F-test
caseN_Z_Low_Tail: Number of cases in the low tail
caseN_Z_Hi_Tail: Number of cases in the top tail
controlN_Z_Low_Tail: Number of controls in the low tail
controlN_Z_Hi_Tail: Number of controls in the top tail

For regression analysis, the columns are:
Name: Name of the raw variable or of the dummy variable if the data has been categorized
parent: Name of the raw variable from which the dummy variable was created
descrip: Description of the parent variable, as defined in description
cohortMean: Mean value of the variable
cohortStd: Standard deviation of the variable
cohortKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the variable
cohortKSP: Associated p-value to the cohortKSD
cohortZKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable
cohortZKSP: Associated p-value to the cohortZKSD
ZGLM: z-value returned by the glm or Cox procedure for the z-standardized variable
zNRI: z-value returned by the improveProb function (Hmisc package) when evaluating the NRI
NeRI: NeRI returned by the improvedResiduals function
cStatCorr: c index of Somers' rank correlation returned by the rcorr.cens function (Hmisc package)
spearman.r: Spearman $\rho$ rank correlation coefficient between the variable and the outcome
pearson.r: Pearson r product-moment correlation coefficient between the variable and the outcome
kendall.r: Kendall $\tau$ rank correlation coefficient between the variable and the outcome
kendall.p: Associated p-value to the kendall.r
TstudentRes.p: p-value of the improvement in residuals, as evaluated by the paired t-test
WilcoxRes.p: p-value of the improvement in residuals, as evaluated by the paired Wilcoxon signed-rank test
FRes.p: p-value of the improvement in residual variance, as evaluated by the F-test
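
Several of the statistics listed above can be approximated with base R alone. The following is a rough sketch under stated assumptions, not the function's actual code: the simulated data are hypothetical, the KS test on raw data is assumed to compare against a normal distribution with the sample mean and standard deviation, and the helper pToZ (a made-up name) assumes that a two-sided p-value is mapped to a signed z-value through the inverse normal distribution.

```r
# Hypothetical sketch of a few per-variable statistics using the stats package.
set.seed(1)
n <- 150
outcome <- rep(c(0, 1), length.out = n)   # 75 controls and 75 cases
x <- rnorm(n, mean = 0.8 * outcome)       # cases are shifted upwards

cases <- x[outcome == 1]
controls <- x[outcome == 0]

cohortMean <- mean(x)
cohortStd <- sd(x)

# KS test of the variable against a normal distribution (assumed reference:
# normal with the sample mean and standard deviation)
ksRaw <- ks.test(x, "pnorm", mean = cohortMean, sd = cohortStd)
cohortKSD <- unname(ksRaw$statistic)
cohortKSP <- ksRaw$p.value

# Assumed conversion: two-sided p-value to a signed z-value
pToZ <- function(p, direction) {
  sign(direction) * qnorm(p / 2, lower.tail = FALSE)
}

# t-test between cases and controls, expressed as a z-value
tt <- t.test(cases, controls)
tZ <- pToZ(tt$p.value, unname(tt$statistic))

# Wilcoxon rank-sum test, expressed as a z-value
wt <- wilcox.test(cases, controls)
wilcoxZ <- pToZ(wt$p.value, median(cases) - median(controls))

# Kendall rank correlation between the variable and the binary outcome
kt <- cor.test(x, outcome, method = "kendall", exact = FALSE)
kendall.r <- unname(kt$estimate)
kendall.p <- kt$p.value
```

The variable names mirror the column names only for readability; values computed this way need not match the function's output exactly.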

Details

This function will create valid dummy categorical variables if, and only if, data has been z-standardized. The p-values provided in cateGroups are converted to their corresponding z-scores, which are then used to create the categories. If the data are not z-standardized, the categorization analysis returns wrong results.
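
The conversion described above can be sketched in base R. This is an illustration only: the simulated variable is hypothetical, and coding each tail as a 0/1 indicator is an assumption about how the dummy variables might look, not the package's confirmed encoding.

```r
# Hypothetical sketch: turn cateGroups into z-score cut points and tail dummies.
cateGroups <- c(0.1, 0.9)
zCuts <- qnorm(cateGroups)   # roughly -1.28 and 1.28

set.seed(7)
raw <- rnorm(500, mean = 50, sd = 10)   # variable on its original scale
z <- as.numeric(scale(raw))             # z-standardized version

# Tail indicators on z-standardized data: each tail holds about 10% of subjects
lowTail <- as.integer(z < zCuts[1])
hiTail  <- as.integer(z > zCuts[2])

# The same cut points applied to non-standardized data are meaningless:
# no value falls in the low tail and every value falls in the high tail
wrongLowTail <- as.integer(raw < zCuts[1])
wrongHiTail  <- as.integer(raw > zCuts[2])

print(c(low = mean(lowTail), high = mean(hiTail)))
print(c(wrongLow = mean(wrongLowTail), wrongHigh = mean(wrongHiTail)))
```

The second pair of indicators shows why non-standardized input produces wrong categories: the cut points live on the z scale, so they only make sense for data on that scale.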

References

Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27(2), 157-172.

Examples

	## Not run: 
# 	# Start the graphics device driver to save all plots in a pdf format
# 	pdf(file = "Example.pdf")
# 	# Get the stage C prostate cancer data from the rpart package
# 	library(rpart)
# 	data(stagec)
# 	# Split the stages into several columns
# 	dataCancer <- cbind(stagec[,c(1:3,5:6)],
# 	                    gleason4 = 1*(stagec[,7] == 4),
# 	                    gleason5 = 1*(stagec[,7] == 5),
# 	                    gleason6 = 1*(stagec[,7] == 6),
# 	                    gleason7 = 1*(stagec[,7] == 7),
# 	                    gleason8 = 1*(stagec[,7] == 8),
# 	                    gleason910 = 1*(stagec[,7] >= 9),
# 	                    eet = 1*(stagec[,4] == 2),
# 	                    diploid = 1*(stagec[,8] == "diploid"),
# 	                    tetraploid = 1*(stagec[,8] == "tetraploid"),
# 	                    notAneuploid = 1-1*(stagec[,8] == "aneuploid"))
# 	# Remove the incomplete cases
# 	dataCancer <- dataCancer[complete.cases(dataCancer),]
# 	# Load a pre-established data frame with the names and descriptions of all variables
# 	data(cancerVarNames)
# 	# Rank the variables:
# 	# - Analyzing the raw data
# 	# - According to the zIDI
# 	rankedDataCancer <- univariateRankVariables(variableList = cancerVarNames,
# 	                                            formula = "Surv(pgtime, pgstat) ~ 1",
# 	                                            Outcome = "pgstat",
# 	                                            data = dataCancer, 
# 	                                            categorizationType = "Raw", 
# 	                                            type = "COX", 
# 	                                            rankingTest = "zIDI",
# 	                                            description = "Description")
# 	# Shut down the graphics device driver
# 	dev.off()
	## End(Not run)
