
SES(target, dataset, max_k = 3, threshold = 0.05, test = NULL,
user_test = NULL, hash = FALSE, hashObject = NULL, robust = FALSE, ncores = 1)
MMPC(target, dataset, max_k = 3, threshold = 0.05, test = NULL,
user_test = NULL, hash = FALSE, hashObject = NULL, robust = FALSE, ncores = 1,
backward = FALSE)
See CondIndTests, and see testIndFisher for an example.
For all the conditional independence tests currently included in the package, see CondIndTests.
If two or more p-values are below the machine epsilon (.Machine$double.eps, which equals 2.220446e-16), all of them are set to 0. The max-min heuristic, however, requires comparing and ordering the p-values. To keep the comparison and ordering feasible, all conditional independence tests calculate the logarithm of the p-value.
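The effect of working on the log scale can be seen in a small base R sketch. It is only an illustration of the idea, not the package's internal code; the z statistics and the variable names are made up for the example.

```r
# Two very strong (hypothetical) test statistics
z <- c(40, 45)

# Two-sided p-values on the raw scale: both underflow to exactly 0,
# so they can no longer be told apart
p_raw <- 2 * pnorm(-abs(z))
print(p_raw)

# The same p-values computed directly on the log scale stay finite
p_log <- log(2) + pnorm(-abs(z), log.p = TRUE)
print(p_log)

# Ordering on the log scale is well defined: the smaller (more negative)
# log p-value belongs to the stronger association
print(order(p_log))
```

On the raw scale the two variables are tied at 0, whereas the log p-values still rank the second statistic as the more significant one.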
If there are missing values in the dataset (predictor variables), columnwise imputation takes place: the median is used for continuous variables and the mode for categorical variables. This is a naive method, so users are encouraged to make sure their data contain no missing values.
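A minimal sketch of columnwise median/mode imputation, for users who want to inspect or control the step before calling SES or MMPC. The helper impute_column and the toy data frame are hypothetical and are not part of the package.

```r
# Hypothetical helper: impute one column with its median (numeric)
# or its most frequent value (categorical)
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    tab <- table(x)                       # NA values are excluded by default
    mode_value <- names(tab)[which.max(tab)]
    x[is.na(x)] <- mode_value
  }
  x
}

# Toy data with one missing value per column
df <- data.frame(
  age   = c(21, NA, 35, 40),
  group = factor(c("a", "b", NA, "a"))
)

# Apply the helper column by column, keeping the data frame structure
df[] <- lapply(df, impute_column)
print(df)
```

Imputing explicitly like this makes it easy to swap in a better method before running the feature selection.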
If you have percentages in the (0, 1) interval, they are automatically mapped into $R$ using the logit transformation. If you set the test to testIndBeta, beta regression is used. If you have compositional data (positive multivariate data where each vector sums to 1) with NO zeros, they are also mapped into Euclidean space using the additive log-ratio (multivariate logit) transformation (Aitchison, 1986).
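The two transformations can be written out in a few lines of base R. This is a sketch of the formulas only; using the last component as the reference (denominator) of the additive log-ratio is an assumption made for the illustration, and the package may choose a different one.

```r
# Logit transformation of proportions in (0, 1)
p <- c(0.10, 0.50, 0.90)
logit_p <- log(p / (1 - p))
print(logit_p)

# Compositional data: each row is positive and sums to 1
comp <- matrix(c(0.2, 0.3, 0.5,
                 0.6, 0.1, 0.3), nrow = 2, byrow = TRUE)

# Additive log-ratio: log of each component over a reference component
# (assumption: the last column is the reference)
ref <- comp[, ncol(comp)]
alr <- log(comp[, -ncol(comp), drop = FALSE] / ref)
print(alr)
```

A composition with D components is mapped to D - 1 unconstrained real values, which is why zeros are not allowed: their logarithm is not finite.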
If you use testIndSpearman (argument "test"), the ranks of the data are calculated and used in the calculations. This speeds up the whole procedure.
See also CondIndTests and cv.ses.
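Regarding the testIndSpearman remark: computing the ranks once is enough because Spearman's correlation is Pearson's correlation applied to the ranks. The base R sketch below checks this equivalence on simulated data; it illustrates the idea and is not the package's own code.

```r
set.seed(1)
x <- rnorm(50)
y <- x^3 + rnorm(50, sd = 0.5)

# Rank each variable once, up front
rx <- rank(x)
ry <- rank(y)

# Pearson correlation of the ranks ...
rho_from_ranks <- cor(rx, ry)

# ... matches Spearman's correlation of the original data
rho_spearman <- cor(x, y, method = "spearman")

print(all.equal(rho_from_ranks, rho_spearman))
```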
set.seed(123)
#require(gRbase) #for faster computations in the internal functions
library(MXM) #provides SES and MMPC
require(hash)
#simulate a dataset with continuous data
dataset <- matrix(runif(1000 * 1000, 1, 100), ncol = 1000)
#define a simulated class variable
target <- 3 * dataset[, 10] + 2 * dataset[, 200] + 3 * dataset[, 20] + rnorm(1000, 0, 5)
#define some simulated equivalences
dataset[, 15] <- dataset[,10] + rnorm(1000, 0, 2)
dataset[, 10] <- dataset[ ,10] + rnorm(1000, 0, 2)
dataset[, 250] <- dataset[,200] + rnorm(1000, 0, 2)
dataset[, 230] <- dataset[,200] + rnorm(1000, 0, 2)
#run the SES algorithm
sesObject <- SES(target, dataset, max_k = 5, threshold = 0.05, test = "testIndFisher",
                 hash = TRUE, hashObject = NULL)
#print summary of the SES output
summary(sesObject)
#plot the SES output
plot(sesObject, mode = "all")
#get the queues with the equivalences for each selected variable
sesObject@queues
#get the generated signatures
sesObject@signatures
#get the run time
sesObject@runtime
#re-run the SES algorithm with the same or a different configuration,
#reusing the hashed statistics; this is only valid on the SAME dataset (!important)
#hashObj <- sesObject@hashObject
#sesObject2 <- SES(target, dataset, max_k = 2, threshold = 0.01, test = "testIndFisher",
#                  hash = TRUE, hashObject = hashObj)
#retrieve the results (summary, plot, sesObject2@...)
#summary(sesObject2)
#get the run time
#sesObject2@runtime
#run the MMPC algorithm
#MMPCObject <- MMPC(target, dataset, max_k = 3, threshold = 0.05, test = "testIndFisher",
#                   hash = FALSE, hashObject = NULL)
#MMPCObject@selectedVars
#MMPCObject@runtime