
Compare the similarity of two data sets using the mean, standard deviation, skewness, kurtosis, Hellinger distance, and the Kolmogorov-Smirnov (KS) test.
dataSimilarity(data1, data2, dropDiscrete=NA)
data1: A data.frame containing the reference data.
data2: A data.frame with the same number and names of columns as data1.
dropDiscrete: A vector of discrete attribute indices to skip in the comparison. Typically we might skip the class, because its distribution was forced by the user.
The method returns a list of statistics computed on both data sets:
The number of instances in data2 equal to the instances in data1.
A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data1.
A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data2.
A vector with p-values of Kolmogorov-Smirnov two-sample tests, performed on matching attributes from data1 and data2.
A list with value frequencies for discrete attributes in data1.
A list with value frequencies for discrete attributes in data2.
A list with differences in frequencies of discrete attributes' values between data1 and data2.
A matrix with rows containing the differences between statistics (mean, standard deviation, skewness, and kurtosis) computed on [0,1] normalized numeric attributes for data1 and data2.
A vector with Hellinger distances between matching attributes from data1 and data2.
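The differences computed on [0,1] normalized attributes can be illustrated with a small base R sketch. It is not the package's implementation: the helper minMax is hypothetical, and scaling both samples by their pooled minimum and maximum is an assumption made only for illustration.

```r
# Illustrative sketch only; not the package's implementation.
# minMax is a hypothetical helper; scaling by the pooled range is an assumption.
minMax <- function(x, lo, hi) (x - lo) / (hi - lo)

set.seed(1)
half <- sample(nrow(iris), nrow(iris) / 2)
x1 <- iris$Sepal.Length[half]
x2 <- iris$Sepal.Length[-half]

# scale both samples with the same bounds so that they stay comparable
lo <- min(x1, x2)
hi <- max(x1, x2)
n1 <- minMax(x1, lo, hi)
n2 <- minMax(x2, lo, hi)

# differences of statistics on the normalized attribute
dMean <- mean(n1) - mean(n2)
dSd <- sd(n1) - sd(n2)
```

Because both samples share one scale, the differences are comparable across attributes measured in different units.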
The function compares the data stored in data1 with data2 on a per-attribute basis by computing several statistics: mean, standard deviation, skewness, kurtosis, Hellinger distance, and the KS test.
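As a rough illustration of these per-attribute statistics, the following base R sketch computes them for two halves of iris. It is a minimal sketch, not the code used by dataSimilarity: the helpers momentStats and hellinger are hypothetical, the skewness and kurtosis use simple moment estimators (an assumption), and the Hellinger distance is shown only for a discrete attribute.

```r
# Illustrative sketch only; not the package's implementation.
# momentStats and hellinger are hypothetical helper names.

# mean, standard deviation, skewness, and kurtosis via simple moment estimators
momentStats <- function(x) {
  d <- x - mean(x)
  m2 <- mean(d^2)
  c(mean = mean(x),
    sd = sd(x),
    skewness = mean(d^3) / m2^1.5,
    kurtosis = mean(d^4) / m2^2)
}

# Hellinger distance between two discrete probability vectors
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)

set.seed(1)
half <- sample(nrow(iris), nrow(iris) / 2)
a <- iris[half, ]
b <- iris[-half, ]

# statistics for each numeric attribute (one column per attribute)
numCols <- sapply(a, is.numeric)
statsA <- sapply(a[, numCols], momentStats)
statsB <- sapply(b[, numCols], momentStats)

# two-sample KS test p-value for each numeric attribute;
# ties in the data trigger a warning, which is suppressed here
ksP <- sapply(names(a)[numCols], function(n) {
  suppressWarnings(ks.test(a[[n]], b[[n]])$p.value)
})

# Hellinger distance for a discrete attribute, from relative frequencies
pA <- as.numeric(table(a$Species)) / nrow(a)
pB <- as.numeric(table(b$Species)) / nrow(b)
hSpecies <- hellinger(pA, pB)
```

A Hellinger distance near 0 and large KS p-values suggest that the two samples are distributed similarly on that attribute.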
# load the package that provides rbfDataGen, newdata, and dataSimilarity
library(semiArtificial)

# use the iris data set, split into training and testing data
set.seed(12345)
train <- sample(1:nrow(iris), size = nrow(iris) * 0.5)
irisTrain <- iris[train, ]
irisTest <- iris[-train, ]

# create an RBF generator
irisGenerator <- rbfDataGen(Species ~ ., irisTrain)

# use the generator to create new data
irisNew <- newdata(irisGenerator, size = 100)

# compare statistics of the original and new data
dataSimilarity(irisTest, irisNew)