
rmcfs (version 1.1.0)

mcfs: MCFS-ID (Monte Carlo Feature Selection and Interdependency Discovery)

Description

Performs Monte Carlo Feature Selection (MCFS-ID) on a given data set. The data set should define a classification problem with discrete/nominal class labels. This function returns features sorted by RI, a cutoff value separating informative from non-informative features, ID-Graph edges that denote interdependencies (ID), an evaluation of the top features, and other statistics. For a detailed description of the MCFS-ID algorithm see the citation below. If you want to use dmLab or MCFS-ID in your publication, please cite the paper:

M. Draminski, A. Rada-Iglesias, S. Enroth, C. Wadelius, J. Koronacki, J. Komorowski, 'Monte Carlo feature selection for supervised classification', BIOINFORMATICS 24(1): 110-117 (2008).

Usage

mcfs(formula, data,
    projections = 3000,
    projectionSize = 0.05,
    splits = 5,
    splitRatio = 0.66,
    balanceRatio = 1,
    splitSetSize = 1000,
    cutoffPermutations = 20,
    cutoffMethod = c("permutations", "criticalAngle", "kmeans", "mean"),
    buildID = TRUE,
    finalRuleset = TRUE,
    finalCV = TRUE,
    finalCVSetSize = 1000,
    finalCVRepetitions = 3,
    u = 1,
    v = 1,
    seed = NA,
    threadsNumber = 2)

Arguments

formula
specifies the decision attribute and the relation between the class and the other attributes (e.g. class~.).
data
defines the input data.frame containing all features together with the decision attribute. The data.frame must contain columns of appropriate types; see the package documentation for how character, factor, Date, POSIXct and POSIXt columns are handled.
projections
defines the number of subsets (projections) with randomly selected features. This parameter is usually set to a few thousand and is denoted in the paper as s.
projectionSize
defines the number of features in one subset. It can be defined by an absolute value (e.g. 100 denotes 100 randomly selected features) or by a fraction of input attributes (e.g. 0.05 denotes 5% of input features). This parameter is denoted in the paper as m.
splits
defines the number of splits of each subset. This parameter is denoted in the paper as t.
splitRatio
defines the size of the training set as a fraction of objects in the input subset.
balanceRatio
determines the way classes are balanced. It should be set to 2 or higher if the input dataset contains heavily unbalanced classes. Each subset s will then contain all the objects from the least frequent class and a randomly selected set of objects from each of the other classes.
splitSetSize
determines whether to limit the input dataset size. It helps to speed up computation for data sets with a large number of objects. If the parameter is larger than 1, it determines the number of objects that are drawn at random for each of the $s \cdot t$ decision trees.
cutoffPermutations
determines the number of permutation runs. At least 20 permutations (cutoffPermutations = 20) are needed for a statistically significant result.
cutoffMethod
determines the final cutoff method. Default value is 'permutations'. The methods of finding the cutoff value between important and unimportant attributes are the following:
  • permutations - cutoff value is estimated from permutation experiments (see cutoffPermutations);
  • criticalAngle - cutoff value is based on the critical angle of the sorted RI curve;
  • kmeans - cutoff value is based on clustering the RI values into two groups by the k-means algorithm;
  • mean - cutoff value is set to the mean of the values obtained from the remaining methods.
buildID
if TRUE, Interdependency Discovery is enabled and all ID-Graph edges are collected.
finalRuleset
if TRUE, classification rules are created by the Ripper algorithm on the basis of the final set of features.
finalCV
if TRUE, cross validation (cv) experiments are run on the final set of features. The following set of classifiers is used: C4.5, NB, SVM, kNN, logistic regression and Ripper.
finalCVSetSize
limits the number of objects used in the final cv experiment. For each cv repetition, the objects are selected randomly from the uniform distribution.
finalCVRepetitions
defines the number of repetitions of the cv experiment. The more repetitions, the more stable the result.
u
exponent for scaling the influence of tree accuracy on the RI of a feature (see the paper). By taking u = 2, trees with low accuracy are penalized more severely than when taking u = 1. The default value is highly recommended.
v
exponent for scaling the influence of the number of samples in a node on the RI of a feature (see the paper). By taking v = 2, nodes with a smaller number of examples are penalized more severely than when taking v = 1. The default value is highly recommended.
seed
seed for random number generator in Java. By default the seed is random. Replication of the result is possible only if threadsNumber = 1.
threadsNumber
number of threads to use in the computation. More threads require more CPU cores, and memory usage is slightly higher. It is recommended to set this value to at most the number of available CPU cores.
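The projectionSize rule above (absolute count vs. fraction of input attributes) can be sketched in plain R. Note that resolve_projection_size is a hypothetical helper of our own, not part of rmcfs, and the exact rounding used internally is an assumption:

```r
# Sketch: how projectionSize may translate into the number m of features
# drawn per projection. A value < 1 is treated as a fraction of the input
# attributes; a value >= 1 as an absolute count. (Hypothetical helper;
# rmcfs' internal rounding may differ.)
resolve_projection_size <- function(projectionSize, n_features) {
  if (projectionSize < 1) {
    max(1, round(projectionSize * n_features))  # e.g. 0.05 -> 5% of features
  } else {
    min(projectionSize, n_features)             # cap at available features
  }
}

resolve_projection_size(0.05, 4000)  # fraction: 5% of 4000 features -> 200
resolve_projection_size(100, 4000)   # absolute value -> 100
```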

Value

  • target - decision attribute name.
  • RI - data.frame that contains all features with relevance scores, sorted from the most relevant to the least relevant. This is the ranking of features.
  • ID - data.frame that contains feature interdependencies as graph edges. It can be converted into a graph object by the build.idgraph function.
  • distances - data.frame that contains convergence statistics of subsequent projections.
  • cmatrix - confusion matrix obtained from all $s \cdot t$ decision trees.
  • cutoff - data.frame that contains cutoff values obtained by the following methods: mean, kmeans, criticalAngle, permutations (max RI). Disregard the fourth value (contrastAttributes), since its calculation is not fully developed.
  • cutoff_value - the number of features chosen as informative by the method defined by the parameter cutoffMethod.
  • cv_accuracy - data.frame that contains classification results obtained by cross validation performed on cutoff_value top features. This data.frame exists if finalCV = TRUE.
  • permutations - data.frame that contains the following results of the permutation experiments:
    • perm_x - all RI values obtained from all permutation experiments;
    • RI_norm - RI obtained for the reference MCFS experiment (i.e., the experiment on the original data) and p-values from the Anderson-Darling normality test applied separately, for each feature, to its set of cutoffPermutations RI values;
    • t_test_p - p-values from Student's t-test applied separately, for each feature, to its cutoffPermutations RI values vs. the reference RI. This data.frame exists if cutoffPermutations > 0.
  • jrip - classification rules produced by the Ripper algorithm and the related cross validation result obtained for the top features.
  • exec_time - execution time of MCFS-ID.
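A quick sketch of how the pieces above fit together: the first cutoff_value rows of the RI ranking are the informative features. The object below is a mock list with the same fields; the column name 'attribute' is an assumption for illustration (check names(result$RI) on a real result):

```r
# Mock of an mcfs() result with the fields described above; with a real
# result you would use the object returned by mcfs() directly.
res <- list(
  RI = data.frame(attribute = c("f1", "f2", "f3", "f4"),  # assumed column name
                  RI_norm   = c(0.9, 0.7, 0.2, 0.1),
                  stringsAsFactors = FALSE),
  cutoff_value = 2
)

# Features ranked above the cutoff are considered informative.
top_features <- head(res$RI$attribute, res$cutoff_value)
top_features  # "f1" "f2"
```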

Examples

  ####################################
  ######### Artificial data ##########
  ####################################
  
  ### Set up the Java heap size parameter and load the rmcfs package
  options(java.parameters = "-Xmx4g")
  library(rmcfs)
  
  # create input data
  adata <- artificial.data(rnd.features = 10)
  info(adata)
  
  result <- mcfs(class~., adata, projections = 300, projectionSize = 4, 
                 cutoffPermutations = 5, finalCV = TRUE, finalRuleset = TRUE, 
                 threadsNumber = 2)

  # Print basic information about mcfs result.
  print(result)
  
  # Review cutoff values for all methods
  print(result$cutoff)
  
  # Review cutoff value used in plots
  print(result$cutoff_value)
  
  # Plot & print out distances between subsequent projections. 
  # These are convergence MCFS-ID statistics.
  plot(result, type="distances")
  print(result$distances)
  
  # Plot & print out 50 most important features.
  plot(result, type="ri", size = 50)
  # Show max RI values from permutation experiment.
  plot(result, type = "ri", size = 50, plot_permutations = TRUE)
  print(head(result$RI, 50))
  
  # Plot & print out 50 strongest feature interdependencies.
  plot(result, type = "id", size = 50)
  print(head(result$ID, 50))
  
  # Plot features ordered by RI_norm. Parameter 'size' is the number of 
  # top features in the chart. We set this parameter a bit larger than cutoff_value.
  plot(result, type = "features", size = result$cutoff_value * 1.1, cex = 1)
  # Here we set 'size' at fixed value 10.
  plot(result, type = "features", size = 10)
  
  # Plot cv classification result obtained on top features.
  # In the middle of the x axis, the red label denotes cutoff_value.
  plot(result, type = "cv", measure = "wacc", cex = 0.8)
  
  # Plot & print out confusion matrix. This matrix is the result of 
  # all classifications performed by all decision trees on all s*t datasets.
  plot(result, type = "cmatrix")
  
  # build interdependencies graph (all default parameters).
  gid <- build.idgraph(result)
  plot(gid)
  
  # build interdependencies graph for top 6 features 
  # and top 12 interdependencies and plot all nodes
  gid <- build.idgraph(result, size = 6, size_ID = 12, plot_all_nodes = TRUE)
  plot(gid, label.dist = 1)

  # Export graph to graphML (XML structure)
  path <- tempdir()
  igraph::write.graph(gid, file = paste0(path, "/artificial.graphml"), 
              format = "graphml", prefixAttr = FALSE)
  
  # Export and import results to/from csv files
  export.result(result, path = path, label = "artificial", save.rds = FALSE)
  result <- import.result(path = path, label = "artificial")

  ####################################
  ########## Alizadeh data ###########
  ####################################
  # Load Alizadeh dataset.
  data(alizadeh)
  info(alizadeh)
  
  # Fix data types and data values - replace characters such as "," " " "/" etc.
  # in values and column names, and fix data types.
  # This function may help if mcfs has any problems with the input data.
  alizadeh <- fix.data(alizadeh)
  
  # Parametrize and run the MCFS-ID procedure; projectionSize (m) is set at 5 percent
  # of input columns. For larger data (thousands of features) the default settings are good enough.
  # This example may take a few minutes since it uses a real dataset.
  result <- mcfs(class~., alizadeh, projections = 3000, projectionSize = 0.05, 
                  cutoffPermutations = 20, threadsNumber = 8)

  # Print basic information about mcfs result.
  print(result)
