Creates and configures all objects needed for a "variable selection" problem. It configures Gene, Chromosome, Niche, World, Galgo and BigBang objects.
configBB.VarSelMisc(
file=NULL,
data=NULL,
strata=NULL,
train=rep(2/3,333),
test=1-train,
force.train=c(),
force.test=c(),
main="project",
test.error=c(0,1),
train.error=c("kfolds","splits","loocv","resubstitution"),
train.Ksets=-1, # -1 : auto-detection ==> max(min(round(13-n/11),n),3) n=samples
train.splitFactor=2/3,
fitnessFunc=NULL,
scale=FALSE,
geneFunc=runifInt,
chromosomeSize=5,
populationSize=-1,
niches=1,
worlds=1,
immigration=c(rep(0,18),.5,1),
mutationsFunc=function(ni) length(ni),
crossoverFunc=function(ni) round(length(ni)/2,0),
crossoverPoints=round(chromosomeSize/2,0),
offspringScaleFactor=1,
offspringMeanFactor=0.85,
offspringPowerFactor=2,
elitism=c(rep(1,9),.5),
goalFitness=0.90,
galgoVerbose=20,
maxGenerations=200,
minGenerations=10,
galgoUserData=NULL, # additional user data for galgo
maxBigBangs=1000,
maxSolutions=1000,
onlySolutions=FALSE,
collectMode="bigbang",
bigbangVerbose=1,
saveFile="?.Rdata",
saveFrequency=50,
saveVariable="bigbang",
callBackFuncGALGO=function(...) 1,
callBackFuncBB=plot,
callEnhancerFunc=function(chr, parent) NULL,
saveGeneBreaks=NULL,
geneNames=NULL,
sampleNames=NULL,
bigbangUserData=NULL # additional user data for bigbang
)
The file containing the data. The first row should contain the sample names and the first column the variable (gene) names. The second row must contain the class or stratum of every sample if strata is not provided. The strata are used to balance the train-test sets across the different strata. If there is only one stratum, use the same value for all samples.
If a file is not provided, data is a data matrix or data frame with samples in columns and genes in rows (with its respective colnames and rownames set). If data is provided, strata must be specified.
If a file is not provided, specifies the classes or strata of the data. If the file is provided and strata is specified, the second row of the file is treated as data. The strata are used to balance the train-test sets across the different strata. If there is only one stratum, use the same value for all samples.
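As an illustration of the layout described above, the following sketch builds a small data matrix with genes in rows and samples in columns, plus a matching strata vector. The values, the gene and sample names, and the class labels are made up for the example; only the orientation and the requirement that strata accompany data come from the text.

```r
set.seed(1)

# Hypothetical toy data: 6 genes (rows) x 8 samples (columns).
expr <- matrix(rnorm(6 * 8), nrow = 6, ncol = 8)
rownames(expr) <- paste0("gene", 1:6)     # variable (gene) names
colnames(expr) <- paste0("sample", 1:8)   # sample names

# One class/stratum label per sample (column), in column order.
classes <- factor(rep(c("A", "B"), each = 4))

# Sanity checks on the layout described above.
stopifnot(ncol(expr) == length(classes))
stopifnot(!is.null(rownames(expr)), !is.null(colnames(expr)))

# With a single stratum, the same value is repeated for every sample.
single <- rep("all", ncol(expr))
```

Objects shaped like these would be passed as data=expr and strata=classes.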
A vector of the proportions of random samples to be used as training sets. The number of sets is determined by the length of train. The sum train+test should never be greater than 1. All sets are chosen at random with the same proportion of samples per class as the original sample set.
A vector of the proportions of random samples to be used as testing sets. The number of sets is determined by the length of train. All sets are chosen at random with the same proportion of samples per class as the original sample set.
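The train and test proportions above can be sketched in base R. The defaults are taken from the usage section; stratified_train is a hypothetical helper written for this example (it is not part of galgo) that shows one way to keep the per-class proportions when drawing a training set.

```r
# Defaults from the usage section: 333 sets, each using 2/3 of the samples.
train <- rep(2/3, 333)
test  <- 1 - train

n_sets <- length(train)                # number of train/test sets
ok <- all(train + test <= 1 + 1e-12)   # train + test must not exceed 1

# Hypothetical helper (not part of galgo): draw one training set that keeps
# roughly the same proportion of samples per class as the full sample set.
stratified_train <- function(strata, prop) {
  idx <- split(seq_along(strata), strata)
  picked <- lapply(idx, function(i) {
    i[sample.int(length(i), round(length(i) * prop))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
strata <- rep(c("A", "B"), times = c(12, 6))
tr <- stratified_train(strata, train[1])   # training sample indexes
te <- setdiff(seq_along(strata), tr)       # remaining samples form the test set
table(strata[tr])                          # 8 samples of class A, 4 of class B
```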
A vector with sample indexes forced to be part of all training sets.
A vector with sample indexes forced to be part of all test sets.
A string or ID for your project. It is used in all plots and helps you distinguish results from different studies.
Vector of two weights specifying how the fitness function is evaluated to compute the test error. The first value is the weight of training and the second the weight of test. The default is c(0,1), which considers only the test error. The sum of these values should be 1.
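One plausible reading of the two weights is a weighted combination of the training and test results. The sketch below illustrates that reading only; weighted_fitness is a hypothetical helper, and the way galgo combines the values internally is an assumption here.

```r
# Hypothetical helper: combine training and test accuracy using two weights.
# Assumption: the weights act as a simple weighted sum.
weighted_fitness <- function(train_acc, test_acc, w = c(0, 1)) {
  stopifnot(length(w) == 2, abs(sum(w) - 1) < 1e-8)   # weights should sum to 1
  w[1] * train_acc + w[2] * test_acc
}

only_test <- weighted_fitness(0.95, 0.80)                # default: test only
blended   <- weighted_fitness(0.95, 0.80, c(0.5, 0.5))   # equal weights
```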
Specify how the training set is divided to compute the error in the training set (in the evolve method of the Galgo object). "splits" computes K (train.Ksets) random splits. "loocv" (leave-one-out cross-validation) uses K equal to the number of training samples. "resubstitution" performs no folding at all; it is faster and is provided for quick overviews.
The number of training set folds/splits. A negative value means automatic detection: max(min(round(13-n/11),n),3), where n is the number of samples.
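The automatic detection formula above can be evaluated directly. The sketch wraps it in a hypothetical helper, auto_ksets, to show how the number of folds/splits shrinks as the number of samples grows, with a floor of 3.

```r
# Hypothetical helper wrapping the auto-detection formula quoted above.
auto_ksets <- function(n) max(min(round(13 - n / 11), n), 3)

k <- sapply(c(2, 10, 50, 100, 200), auto_ksets)
k   # 3 10 8 4 3
```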
When train.error=="splits", specifies the proportion of samples used in splitting the training set.
Specify the function used to compute the accuracy. The required prototype is function(chr, parent, tr, te, result), where chr is the chromosome to be evaluated and parent is the BigBang object, in which all variables are exposed. The fitness function commonly uses parent$data$data, which has been transposed. tr is the vector of samples (rows) that MUST be used as training, and te the samples that must be used as test.
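A minimal sketch of a function with the prototype described above, using a nearest-centroid classifier and a mock parent built from a plain list so that it runs without galgo. Several details are assumptions not confirmed by this page: that as.numeric(chr) yields the selected variable indexes, that class labels are available as parent$data$classes (a hypothetical field name), and that result can be ignored.

```r
# Sketch only. Assumptions: as.numeric(chr) gives variable indexes, labels are
# in the hypothetical parent$data$classes, and `result` is unused.
centroid_fitness <- function(chr, parent, tr, te, result) {
  vars <- as.numeric(chr)
  x  <- parent$data$data[, vars, drop = FALSE]   # samples in rows (transposed)
  cl <- factor(parent$data$classes)
  # Class centroids estimated from the training samples only.
  cen <- sapply(levels(cl), function(k) {
    colMeans(x[tr[cl[tr] == k], , drop = FALSE])
  })
  cen <- matrix(cen, nrow = length(vars), dimnames = list(NULL, levels(cl)))
  # Assign each test sample to the nearest centroid (squared Euclidean distance).
  pred <- apply(x[te, , drop = FALSE], 1, function(s) {
    levels(cl)[which.min(colSums((cen - s)^2))]
  })
  mean(pred == as.character(cl[te]))   # fraction of correct classifications
}

# Mock data: 20 samples x 4 genes, where gene 2 separates the two classes.
set.seed(7)
cls <- rep(c("A", "B"), each = 10)
genes <- matrix(rnorm(20 * 4), nrow = 20)
genes[cls == "B", 2] <- genes[cls == "B", 2] + 10
mock_parent <- list(data = list(data = genes, classes = cls))

tr <- c(1:7, 11:17)    # training samples (rows)
te <- c(8:10, 18:20)   # test samples (rows)
acc <- centroid_fitness(c(2, 3), mock_parent, tr, te, NULL)
```

The function touches only the rows listed in tr when fitting and only the rows in te when scoring, which is the constraint the prototype imposes.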
TRUE instructs the function to scale all rows to zero mean and unit variance. By default, this value is FALSE.
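Row scaling to zero mean and unit variance can be sketched in base R as follows. Whether galgo performs the scaling in exactly this way is an assumption; the example only shows the effect described above on a made-up matrix.

```r
set.seed(3)
x <- matrix(rnorm(3 * 5, mean = 10, sd = 4), nrow = 3)   # 3 genes x 5 samples

# scale() works column-wise, so transpose, scale, and transpose back.
x_scaled <- t(scale(t(x)))

row_means <- rowMeans(x_scaled)        # approximately 0 for every row
row_sds   <- apply(x_scaled, 1, sd)    # 1 for every row
```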
Specify the function that mutates genes. The default is an integer uniform distribution function (runifInt).
Specify the chromosome size (the number of variables/genes to be included in a model). Defaults to 5. See the Gene and Chromosome objects.
Specify the number of chromosomes per niche. The default is min(20,20+(2000-nrow(data))/400). See the Chromosome and Niche objects.
Specify the number of niches. Defaults to 1. See the Niche, World and Galgo objects.
Specify the number of worlds. Defaults to 1. See the World and Galgo objects.
Specify the migration criteria.
Specify the function that returns the number of mutations to perform in the population.
Specify the function that returns the number of crossovers to perform. The default is the length of the niche divided by 2.
Specify the active positions for the crossover operator. Defaults to a single point in the middle of the chromosome. See Niche object.
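The default mutation, crossover and crossover-point expressions from the usage section can be evaluated on a stand-in. Here a plain list of 20 placeholders plays the role of a niche; that length() on a real Niche object returns its number of chromosomes is an assumption.

```r
# Defaults copied from the usage section.
mutationsFunc <- function(ni) length(ni)
crossoverFunc <- function(ni) round(length(ni) / 2, 0)

# Stand-in for a niche: a plain list of 20 placeholder chromosomes.
niche <- vector("list", 20)
n_mut   <- mutationsFunc(niche)   # 20
n_cross <- crossoverFunc(niche)   # 10

# Default crossover point for the default chromosome size.
chromosomeSize  <- 5
crossoverPoints <- round(chromosomeSize / 2, 0)
crossoverPoints   # 2, because R rounds 2.5 to the even digit
```

With an odd chromosomeSize, the "middle" point therefore falls on the lower side whenever the half value rounds to an even number.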
Scale factor for offspring generation. Defaults to 1. See Niche object.
Mean factor for offspring generation. Defaults to 0.85. See Niche object.
Power factor for offspring generation. Defaults to 2. See Niche object.
Elitism probability/flag/vector. Defaults to c(1,1,1,1,1,1,1,1,1,0.5) (elitism present for 9 generations followed by a 50% chance, then repeated). See Niche object.
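The "then repeated" behaviour of the elitism vector described above can be sketched as cyclic indexing by generation. value_at is a hypothetical helper, and the exact indexing used inside the Niche object is an assumption.

```r
# Default elitism vector from the usage section.
elitism <- c(rep(1, 9), 0.5)

# Hypothetical helper: value in effect at generation g when v is recycled.
value_at <- function(v, g) v[(g - 1) %% length(v) + 1]

gens <- c(1, 9, 10, 11, 20)
per_gen <- sapply(gens, function(g) value_at(elitism, g))
per_gen   # 1 1 0.5 1 0.5
```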
Specify the desired fitness value (fraction of correct classification). Defaults to 0.90. See Galgo object.
verbose parameter for the Galgo object.
Maximum number of generations. Defaults to 200. See Galgo object.
Minimum number of generations. Defaults to 10. See Galgo object.
Additional user data for the Galgo object. See Galgo object.
Maximum number of bigbang cycles. Defaults to 1000. See BigBang object.
Maximum number of solutions collected. Defaults to 1000. See BigBang object.
Save only when a solution is reached. Defaults to FALSE (all the information is kept, and a filter can be applied afterwards). See BigBang object.
Information to collect. Defaults to "bigbang". See BigBang object.
Verbose flag for the BigBang object. Defaults to 1. See BigBang object.
File name where the data is saved. The default in the signature is "?.Rdata"; NULL implies the name is a concatenation of classification.method, method-specific parameters, file and ".Rdata". See BigBang object.
How often the "current" solutions are saved. Defaults to 50. See BigBang object.
Internal R variable name of the saved file. Defaults to "bigbang". See BigBang object.
callBackFunc for the Galgo object. See Galgo object.
callBackFunc for the BigBang object. See BigBang object.
callEnhancerFunc for the BigBang object. See BigBang object.
saveGeneBreaks vector for the BigBang object. Defaults to NULL, which means it is computed automatically (recommended). See BigBang object.
The gene (variable) names if they differ from the first column in file or rownames(data).
The sample names if they differ from the first row in file or colnames(data).
Additional user data for the BigBang object (stored in the $data variable of the returned BigBang object).
A ready-to-use BigBang object.
*** TO DO: EXPLAIN THE STRUCTURE OF "DATA" ***
Wrapper function. Configures all objects from the parameters.
# NOT RUN {
bb <- configBB.VarSelMisc(...)
bb
blast(bb)
# }