configBB.VarSel: Creates and configures all objects needed for a "variable selection for classification" problem

Description

Creates and configures all objects needed for a "variable selection for classification" problem. It configures Gene, Chromosome, Niche, World, Galgo and BigBang objects.

Usage

configBB.VarSel(
	file=NULL, 
	data=NULL, 
	classes=NULL, 
	train=rep(2/3,333), 
	test=1-train, 
	force.train=c(),
	force.test=c(),
	train.cases=FALSE, 
	main="project",
	classification.method=c("knn","mlhd","svm","nearcent",
                          "rpart","nnet","ranforest","user"),
	classification.test.error=c(0,1),
	classification.train.error=c("kfolds","splits","loocv","resubstitution"),
	classification.train.Ksets=-1, 
	classification.train.splitFactor=2/3, 
	classification.rutines=c("C","R"),
	classification.userFitnessFunc=NULL,
	scale=(classification.method[1] %in% c("knn","nearcent","mlhd","svm")), 
	knn.k=3,
	knn.l=1,
	knn.distance=c("euclidean", "maximum", "manhattan", 
    "canberra", "binary", "minkowski", "pearson", "kendall", "spearman", 
    "absolutepearson","absolutekendall", "absolutespearman"),
	nearcent.method=c("mean","median"),
	svm.kernel=c("radial","polynomial","linear","sigmoid"),
	svm.type=c("C-classification", "nu-classification", "one-classification"),
	svm.nu=0.5,
	svm.degree=4,
	svm.cost=1,
	nnet.size=2,
	nnet.decay=5e-4,
	nnet.skip=TRUE,
	nnet.rang=0.1,
	geneFunc=runifInt,
	chromosomeSize=5, 
	populationSize=-1, 
	niches=1, 
	worlds=1,
	immigration=c(rep(0,18),.5,1), 
	mutationsFunc=function(ni) length(ni),
	crossoverFunc=function(ni) round(length(ni)/2,0),
	crossoverPoints=round(chromosomeSize/2,0), 
	offspringScaleFactor=1,
	offspringMeanFactor=0.85,
	offspringPowerFactor=2,
	elitism=c(rep(1,9),.5),
	goalFitness=0.90, 
	galgoVerbose=20, 
	maxGenerations=200, 
	minGenerations=10, 
	galgoUserData=NULL, 
	maxBigBangs=1000, 
	maxSolutions=1000, 
	onlySolutions=FALSE, 
	collectMode="bigbang", 
	bigbangVerbose=1, 
	saveFile="?.Rdata", 
	saveFrequency=50,
	saveVariable="bigbang",
	callBackFuncGALGO=function(...) 1,
	callBackFuncBB=plot,
	callEnhancerFunc=function(chr, parent) NULL,
	saveGeneBreaks=NULL,
	geneNames=NULL,
	sampleNames=NULL,
	bigbangUserData=NULL 
	)

Arguments

file

The file containing the data. First row should be sample names. First column should be variable names (genes). Second row must be the class for every sample if classes is not provided.

data

If a file is not provided, data is a data matrix or data frame with samples in columns and genes in rows (with its respective colnames and rownames set). If data is provided, classes must be specified.

classes

If a file is not provided, specifies the classes for the data. If the file is provided and classes is specified, the second row of the file is considered as data.
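The layout expected for data and classes can be sketched in base R as follows. The gene names, sample names and values are made up for illustration, and passing the labels as a character vector is an assumption.

```r
# Hypothetical toy data: 4 genes (rows) by 6 samples (columns).
set.seed(1)
expr <- matrix(rnorm(24), nrow = 4, ncol = 6,
               dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6)))

# One class label per sample, in the same order as the columns of expr.
cls <- c("A", "A", "A", "B", "B", "B")

# The two objects must agree on the number of samples.
stopifnot(length(cls) == ncol(expr))
```

Objects shaped like expr and cls would then be candidates for the data and classes arguments.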

train

A vector of the proportions of random samples to be used as training sets. The number of sets is determined by the length of train. train+test should never be greater than 1. All sets are randomly chosen with the same proportion of samples per class as the original sample set.

test

A vector of the proportions of random samples to be used as testing sets. The number of sets is determined by the length of train. All sets are randomly chosen with the same proportion of samples per class as the original sample set.
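A minimal base R sketch of one such class-stratified split is shown below. stratified_split is a hypothetical helper written for illustration; it is not the routine used by the package, and the rounding of the per-class counts is an assumption.

```r
# Hypothetical helper, not part of the package: draw one training set that
# keeps roughly the class proportions of `classes`; the rest becomes test.
stratified_split <- function(classes, train = 2/3) {
  idx_by_class <- split(seq_along(classes), classes)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- round(length(idx) * train)
    # Index through sample.int() so a class with one sample is handled safely.
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(train = sort(train_idx),
       test  = setdiff(seq_along(classes), train_idx))
}

set.seed(42)
classes <- rep(c("A", "B"), times = c(12, 6))
sp <- stratified_split(classes)
table(classes[sp$train])   # class counts in the training set
```

Here the test set is simply the complement of the training set, which corresponds to the default test=1-train.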

force.train

A vector with sample indexes forced to be part of all training sets.

force.test

A vector with sample indexes forced to be part of all test sets.

train.cases

If TRUE, the training sets contain the same number of cases for each class. If a numeric vector, it is interpreted as the number of training samples per class.

main

A string or ID related to your project that will be used in all plots and would help you to distinguish results from different studies.

classification.method

The method to be used for classification. The methods currently available in this package are "knn", "mlhd", "svm", "nearcent" (nearest centroid), "rpart" (recursive partitioning trees), "nnet" (neural networks, experimental, not recommended), and "ranforest" (random forest). "user" is for classification problems in which the user provides a specific function.

classification.test.error

Vector of two weights specifying how the fitness function is evaluated to compute the test error. The first value is the weight of training and the second the weight of test. The default is c(0,1), which considers only the test error. The sum of these values should be 1.
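The effect of the two weights can be illustrated with a small calculation. This sketch assumes the weights combine the training and test accuracy linearly, which the description suggests but does not state; the accuracy values are made up.

```r
# Hypothetical illustration: c(weight of training, weight of test).
weights   <- c(0, 1)
acc_train <- 0.95   # made-up training accuracy
acc_test  <- 0.80   # made-up test accuracy

# The weights are expected to sum to 1.
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Assumed linear combination of the two accuracies.
fitness <- weights[1] * acc_train + weights[2] * acc_test
fitness
```

With the default weights only the test accuracy contributes to the result.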

classification.train.error

Specifies how the training set is divided to compute the error in the training set (in the evolve method of the Galgo object). The fitness function actually computes 1-error, where error is always the proportion of samples that have been incorrectly classified. "kfolds" (k-fold cross-validation) computes K non-overlapping sets (classification.train.Ksets), attempting to conserve class proportions. "splits" computes K (classification.train.Ksets) random splits. "loocv" (leave-one-out cross-validation) uses K equal to the number of training samples. "resubstitution" performs no folding at all; it is faster and is provided for quick overviews.
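The idea behind "kfolds" can be sketched in base R. kfold_ids is a hypothetical helper that deals the samples of each class across K non-overlapping folds; it illustrates the technique only and is not the package's implementation.

```r
# Hypothetical sketch: assign each training sample to one of K
# non-overlapping folds, spreading every class across the folds.
kfold_ids <- function(classes, K) {
  fold <- integer(length(classes))
  for (idx in split(seq_along(classes), classes)) {
    # Shuffle the positions within the class, then deal folds 1..K in turn.
    shuffled <- idx[sample.int(length(idx))]
    fold[shuffled] <- rep_len(seq_len(K), length(idx))
  }
  fold
}

set.seed(7)
classes <- rep(c("A", "B"), times = c(9, 6))
folds <- kfold_ids(classes, K = 3)
table(folds, classes)   # every fold holds 3 samples of A and 2 of B
```

Each fold would serve once as the held-out set while the remaining folds are used for fitting.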

classification.train.Ksets

The number of training set folds/splits. A negative value means automatic detection: with n samples, max(min(round(13-n/11),n),3).
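The automatic-detection formula can be evaluated directly. auto_ksets is a hypothetical name used here only to wrap the expression given in the description.

```r
# Automatic number of folds/splits for n samples, using the expression
# quoted in the argument description.
auto_ksets <- function(n) max(min(round(13 - n / 11), n), 3)

# A few sample sizes: small sets are capped by n, large sets floor at 3.
sapply(c(5, 30, 110, 200), auto_ksets)   # 5 10 3 3
```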

classification.train.splitFactor

When classification.train.error=="splits", specifies the proportion of samples used in splitting the training set.

classification.rutines

For most of the methods, both R and C code are provided. C code is preferred for performance reasons; however, finding mistakes is easier in R. In addition, the example code can be used as a guide for new user fitness functions. "rpart" has no C code. "svm" has only some improvements removing redundant checks.

classification.userFitnessFunc

For classification.method == "user", specifies the function used to compute the accuracy and the class prediction. The required prototype is function(chr, parent, tr, te, result). chr is the chromosome to be evaluated; a conversion using as.numeric is commonly needed to extract the exact values from the chromosome. parent is the BigBang object, in which all its variables are exposed. The fitness function commonly uses parent$data$data, which has been transposed. tr is the vector of samples (rows) that MUST be used as training and te the samples that must be used as test. They can correspond to training and test in the evolution or in any other context (such as the computation of the confusion matrix or the forward selection). The fitness function should return its result in one of two formats, selected by the result parameter. When result is 0 (zero), the predicted class for each test sample is required (as an integer, not as a factor); otherwise, the number of correctly classified samples from the test vector is expected.
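A minimal sketch of a function with this prototype is given below, using a nearest-mean rule in base R. It rests on assumptions not confirmed here: that parent$data$data holds samples in rows, and that parent$data$classes (a hypothetical field name) holds one integer class code per sample. A plain list stands in for the BigBang object.

```r
# Sketch of a user fitness function with the required prototype.
# ASSUMPTIONS: parent$data$data has samples in rows; parent$data$classes is a
# hypothetical field with one integer class code per sample.
nearest_mean_fitness <- function(chr, parent, tr, te, result) {
  vars <- as.numeric(chr)                    # selected variable indexes
  x    <- parent$data$data[, vars, drop = FALSE]
  cls  <- parent$data$classes
  labs <- sort(unique(cls[tr]))
  # One centroid per class (rows), computed from training samples only.
  cent <- do.call(rbind, lapply(labs, function(k)
    colMeans(x[tr[cls[tr] == k], , drop = FALSE])))
  # Predict each test sample as the class of its closest centroid.
  pred <- sapply(te, function(i)
    labs[which.min(colSums((t(cent) - x[i, ])^2))])
  # result == 0: predicted classes; otherwise: number of correct predictions.
  if (result == 0) as.integer(pred) else sum(pred == cls[te])
}

# Mock stand-in for the BigBang object, with made-up data.
set.seed(3)
x <- rbind(matrix(rnorm(20, mean = 0), 10), matrix(rnorm(20, mean = 5), 10))
parent <- list(data = list(data = x, classes = rep(1:2, each = 10)))
tr <- c(1:7, 11:17)
te <- c(8:10, 18:20)
n_ok  <- nearest_mean_fitness(c(1, 2), parent, tr, te, result = 1)
preds <- nearest_mean_fitness(c(1, 2), parent, tr, te, result = 0)
```

The two calls at the end exercise both return formats: a count of correct test predictions and an integer vector of predicted classes.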

scale

TRUE instructs the function to scale all rows to zero mean and unit variance. By default, scale is TRUE when classification.method is "knn", "nearcent", "mlhd", or "svm".
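Row-wise scaling of this kind can be sketched in base R as follows. Using the sample standard deviation (n-1 denominator, as in scale()) is an assumption about what the package does; the values are made up.

```r
# Genes in rows, samples in columns (toy values).
x <- matrix(c(1,  2,  3,
              10, 20, 30), nrow = 2, byrow = TRUE)

# scale() standardises columns, so transpose, scale, and transpose back.
x_scaled <- t(scale(t(x)))

rowMeans(x_scaled)        # each row now has mean 0
apply(x_scaled, 1, sd)    # and standard deviation 1
```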

knn.k

For KNN method, knn.k is the number of nearest neighbours to consider.

knn.l

For KNN method, knn.l is the minimum number of neighbours needed to predict a class.

knn.distance

The distance to be used in KNN method. Possible values are "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski", "pearson", "kendall", "spearman", "absolutepearson","absolutekendall", "absolutespearman" (see dist method).

nearcent.method

For nearest centroid method, nearcent.method specifies the method for computing the centroid ("mean", "median").

nnet.size

Parameter passed to nnet.

nnet.decay

Parameter passed to nnet.

nnet.skip

Parameter passed to nnet.

nnet.rang

Parameter passed to nnet.

svm.kernel

For SVM (support vector machines) method, specify the kernel method "radial","polynomial","linear" or "sigmoid" (see svm method in e1071 package).

svm.type

For SVM method, specify the type of classification.

svm.nu

For SVM method and nu-classification specify the nu value.

svm.degree

For SVM method and polynomial kernel, specify the degree value.

svm.cost

For SVM method, specify the C value (cost).

nnet.

Parameters for neural networks classification. See nnet package.

geneFunc

The function that provides random values for genes. The default is runifInt, which generates a random integer value with a uniform distribution.

chromosomeSize

Specify the chromosome size (the number of variables/genes to be included in a model). Defaults to 5. See Gene and Chromosome objects.

populationSize

Specify the number of chromosomes per niche. Defaults to min(20,20+(2000-nrow(data))/400). See Chromosome and Niche objects.
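The default expression can be evaluated for a few data sizes. default_pop is a hypothetical wrapper around the formula quoted in the description, with n_genes standing in for nrow(data).

```r
# Default population size as written in the argument description;
# n_genes plays the role of nrow(data).
default_pop <- function(n_genes) min(20, 20 + (2000 - n_genes) / 400)

sapply(c(1000, 2000, 6000), default_pop)   # 20 20 10
```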

niches

Specify the number of niches. Defaults to 1. See Niche, World and Galgo objects.

worlds

Specify the number of worlds. Defaults to 1. See World and Galgo objects.

immigration

Specify the migration criteria.

mutationsFunc

Specify the function that returns the number of mutations to perform in the population.

crossoverFunc

Specify the function that returns the number of crossovers to perform. The default is the length of the niche divided by 2.

crossoverPoints

Specify the active positions for crossover operator. Defaults to a single point in the middle of the chromosome. See Niche object.
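The default mutation, crossover and crossover-point expressions from the usage can be tried on their own. In this sketch a plain vector stands in for the niche object passed as ni, which is an assumption about how length() behaves on a niche.

```r
# Default functions copied from the usage section.
mutationsFunc <- function(ni) length(ni)
crossoverFunc <- function(ni) round(length(ni) / 2, 0)

ni <- seq_len(20)   # stand-in for a niche holding 20 chromosomes
mutationsFunc(ni)   # 20
crossoverFunc(ni)   # 10

# Default crossover point for the default chromosome size.
chromosomeSize <- 5
round(chromosomeSize / 2, 0)   # 2, since R rounds an exact half to even
```

Note that round(2.5, 0) gives 2 rather than 3, so the default crossover point for a chromosome of size 5 is position 2.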

offspringScaleFactor

Scale factor for offspring generation. Defaults to 1. See Niche object.

offspringMeanFactor

Mean factor for offspring generation. Defaults to 0.85. See Niche object.

offspringPowerFactor

Power factor for offspring generation. Defaults to 2. See Niche object.

elitism

Elitism probability/flag/vector. Defaults to c(1,1,1,1,1,1,1,1,1,0.5) (elitism present for 9 generations followed by a 50% chance, then repeated). See Niche object.
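The "then repeated" behaviour of the default vector can be sketched as follows. Indexing by generation modulo the vector length is an assumption about how the cycling is done; elitism_at is a hypothetical helper.

```r
elitism <- c(rep(1, 9), 0.5)

# Value applied at generation g, assuming the vector is recycled.
elitism_at <- function(g) elitism[(g - 1) %% length(elitism) + 1]

sapply(c(1, 9, 10, 11, 20), elitism_at)   # 1 1 0.5 1 0.5
```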

goalFitness

Specify the desired fitness value (fraction of correct classification). Defaults to 0.90. See Galgo object.

galgoVerbose

Verbose parameter for Galgo object.

maxGenerations

Maximum number of generations. Defaults to 200. See Galgo object.

minGenerations

Minimum number of generations. Defaults to 10. See Galgo object.

galgoUserData

Additional user data for the Galgo object. See Galgo object.

maxBigBangs

Maximum number of bigbang cycles. Defaults to 1000. See BigBang object.

maxSolutions

Maximum number of solutions collected. Defaults to 1000. See BigBang object.

onlySolutions

Save only when a solution is reached. Defaults to FALSE (to use all the information; a filter can then be applied afterwards). See BigBang object.

collectMode

Information to collect. Defaults to "bigbang". See BigBang object.

bigbangVerbose

Verbose flag for BigBang object. Defaults to 1. See BigBang object.

saveFile

File name where the data is saved. Defaults to NULL which implies the name is a concatenation of classification.method, method specific parameters, file and ".Rdata". See BigBang object.

saveFrequency

How often the "current" solutions are saved. Defaults to 50. See BigBang object.

saveVariable

Internal R variable name of the saved file. Defaults to "bigbang". See BigBang object.

callBackFuncGALGO

callBackFunc for Galgo object. See Galgo object.

callBackFuncBB

callBackFunc for BigBang object. See BigBang object.

callEnhancerFunc

callEnhancerFunc for BigBang object. See BigBang object.

saveGeneBreaks

saveGeneBreaks vector for BigBang object. Defaults to NULL which means to be computed automatically (recommended). See BigBang object.

geneNames

The gene (variable) names if they differ from the first column in file or rownames(data).

sampleNames

The sample names if they differ from first row in file or colnames(data).

bigbangUserData

Additional user data for BigBang object (stored in data variable in BigBang object returned).

Value

A ready-to-use BigBang object.

*** TO DO: EXPLAIN THE STRUCTURE OF "DATA" ***

Details

Wrapper function. Configures all objects from the parameters.

Examples

# NOT RUN {
bb <- configBB.VarSel(...)
bb
blast(bb)
# }
