Splits all files in dir.file whose names start with prefix.file, contain the keyword key.file, and end with ending.file into TRAIN and TEST files, based on train.percent, the percentage of the data that should go into the TRAIN file.
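The naming criterion can be illustrated with a small base R sketch. The helper `match.files` and the example file names are hypothetical and are not part of the package; the package's own matching may differ in detail.

```r
# Hypothetical helper (not part of the package): keep the file names that
# start with prefix.file, contain key.file, and end with ending.file.
match.files <- function(file.names, prefix.file, key.file = "",
                        ending.file = ".txt") {
  # An empty keyword matches every file name.
  has.key <- if (nzchar(key.file)) {
    grepl(key.file, file.names, fixed = TRUE)
  } else {
    rep(TRUE, length(file.names))
  }
  keep <- startsWith(file.names, prefix.file) &
    has.key &
    endsWith(file.names, ending.file)
  file.names[keep]
}

# Made-up file names, for illustration only.
candidates <- c("genos.chr1.txt", "genos.chr2.txt",
                "genos.chr1.ped", "other.chr1.txt")
matched <- match.files(candidates, prefix.file = "genos", key.file = "chr")
print(matched)
```

In practice the candidate names would come from something like `list.files(dir.file)`.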
pre8.split.train.test.batch(dir.file, dir.out, prefix.file, key.file = "",
ending.file = ".txt", train.percent = 80, separ = "\t", index.prefix = "index",
file.has.ext = TRUE, resample = FALSE)
train.percent=80, then 800 entries will appear in file.name
to separate entries.
dir.out (if it has been created by previous runs of this program).
file.name has a filename extension (e.g. ".txt", ".ped", ".mlgeno").
index.prefix will be saved in the dir.out directory for the given train.percent. This file will contain indices that correspond to entries taken into the TRAIN file. If resample=FALSE, then all subsequent runs of this function on other files (for example, for different chromosomes of the same dataset) with the same train.percent will use that saved file. This ensures that the same individuals go into the TRAIN file across all chromosomes. If resample=TRUE, then a new random resampling takes place and a new index file is generated and saved to the dir.out directory. Note that in this case the entries generated from this file will no longer correspond to entries generated by previous runs from previous index files; so, for consistency, re-run all chromosomes with the resample flag set to FALSE.
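The reuse of the saved index file can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper `get.train.index` is hypothetical, the index file name pattern is made up for the example, and the sketch samples from all entries at once rather than separately for CASE and CONTROL.

```r
# Hypothetical helper (not part of the package): return the TRAIN indices,
# reusing a previously saved index file unless resample = TRUE.
get.train.index <- function(n, dir.out, index.prefix = "index",
                            train.percent = 80, resample = FALSE) {
  # Assumed file name pattern, for illustration only.
  index.file <- file.path(dir.out,
                          paste0(index.prefix, ".", train.percent, ".txt"))
  if (!resample && file.exists(index.file)) {
    # An earlier run already chose the individuals: reuse that choice.
    return(scan(index.file, what = integer(), quiet = TRUE))
  }
  # Otherwise draw a fresh random sample and save it for later runs.
  idx <- sort(sample.int(n, size = round(n * train.percent / 100)))
  write(idx, file = index.file, ncolumns = 1)
  idx
}

dir.out <- tempfile("split")
dir.create(dir.out)
first  <- get.train.index(1000, dir.out)                   # creates the index file
second <- get.train.index(1000, dir.out)                   # reuses the saved indices
third  <- get.train.index(1000, dir.out, resample = TRUE)  # draws and saves new ones
```

Here `first` and `second` are identical, which is what keeps the TRAIN individuals aligned across chromosome files, while `third` overwrites the saved indices with a new draw.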
For all files in dir.file satisfying the naming criterion of prefix.file, key.file, and ending.file, split each of these files into TRAIN and TEST files, based on train.percent, the percentage of the data that should go into the TRAIN file.
The input files are expected to have the last column represent CASE and CONTROL; this is necessary to make sure that train.percent of CASE entries and train.percent of CONTROL entries go into the TRAIN file, giving an even sample of both types of entries.
If the data is saved in many files (for example, one file per chromosome), this function first randomly samples the individuals for the TRAIN file on the first file it is run on. It then uses this sampling for all other chromosomes on subsequent runs (if resample=FALSE), so that individuals in the TRAIN files correspond to one another across all chromosome files (the same holds for the TEST files). The index file is also useful for processing the family .fam file after the data has been split.
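The per-class sampling described above can be sketched in base R. The helper `stratified.train.index`, the toy data, and the 1/0 coding of CASE and CONTROL are assumptions made for illustration; this is not the package's implementation.

```r
# Hypothetical helper (not part of the package): sample train.percent of the
# rows within each class of the last column, so that CASE and CONTROL are
# represented in the same proportion in the TRAIN set.
stratified.train.index <- function(data, train.percent = 80) {
  status <- data[[ncol(data)]]                      # last column holds the class
  rows.by.class <- split(seq_len(nrow(data)), status)
  picked <- lapply(rows.by.class, function(rows) {
    n.train <- round(length(rows) * train.percent / 100)
    rows[sample.int(length(rows), n.train)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
# Toy data: 40 CASE (coded 1) and 60 CONTROL (coded 0) entries; the coding
# is an assumption for this example.
toy <- data.frame(snp1 = sample(0:2, 100, replace = TRUE),
                  status = rep(c(1, 0), times = c(40, 60)))
train.idx <- stratified.train.index(toy, train.percent = 80)
train <- toy[train.idx, ]
test  <- toy[-train.idx, ]
print(table(train$status))
```

With 40 CASE and 60 CONTROL entries and train.percent = 80, the TRAIN set receives 32 CASE and 48 CONTROL entries, and the remaining 8 and 12 go to the TEST set.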
The following files will be output:
- .train. . - the output TRAIN file, containing train.percent percent of the original data; it will appear in the dir.out directory. Its name is built from the input file name without extension, the extension (i.e. the section that follows the last "." symbol), and the percentage that was used to generate the file.
- .test. . - the TEST file, containing the remaining (100 - train.percent) percent of the data. Named similarly to the TRAIN file above.
- . .txt - the file containing indices of the entries corresponding to the TRAIN file; this file will be generated if it does not already exist in dir.out, or if resample=TRUE.
pre6.merge.genos, pre7.add.conf.var, pre8.split.train.test, run1.moss
print("See the demo 'gendemo'.")