tune1.subsets(dir.dat, dir.ped, dir.ann, dir.pos.snp, dir.pos.ann,
dir.pos.hap, dir.out, prefix.dat, prefix.ped, prefix.ann, prefix.pos.snp,
prefix.pos.ann, prefix.pos.hap, key.dat = "", key.ann = "",
key.pos.ann = "", key.pos.hap = "", ending.dat = ".dat",
ending.ped = ".ped", ending.ann = ".map", ending.pos.snp = ".snps",
ending.pos.ann = "annotation.txt", ending.pos.hap = ".hap.gz",
pos.list.triple, ped.nonsnp = 5, ann.header = FALSE, ann.snpcol = 2,
ann.poscol = 4, ann.chrcol = 0, pos.ann.header = TRUE, pos.ann.snpcol = 5,
pos.ann.poscol = 2, pos.hap.nonsnp = 2, out.name.subdir = "seg1",
out.prefix = "subdata", rsq.thresh = 0.5, num.iters = 2,
hapmapformat = FALSE, mach.loc = "/software/mach1")
hapmapformat
Defaults to FALSE. If the file listing SNPs uses the other format, which has 4 columns (.legend.txt), set hapmapformat to TRUE.
pre0.dir.create
). The dataset's SNP (.dat) and data (.ped) information is intended to come from d3 (d03_removed), whereas the dataset's position information (.map) can be obtained from the d1 (d01_plink) subdirectory. The hapmap files are huge and can be shared by many datasets, so there is no need to keep a copy of them in the subdirectory structure of each dataset. Note: if the hapmap file that specifies SNP information also lists their position information, simply provide that file (and its column format) to this function twice (as prefix.pos.snp and prefix.pos.ann).
This function is meant to start from the early pre-processing steps, re-run MaCH with hapmap on the desired regions, combine CASE with CONTROL, and then call the pre-processing functions in sequence up to and including pre6.merge.genos. The final output is a single file ready to be used by the MOSS function run1.moss.
A new, convenient subdirectory structure, similar to the one made by pre0.dir.create, will be created within the new directory out.name.subdir.
This function requires two sets of data: the user's dataset and reference haplotypes. Many hapmap libraries are available for download from the web, so this function tries to be as general as possible and lets users supply column information describing the format. MaCH also needs to understand the given hapmap format. The defaults are set for 1000G Phase I(a) from MaCH's website: http://www.sph.umich.edu/csg/abecasis/MaCH/download/1000G-PhaseI-Interim.html. Note: the data file (.hap.gz) is expected to be gzip-compressed. However, please decompress the .annotation.txt file before calling this function.
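As a rough illustration of the note above, the following base R sketch decompresses a gzipped annotation file and then picks out the SNP and position columns by index, in the spirit of pos.ann.header = TRUE, pos.ann.snpcol = 5 and pos.ann.poscol = 2. The file contents and column names are made up for the example and are not the layout of any real hapmap download.

```r
# Create a tiny gzipped file to stand in for a compressed annotation file.
# The five-column layout here is hypothetical.
gz_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_path, "wt")
writeLines(c("chr pos ref alt snp",
             "1 1000 A G rs1",
             "1 2500 C A rs2"), con)
close(con)

# Decompress it to a plain-text file using only base R connections.
txt_path <- sub("\\.gz$", "", gz_path)
con <- gzfile(gz_path, "rt")
lines <- readLines(con)
close(con)
writeLines(lines, txt_path)

# Read the plain-text file with a header row, then select columns by index:
# column 5 holds SNP names and column 2 holds positions in this toy layout.
ann <- read.table(txt_path, header = TRUE, stringsAsFactors = FALSE)
snp_names <- ann[[5]]
snp_pos   <- ann[[2]]
print(data.frame(snp = snp_names, pos = snp_pos))
```

If a real annotation file has its SNP names and positions in other columns, only the two indices would need to change.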
The first thing this function does is extract the given position intervals from the user's data files and from the haplotype files. This makes both sets of files smaller, so that running MaCH is feasible.
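The interval extraction can be pictured with a small base R sketch. It assumes a toy annotation table whose second and fourth columns hold the SNP name and position (matching the defaults ann.snpcol = 2 and ann.poscol = 4) and a hypothetical table of start/end intervals. The actual layout expected by pos.list.triple is not described here, so the `intervals` data frame and the helper `snps_in_intervals` are illustrative only.

```r
# Toy annotation table: column 2 = SNP name, column 4 = position.
ann <- data.frame(
  chr  = c(1, 1, 1, 1, 1),
  snp  = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  dist = c(0, 0, 0, 0, 0),
  pos  = c(1000, 2500, 4000, 7000, 9500),
  stringsAsFactors = FALSE
)

# Hypothetical interval layout: one row per region of interest.
intervals <- data.frame(start = c(2000, 9000), end = c(4500, 10000))

# Return the names of SNPs whose position falls inside any interval
# (interval bounds are treated as inclusive).
snps_in_intervals <- function(ann, intervals, snpcol = 2, poscol = 4) {
  pos  <- ann[[poscol]]
  keep <- rep(FALSE, length(pos))
  for (i in seq_len(nrow(intervals))) {
    keep <- keep | (pos >= intervals$start[i] & pos <= intervals$end[i])
  }
  ann[[snpcol]][keep]
}

kept <- snps_in_intervals(ann, intervals)
print(kept)
```

Only the SNPs located in the requested regions survive, which is what keeps the files passed on to MaCH small.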
MaCH is run on the CASE and CONTROL data files separately. After MaCH is run with hapmap, most of the predicted SNPs have a very low RSQ score; thus, out of the thousands of SNPs within the region in the hapmap file, only hundreds are actually reliable. This function prunes all SNPs with an RSQ score lower than rsq.thresh. CASE and CONTROL are then combined based on the remaining SNPs they have in common.
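The pruning and combining logic can be sketched in a few lines of base R. The two quality tables below are invented for illustration, and their column names (`SNP`, `Rsq`) are assumptions rather than a description of the files MaCH writes. The helper `reliable` is likewise hypothetical.

```r
# Hypothetical per-SNP quality tables for CASE and CONTROL.
case_info <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  Rsq = c(0.91, 0.32, 0.75, 0.50),
  stringsAsFactors = FALSE
)
control_info <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  Rsq = c(0.88, 0.67, 0.12, 0.64),
  stringsAsFactors = FALSE
)

rsq_thresh <- 0.5  # same value as the default rsq.thresh

# Drop SNPs whose RSQ score is lower than the threshold.
reliable <- function(info, thresh) info$SNP[info$Rsq >= thresh]

case_keep    <- reliable(case_info, rsq_thresh)
control_keep <- reliable(control_info, rsq_thresh)

# Only SNPs that remain in both groups can be combined.
common_snps <- intersect(case_keep, control_keep)
print(common_snps)
```

Because each group is pruned independently, a SNP that is reliable in only one group is dropped from the combined data.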
The function then runs the two pre-processing functions (pre5.genos2numeric.batch, pre6.merge.genos) to output the final ready-to-use file.
pre2.remove.genos, pre2.remove.genos.batch, pre3.call.mach, pre4.combine.case.control, pre4.combine.case.control.batch, pre5.genos2numeric, pre5.genos2numeric.batch, pre6.merge.genos, run1.moss
print("See the demo 'gendemo'.")