pre3.call.mach: Call MaCH imputation with and without Hapmap

Description

Calls the MaCH1 program on file.ped and file.dat. MaCH1 can be run in two ways: (1) with Hapmap, and (2) without Hapmap. NOTE: In this implementation, do NOT run "with Hapmap".

This program first runs MaCH1 on file.ped with Hapmap to fill in missing values for those SNPs that exist in the reference file; and then MaCH1 is run on the result without Hapmap to fill in all the remaining missing values. If no reference files ref.phase and ref.legend are provided, then the program runs MaCH1 without Hapmap only. To clean up any weird MaCH output, use genos.clean or pre5.genos2numeric.

Usage

pre3.call.mach(file.dat, file.ped, dir.file, ref.phase = "", ref.legend = "", 
dir.ref = "", dir.out, out.prefix = "result", chrom.num = "", num.iters = 2, 
num.subjects = 200, step2.subjects = 50, empty = "0/0", resample = FALSE, 
mach.loc = "/software/mach1")

Arguments

file.dat

The name of the data file required by MaCH1. The file should have the following format:

M SNP1
M SNP2

- Space separated
- No header
- Column 1: consists of "M"
- Column 2: character SNP names
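As a rough illustration of this layout, the following R sketch writes a two-column, space-separated .dat file with no header. The SNP names and the temporary file path are made up for the example and are not taken from the package.

```r
# Hypothetical SNP names; replace with the markers in your own data.
snp.names <- c("rs0001", "rs0002", "rs0003")

# Column 1 is the literal "M", column 2 is the SNP name.
dat <- data.frame(type = "M", snp = snp.names, stringsAsFactors = FALSE)

# Space separated, no header, no quotes, no row names.
dat.path <- file.path(tempdir(), "example.dat")
write.table(dat, dat.path, sep = " ", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

readLines(dat.path)
```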

file.ped

The name of the pedigree data file in MaCH1 input format.

p1 p1 0 0 1 C/C N/N T/C ...
p2 p2 0 0 1 T/T A/C G/G ...
...

- Tab separated
- Alleles are separated by slash '/' (IMPORTANT!)
- No header
- 5 non-SNP leading columns
- Col 1: sample/patient ID: some unique ID
- Col 2: family ID: can be same as patient ID
- Col 3 and Col 4: parents: mother/father: can all be 0
- Col 5: gender, 1-male, 2-female
- Col 6+: geno information, slash separator between alleles.
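The layout above can be sketched in R as follows. The sample IDs and genotypes are invented for the example, and "0/0" is used as the missing-value code only because it is the default of the empty argument.

```r
# Hypothetical genotypes for two samples and three SNPs; "0/0" marks a missing call.
ped <- data.frame(
  id     = c("p1", "p2"),
  family = c("p1", "p2"),
  mother = 0,
  father = 0,
  gender = c(1, 2),
  snp1   = c("C/C", "T/T"),
  snp2   = c("0/0", "A/C"),
  snp3   = c("T/C", "G/G"),
  stringsAsFactors = FALSE
)

# Tab separated, no header, no quotes, no row names.
ped.path <- file.path(tempdir(), "example.ped")
write.table(ped, ped.path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Sanity checks: 5 leading columns plus 3 SNPs, and every genotype has one slash.
fields <- strsplit(readLines(ped.path), "\t", fixed = TRUE)
stopifnot(all(lengths(fields) == 8))
genos <- unlist(lapply(fields, function(x) x[-(1:5)]))
stopifnot(all(grepl("^[^/]+/[^/]+$", genos)))
```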

dir.file

The name of directory where file.ped can be found.

ref.phase

The name of the reference file. It must have no missing values and can be obtained from websites like: http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2007-08_rel22/phased/ or similar/updated versions. The file must not be zipped; it must be a plain file that R can read.

ref.legend

The name of the legend file for ref.phase, obtained from the same website. The file must not be zipped.

dir.ref

The name of directory where ref.phase and ref.legend can be found.

dir.out

The name of directory where MaCH1 output should go.

out.prefix

The prefix that MaCH1 should use for naming output files. If num.subjects > 0, then num.subjects is appended to the prefix.

chrom.num

The optional string denoting the chromosome number, for better naming of intermediate files.

num.iters

The number of iterations MaCH1 should make in its first step to estimate its model parameters. The same number will be used for parameter estimation when using Hapmap and when NO Hapmap is used.

num.subjects

How many individuals from the sample should be used for model building by the first step of MaCH1. The random subset of individuals will be extracted by this program. The recommended number of subjects is 200-500.

step2.subjects

How many individuals should be processed at a time during the second step of MaCH computation. Value <= 0 will use all the subjects in the dataset. This variable is important to reduce the exponential computation time required by MaCH when the number of individuals is too large. However, if the number is too low, the second step might not get enough samples, thus making a weird prediction of '2' instead of an allele value. If you get '2's, try setting step2.subjects to a larger value, or remove SNPs that have '2' predicted for any of their entries, using genos.clean or pre5.genos2numeric.

empty

The way a missing/empty entry of SNP is represented in file.ped.
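To check how many entries the imputation would need to fill, one can count the genotypes equal to the empty code. The sketch below works on a small made-up genotype matrix rather than a real file.ped, and the variable names are illustrative only.

```r
empty <- "0/0"  # must match the missing-value code used in file.ped

# Hypothetical genotype columns (column 6 onward of a ped file).
genos <- matrix(c("C/C", "0/0", "T/C",
                  "T/T", "A/C", "0/0",
                  "C/T", "0/0", "T/T"),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("p1", "p2", "p3"),
                                c("snp1", "snp2", "snp3")))

# Fraction of missing entries per SNP.
missing.rate <- colMeans(genos == empty)
missing.rate
```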

resample

Whether or not to overwrite the existing file containing the num.subjects entries produced by previous runs of this algorithm with the same file.dat, file.ped and num.subjects parameters. By default, if the subjects have been sampled before, they are re-used.

mach.loc

The directory where the "mach" executable can be found.

Details

It is recommended to avoid using Hapmap functionality in this implementation.

The MaCH1 algorithm requires two steps. The first step is run on num.subjects individuals chosen at random from the set. The file with the randomly chosen individuals is saved as file.ped..ped in the dir.file directory. If that file already exists for this num.subjects, it is reused when resample=F; when resample=T, old files are ignored and a new sample is drawn. Step 1 of MaCH1 is run only if resample=T, or if the files that MaCH1 produces do not exist yet. Thus, if step 1 runs well but step 2 crashes, calling this function again does not waste time re-running step 1.
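The reuse-or-resample behaviour described above can be sketched as follows. This is only an illustration of the idea: the helper sample.subjects and the cache file name are hypothetical, not the package's actual code or file naming, and treating a non-positive num.subjects as "use everyone" is an assumption.

```r
# Hypothetical helper: not part of the package.
sample.subjects <- function(ped.lines, num.subjects, cache.path,
                            resample = FALSE) {
  # Reuse the earlier sample unless the caller asks for a new one.
  if (!resample && file.exists(cache.path)) {
    return(readLines(cache.path))
  }
  n <- length(ped.lines)
  if (num.subjects <= 0 || num.subjects >= n) {
    chosen <- ped.lines  # assumption: fall back to all subjects
  } else {
    chosen <- ped.lines[sort(sample.int(n, num.subjects))]
  }
  writeLines(chosen, cache.path)
  chosen
}

# Ten made-up ped lines with a single SNP each.
ped.lines  <- sprintf("p%d\tp%d\t0\t0\t1\tC/C", 1:10, 1:10)
cache.path <- tempfile(fileext = ".ped")

first  <- sample.subjects(ped.lines, 4, cache.path)
second <- sample.subjects(ped.lines, 4, cache.path)  # reuses the cached sample
third  <- sample.subjects(ped.lines, 4, cache.path, resample = TRUE)  # draws again
```

Because the second call finds the cached file, it returns the same subjects as the first; only the call with resample = TRUE draws a new subset.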

Without Hapmap, the running time of the second step grows exponentially with the number of subjects processed. The second step is therefore run on batches of subjects, step2.subjects at a time.
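A minimal sketch of such batching is shown below. The helper make.batches is hypothetical and only illustrates how subject indices could be split into consecutive groups; the package may form its batches differently.

```r
# Hypothetical helper: split subject indices into consecutive batches.
make.batches <- function(n.subjects, step2.subjects) {
  idx <- seq_len(n.subjects)
  if (step2.subjects <= 0 || step2.subjects >= n.subjects) {
    return(list(idx))  # one batch containing everyone
  }
  unname(split(idx, ceiling(idx / step2.subjects)))
}

batches <- make.batches(120, 50)
lengths(batches)  # 50 50 20
```

With 120 subjects and step2.subjects = 50, the last batch holds only the 20 remaining subjects.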

A subdirectory structure for debugging will be created in dir.out; the directory will be named 'working'.

Two output files will be produced in dir.out: a .ped file with no missing values, named <out.prefix><chrom.num>.mlgeno, and a .dat file (same as before).

References

MaCH website: http://www.sph.umich.edu/csg/abecasis/MACH/download/

Examples

print("See the demo 'gendemo'.")
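For orientation, a call without Hapmap might look roughly like the sketch below. All file and directory names are placeholders, and the call is guarded so that nothing runs unless the function is loaded and the input file exists. The reference arguments are left at their default "" so that only the "without Hapmap" path is used.

```r
# All paths below are placeholders; point them at your own files.
args <- list(
  file.dat       = "example.dat",
  file.ped       = "example.ped",
  dir.file       = "data",
  dir.out        = "out",
  out.prefix     = "result",
  chrom.num      = "1",
  num.iters      = 2,
  num.subjects   = 200,
  step2.subjects = 50,
  empty          = "0/0",
  resample       = FALSE,
  mach.loc       = "/software/mach1"
)
# ref.phase, ref.legend and dir.ref are omitted, so they keep their default "".

# Only attempt the call when the function and the input file are available.
if (exists("pre3.call.mach", mode = "function") &&
    file.exists(file.path(args$dir.file, args$file.ped))) {
  do.call(pre3.call.mach, args)
}
```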
