The code corresponding to the format of the data file ("gt" for
genotypes, "gp" for genotype probabilities, "gl" for genotype likelihoods in
Phred scores, "ad" for allelic depths). For all these formats, markers are
arranged in rows and individuals in columns. Variants should be ordered by
chromosome and position. By default, the first five columns are the chromosome
identifier (e.g., "1", "chr1"), the name of the marker, the position of
the marker in base pairs or, better, in cM multiplied by 1,000,000 when genetic
distances are known, the first marker allele and the second marker allele.
Information per individual varies according to the format. With the "gt"
format we have one column per individual with 0, 1 and 2 indicating the
number of copies of the first allele (and 9 for missing). With the "gp" format we
have three columns per individual with the probabilities of genotype 11
(homozygous for the first allele), genotype 12 and genotype 22 (this
corresponds to the Oxford GEN format). Similarly, with the "gl" format, we
have three columns per individual with the likelihoods for genotypes 11, 12
and 22 in Phred scores. Finally, with the "ad" format, we expect two columns
per individual: the number of reads for allele 1 and the number of reads
for allele 2. For these last three formats, missing values must be indicated
by setting all elements to 0. If any of the columns is non-zero for an
individual, the genotype is considered non-missing. Note that the
marker alleles specified in columns 4 and 5 are not used.
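
As an illustration of the "gt" layout described above, the following sketch
builds a small, entirely hypothetical file (three markers, three individuals,
9 as the missing code) and runs a few sanity checks with awk. The file
contents and the variable names are assumptions made for this example only;
they are not part of RZooRoH.

```shell
#!/bin/sh
# Hypothetical "gt" file: 3 markers (rows) and 3 individuals (columns 6-8).
# Columns 1-5: chromosome, marker name, position (bp), allele 1, allele 2.
gt_file=$(mktemp)
cat > "$gt_file" <<'EOF'
1 snp1 10500 A G 0 1 2
1 snp2 20750 C T 2 9 1
2 snp3 5300 G T 1 0 0
EOF

# Distinct column counts across rows; a well-formed file gives a single
# value equal to 5 + number of individuals.
n_cols=$(awk '{ print NF }' "$gt_file" | sort -u)
echo "columns per row: $n_cols"

# Genotype entries that are not one of the codes 0, 1, 2 or 9.
n_bad=$(awk '{ for (i = 6; i <= NF; i++) if ($i !~ /^[0129]$/) bad++ }
             END { print bad + 0 }' "$gt_file")
echo "invalid genotype codes: $n_bad"

# Missing genotypes (code 9).
n_missing=$(awk '{ for (i = 6; i <= NF; i++) if ($i == 9) m++ }
                 END { print m + 0 }' "$gt_file")
echo "missing genotypes: $n_missing"

# Consecutive rows on the same chromosome whose position decreases.
# This only checks position order within a chromosome, not chromosome order.
n_unsorted=$(awk '$1 == chr && $3 < pos { bad++ }
                  { chr = $1; pos = $3 }
                  END { print bad + 0 }' "$gt_file")
echo "out-of-order positions: $n_unsorted"

rm -f "$gt_file"
```

The same checks can be adapted to the "gp", "gl" and "ad" formats by
stepping through the individual columns in groups of three or two.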
Conversion of a PLINK ped file or a VCF file to RZooRoH format can easily be
performed using PLINK (version 1.9) or using bcftools.
For ped files, recode them to the Oxford GEN format with plink --file myinput
--recode oxford --autosome --out myoutput. The --autosome option keeps only
SNPs on autosomes, as required by RZooRoH.
For VCF files, bcftools can recode a VCF to the Oxford GEN format
with the convert command: bcftools convert -t ^chrX,chrY,chrM -g outfile
--chrom --tag GT myfile.vcf. The --chrom option is important to obtain the
chromosome in the first column. The --tag option selects which
field from the VCF file (GT, PL, GL or GP) is used to generate the genotype
probabilities exported in the Oxford GEN format. The -t option, with the ^
prefix, excludes the listed chromosomes (this is an example and chromosome
names must be adapted if necessary). The required output file is then
outfile.gen.
If some genotype probabilities are missing, with a value of "-nan", you must
replace them with "0" (a triple of zeros is treated as missing). This can be
done with this command:
sed -e 's/-nan/0/g' file.gen > newfile.gen
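
To show the effect of this replacement, the sketch below applies the same sed
command to a small, hypothetical GEN file (two markers, two individuals) and
then counts the genotype triples that are all zero, that is, the ones treated
as missing. The file contents and variable names are assumptions made for
illustration.

```shell
#!/bin/sh
# Hypothetical GEN file: 5 leading columns, then 3 probabilities per
# individual. The second individual has no call at the first marker.
gen_file=$(mktemp)
new_file=$(mktemp)
cat > "$gen_file" <<'EOF'
1 snp1 10500 A G 1 0 0 -nan -nan -nan
1 snp2 20750 C T 0 1 0 0.1 0.8 0.1
EOF

# Replace every "-nan" with 0, as in the command above.
sed -e 's/-nan/0/g' "$gen_file" > "$new_file"

# Lines still containing "-nan" (grep exits non-zero when there are none,
# hence the "|| true").
n_left=$(grep -c -e '-nan' "$new_file" || true)
echo "lines still containing -nan: $n_left"

# Genotype triples whose three probabilities are all zero.
n_missing=$(awk '{ for (i = 6; i + 2 <= NF; i += 3)
                     if ($i == 0 && $(i + 1) == 0 && $(i + 2) == 0) m++ }
                 END { print m + 0 }' "$new_file")
echo "missing genotype triples: $n_missing"

rm -f "$gen_file" "$new_file"
```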