pegas (version 0.14)

read.vcf: Read Variant Calling Format Files

Description

This function reads allelic data from VCF (variant calling format) files.

Usage

read.vcf(file, from = 1, to = 10000, which.loci = NULL, quiet = FALSE)

Arguments

file

a file name specified by either a variable of mode character, or a quoted string.

from, to

the loci to read; by default, the first 10,000.

which.loci

an alternative way to specify which loci to read is to give their indices (see link{VCFloci} how to obtain them).

quiet

a logical: should the progress of the operation be printed?

Value

an object of class c("loci", "data.frame").

Details

The VCF file can be compressed (*.gz) or not. Since pegas 0.11, compressed remote files can be read (see examples).

A TABIX file is not required (and will be ignored if present).

In the VCF standard, missing data are represented by a dot and these are read ``as is'' by the present function without trying to substitute by NA.

References

https://www.internationalgenome.org/wiki/Analysis/vcf4.0

https://github.com/samtools/hts-specs

See Also

VCFloci, read.loci, read.gtx, write.loci

Examples

Run this code
# NOT RUN {
## Chr Y from the 1000 Genomes:
a <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502"
b <- "ALL.chrY.phase3_integrated_v1b.20130502.genotypes.vcf.gz"
## WARNING: the name of the file above may change
url <- paste(a, b, sep = "/")
## Solution 1: download first
download.file(url, "chrY.vcf.gz")
## no need to uncompress:
(info <- VCFloci("chrY.vcf.gz"))
str(info) # show the modes of the columns
## Solution 2: read remotely (since pegas 0.11)
info2 <- VCFloci(url)
identical(info, info2)
rm(info2)

SNP <- is.snp(info)
table(SNP) # how many loci are SNPs?
## compare with:
table(getINFO(info, "VT"))

op <- par(mfcol = c(4, 1), xpd = TRUE)
lim <- c(2.65e6, 2.95e6)
## distribution of SNP and non-SNP mutations along the Y chr:
plot(info$POS, !SNP, "h", col = "red", main = "non-SNP mutations",
     xlab = "Position", ylab = "", yaxt = "n")
rect(lim[1], -0.1, lim[2], 1.1, lwd = 2, lty = 2)
plot(info$POS, SNP, "h", col = "blue", main = "SNP mutations",
     xlab = "Position", ylab = "", yaxt = "n")
rect(lim[1], -0.1, lim[2], 1.1, lwd = 2, lty = 2)
par(xpd = FALSE)
## same focusing on a smaller portion of the chromosome:
plot(info$POS, !SNP, "h", col = "red", xlim = lim, xlab = "Position",
     ylab = "", yaxt = "n")
plot(info$POS, SNP, "h", col = "blue", xlim = lim, xlab = "Position",
     ylab = "", yaxt = "n")
par(op)

## read both types of mutations separately:
X.SNP <- read.vcf("chrY.vcf.gz", which.loci = which(SNP))
X.other <- read.vcf("chrY.vcf.gz", which.loci = which(!SNP))

identical(rownames(X.SNP), VCFlabels("chrY.vcf.gz")) # TRUE
cat(VCFheader("chrY.vcf.gz"))

## get haplotypes for the first 10 loci:
h <- haplotype(X.SNP, 1:10)
## plot their frequencies:
op <- par(mar = c(3, 10, 1, 1))
plot(h, horiz=TRUE, las = 1)
par(op)
# }

Run the code above in your browser using DataLab