pegas (version 1.3)

read.vcf: Read Variant Calling Format Files

Description

read.vcf reads allelic data from VCF (variant calling format) files.

write.vcf writes allelic data from an object of class "loci" into a VCF file.

Usage

read.vcf(file, from = 1, to = 10000, which.loci = NULL, quiet = FALSE)
write.vcf(x, file, CHROM = NULL, POS = NULL, quiet = FALSE)

Value

an object of class c("loci", "data.frame").

Arguments

file

a file name specified by either a variable of mode character, or a quoted string.

from, to

the loci to read; by default, the first 10,000.

which.loci

an alternative way to specify which loci to read is to give their indices (see link{VCFloci} how to obtain them).

quiet

a logical: should the progress of the operation be printed?

x

an object of class "loci".

CHROM, POS

two vectors giving the chromosomes and (genomic) positions of the loci (typically from the output of VCFloci).

Author

Emmanuel Paradis

Details

The VCF file can be compressed (*.gz) or not. Since pegas 0.11, compressed remote files can be read (see examples).

A TABIX file is not required (and will be ignored if present).

In the VCF standard, missing data are represented by a dot and these are read ``as is'' by the present function without trying to substitute by NA.

References

https://www.internationalgenome.org/wiki/Analysis/vcf4.0

https://github.com/samtools/hts-specs

See Also

VCFloci, read.loci, read.gtx, write.loci

Examples

Run this code
if (FALSE) {
## Chr Y from the 1000 Genomes:
a <- "https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502"
b <- "ALL.chrY.phase3_integrated_v1b.20130502.genotypes.vcf.gz"
## WARNING: the name of the file above may change
url <- paste(a, b, sep = "/")
## Solution 1: download first
download.file(url, "chrY.vcf.gz")
## no need to uncompress:
(info <- VCFloci("chrY.vcf.gz"))
str(info) # show the modes of the columns
## Solution 2: read remotely (since pegas 0.11)
info2 <- VCFloci(url)
identical(info, info2)
rm(info2)

SNP <- is.snp(info)
table(SNP) # how many loci are SNPs?
## compare with:
table(getINFO(info, "VT"))

op <- par(mfcol = c(4, 1), xpd = TRUE)
lim <- c(2.65e6, 2.95e6)
## distribution of SNP and non-SNP mutations along the Y chr:
plot(info$POS, !SNP, "h", col = "red", main = "non-SNP mutations",
     xlab = "Position", ylab = "", yaxt = "n")
rect(lim[1], -0.1, lim[2], 1.1, lwd = 2, lty = 2)
plot(info$POS, SNP, "h", col = "blue", main = "SNP mutations",
     xlab = "Position", ylab = "", yaxt = "n")
rect(lim[1], -0.1, lim[2], 1.1, lwd = 2, lty = 2)
par(xpd = FALSE)
## same focusing on a smaller portion of the chromosome:
plot(info$POS, !SNP, "h", col = "red", xlim = lim, xlab = "Position",
     ylab = "", yaxt = "n")
plot(info$POS, SNP, "h", col = "blue", xlim = lim, xlab = "Position",
     ylab = "", yaxt = "n")
par(op)

## read both types of mutations separately:
X.SNP <- read.vcf("chrY.vcf.gz", which.loci = which(SNP))
X.other <- read.vcf("chrY.vcf.gz", which.loci = which(!SNP))

identical(rownames(X.SNP), VCFlabels("chrY.vcf.gz")) # TRUE
cat(VCFheader("chrY.vcf.gz"))

## get haplotypes for the first 10 loci:
h <- haplotype(X.SNP, 1:10)
## plot their frequencies:
op <- par(mar = c(3, 10, 1, 1))
plot(h, horiz=TRUE, las = 1)
par(op)
}

Run the code above in your browser using DataLab