SEQMINER2

Introduction

Seqminer is a highly efficient R package for retrieving sequence variants from biobank-scale datasets of millions of individuals and billions of genetic variants. It supports all variant types, including multi-allelic variants and imputation dosages. It takes VCF, BCF, BGEN, or PLINK files as input, indexes and queries them using a variant-based index, and loads the results as R data types such as lists or matrices.

Download

Install the development version (devtools package is required):

devtools::install_github("zhanxw/seqminer")

Showcase

Here are some examples of how to use seqminer to index and query files in real-life scenarios.

Index VCF/BCF files

library(seqminer)
bcf.ref.file <- "input.bcf"
bcf.idx.file <- "input.bcf.scIdx"
out <- seqminer::createSingleChromosomeBCFIndex(bcf.ref.file, bcf.idx.file)

or

vcf.ref.file <- "input.vcf.gz"
vcf.idx.file <- "input.vcf.gz.scIdx"
out <- seqminer::createSingleChromosomeVCFIndex(vcf.ref.file, vcf.idx.file)

This generates a variant-based index that works with commonly used sequence variant file formats such as VCF and BCF.
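
As a minimal sketch of the indexing step, the helper below picks the indexing function from the file extension and skips files that already have an index. The names `index_path` and `ensure_index` are hypothetical (they are not part of seqminer), and appending `.scIdx` to the input name simply follows the naming used in the examples above.

```r
# Hypothetical helpers, not part of seqminer.

# Derive the index file name by appending ".scIdx", as in the examples above.
index_path <- function(ref.file) {
  paste0(ref.file, ".scIdx")
}

# Build the index only if it does not exist yet; returns the index path.
ensure_index <- function(ref.file) {
  idx.file <- index_path(ref.file)
  if (file.exists(idx.file)) {
    return(idx.file)
  }
  if (grepl("\\.bcf$", ref.file)) {
    seqminer::createSingleChromosomeBCFIndex(ref.file, idx.file)
  } else if (grepl("\\.vcf\\.gz$", ref.file)) {
    seqminer::createSingleChromosomeVCFIndex(ref.file, idx.file)
  } else {
    stop("Expected a .bcf or .vcf.gz file: ", ref.file)
  }
  idx.file
}

# Usage (requires seqminer and an existing input file):
# idx <- ensure_index("input.vcf.gz")
```

Checking for an existing index first avoids re-indexing a large file on every run.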

Query VCF/BCF files

Query VCF file:

vcf.ref.file <-  "input.vcf.gz"
vcf.idx.file <-  "input.vcf.gz.scIdx"
tabix.range <- "1:123-1234"
geno <- seqminer::readSingleChromosomeVCFToMatrixByRange(vcf.ref.file, tabix.range, vcf.idx.file)

Query BCF file:

bcf.ref.file <- "input.bcf"
bcf.idx.file <- "input.bcf.scIdx"
tabix.range <- "1:123-1234"
geno <- seqminer::readSingleChromosomeBCFToMatrixByRange(bcf.ref.file, tabix.range, bcf.idx.file)

Querying multiple regions is also possible: specify the regions and separate them with commas, e.g. "1:123-124,1:1234-1235".
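
A multi-region string can be assembled from a table of coordinates. The sketch below is base R only; `make_range` is a hypothetical helper, not a seqminer function, and the region values are the ones from the sentence above.

```r
# Hypothetical helper, not part of seqminer: build a comma-separated
# range string from chromosome names and start/end positions.
make_range <- function(chrom, start, end) {
  stopifnot(length(chrom) == length(start), length(start) == length(end))
  # Coerce to integer so large positions are not written as 1e+05.
  start <- as.integer(start)
  end <- as.integer(end)
  stopifnot(all(start <= end))
  paste0(chrom, ":", start, "-", end, collapse = ",")
}

regions <- data.frame(chrom = c("1", "1"),
                      start = c(123, 1234),
                      end   = c(124, 1235))
tabix.range <- make_range(regions$chrom, regions$start, regions$end)
print(tabix.range)  # "1:123-124,1:1234-1235"
```

The resulting string can be passed as the range argument of the query functions shown above.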

In the output matrix, columns represent variants and rows represent individuals.
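
With variants in columns and individuals in rows, per-variant summaries are column operations. The sketch below uses a small made-up matrix rather than real seqminer output, and it assumes genotypes are coded as 0/1/2 alternate-allele counts with `NA` for missing calls; both are assumptions for illustration.

```r
# Toy genotype matrix (made-up values, not seqminer output):
# rows are individuals, columns are variants.
geno <- matrix(c(0, 1, 2,
                 0, NA, 1),
               nrow = 3,
               dimnames = list(c("id1", "id2", "id3"),
                               c("1:123", "1:456")))

# Assumed coding: 0/1/2 copies of the alternate allele, NA = missing.
# Alternate-allele frequency per variant = mean dosage / 2.
alt.freq <- colMeans(geno, na.rm = TRUE) / 2

# Per-variant call rate = fraction of non-missing genotypes.
call.rate <- colMeans(!is.na(geno))
```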

Query BGEN/PLINK files

Query BGEN file:

bg.ref.file <- "input.bgen"
bg.range <- "1:123-1234"
geno.mat <- seqminer::readBGENToMatrixByRange(bg.ref.file, bg.range)
geno.list <- seqminer::readBGENToListByRange(bg.ref.file, bg.range)

Make sure that the BGEN file has an index file (*.bgi) in the same folder.
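
The index requirement can be checked before querying. This is a minimal sketch: `has_bgen_index` is a hypothetical helper, not part of seqminer, and it assumes the index is named by appending `.bgi` to the BGEN file name.

```r
# Hypothetical helper, not part of seqminer: check that a BGEN file and
# its index both exist. Assumes the index is named "<bgen file>.bgi".
has_bgen_index <- function(bgen.file) {
  file.exists(bgen.file) && file.exists(paste0(bgen.file, ".bgi"))
}

# Usage (requires seqminer and real input files):
# if (has_bgen_index("input.bgen")) {
#   geno.mat <- seqminer::readBGENToMatrixByRange("input.bgen", "1:123-1234")
# }
```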

Query PLINK file:

plink.ref.file <- "input"
geno <- seqminer::readPlinkToMatrixByIndex(plink.ref.file, sampleIndex=1:20000, markerIndex=1:100)
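
Because markers are selected by index, a large file can be read in blocks instead of all at once. The sketch below only builds the index batches; `marker_batches` is a hypothetical helper, not part of seqminer, and the marker count and batch size are made-up values.

```r
# Hypothetical helper, not part of seqminer: split marker indices 1..n
# into consecutive batches of at most batch.size markers.
marker_batches <- function(n.markers, batch.size) {
  idx <- seq_len(n.markers)
  unname(split(idx, ceiling(idx / batch.size)))
}

batches <- marker_batches(n.markers = 250, batch.size = 100)
lengths(batches)  # batch sizes: 100, 100, 50

# Usage (requires seqminer and the PLINK files behind the "input" prefix):
# for (b in batches) {
#   geno <- seqminer::readPlinkToMatrixByIndex("input", sampleIndex = 1:20000,
#                                              markerIndex = b)
#   # ... process this block before reading the next one ...
# }
```

Reading in blocks keeps only one block of genotypes in memory at a time, at the cost of one call per block.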

Command line interface

We also developed a seqminer command line interface:

./queryVCFIndex.intel input.vcf.gz input.vcf.gz.scIdx 1:123-1234

Citation:

Yang, L., Jiang, S., Jiang, B., Liu, D. J., & Zhan, X. (2020). Seqminer2: An Efficient Tool to Query and Retrieve Genotypes for Statistical Genetics Analyses from Biobank Scale Sequence Dataset. Bioinformatics.

Zhan, X. and Liu, D. J. (2015), SEQMINER: An R-Package to Facilitate the Functional Interpretation of Sequence-Based Associations. Genet. Epidemiol., 39: 619–623. doi:10.1002/gepi.21918

Install

install.packages('seqminer')

Monthly Downloads

744

Version

9.4

License

GPL | file LICENSE

Maintainer

Xiaowei Zhan

Last Published

February 3rd, 2024

Functions in seqminer (9.4)

openPlink

Open binary PLINK files
isDirWritable

Test whether directory is writable
hasIndex

Check whether the input file has a tabix index
readBGENToMatrixByRange

Read a range from BGEN file and return a genotype matrix
[.PlinkFile

Read from binary PLINK file and return a genotype matrix
tabix.createIndex

Create tabix index file, similar to running tabix in command line.
readPlinkToMatrixByIndex

Read from binary PLINK file and return a genotype matrix
readBGENToListByRange

Read information from BGEN file in a given range and return a list
annotatePlain

Annotate a plain text file
createSingleChromosomeVCFIndex

Create a single chromosome index
download.annotation.resource

Download annotation resources to a directory
rvmeta.readDataByGene

Read association statistics by gene from METAL-format files. Both score statistics and covariance statistics will be extracted.
makeAnnotationParameter

Construct a usable set of annotation parameters
isURL

Check if the input is a URL, e.g. http:// or ftp://
createSingleChromosomeBCFIndex

Create a single chromosome index
readVCFToMatrixByGene

Read a gene from VCF file and return a genotype matrix
readBGENToListByGene

Read information from BGEN file for a given gene and return a list
readSingleChromosomeBCFToMatrixByRange

Read a range from BCF file and return a genotype matrix
newJob

Create a new job
readSingleChromosomeVCFToMatrixByRange

Read a range from VCF file and return a genotype matrix
readBGENToMatrixByGene

Read a gene from BGEN file and return a genotype matrix
newWorkflow

Create a new workflow
readVCFToMatrixByRange

Read a range from VCF file and return a genotype matrix
readVCFToListByGene

Read information from VCF file for a given gene and return a list
rvmeta.readDataByRange

Read association statistics by range from METAL-format files. Both score statistics and covariance statistics will be extracted.
rvmeta.writeCovData

Write covariance association statistics files.
rvmeta.readNullModel

Read null model statistics
rvmeta.readCovByRange

Read covariance by range from METAL-format files.
verifyFilename

Validate that inVcf can be created and outVcf can be written to; stops if any error occurs.
rvmeta.writeScoreData

Write score-based association statistics files.
tabix.createIndex.meta

Create tabix index for bgzipped MetaScore/MetaCov file
tabix.read

Read tabix file, similar to running tabix in command line.
tabix.createIndex.vcf

Create tabix index for bgzipped VCF file
readVCFToListByRange

Read information from VCF file in a given range and return a list
rvmeta.readScoreByRange

Read score test statistics by range from METAL-format files.
writeWorkflow

Export workflow to Makefile
tabix.read.header

Read tabix file, similar to running tabix in command line.
isInRange

Test whether each position in a vector is inside the given ranges
tabix.read.table

Read tabix file, similar to running tabix in command line.
isTabixRange

Check if the inputs are valid tabix range such as chr1:2-300
validateAnnotationParameter

Validate that the annotation parameter is valid
rvmeta.readSkewByRange

Read skew by range from METAL-format files.
addJob

Add a job to a workflow
annotateVcf

Annotate a VCF file
getCovPair

Extract pair of positions by ranges
annotateGene

Annotate a test variant
getRefBase

Annotate a test variant
SEQMINER

Efficiently Read Sequencing Data (VCF format, METAL format) into R