SEQMINER2

Introduction

Seqminer is a highly efficient R package for retrieving sequence variants from biobank-scale datasets of millions of individuals and billions of genetic variants. It supports all variant types, including multi-allelic variants and imputation dosages. It takes VCF, BCF, BGEN, or PLINK files as input, indexes them, queries them using a variant-based index, and loads the results into R data types such as lists or matrices.

Download

Install the development version (devtools package is required):

devtools::install_github("zhanxw/seqminer")

Showcase

Here are some examples of how to use seqminer to index and query files in real-life scenarios.

Index VCF/BCF files

library(seqminer)
bcf.ref.file <- "input.bcf"
bcf.idx.file <- "input.bcf.scIdx"
out <- seqminer::createSingleChromosomeBCFIndex(bcf.ref.file, bcf.idx.file)

or, for a VCF file:

vcf.ref.file <- "input.vcf.gz"
vcf.idx.file <- "input.vcf.gz.scIdx"
out <- seqminer::createSingleChromosomeVCFIndex(vcf.ref.file, vcf.idx.file)

This generates a variant-based index that works with commonly used sequence variant file formats such as VCF and BCF.
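Because indexing a large file can take a while, a script may prefer to build the index only when it does not exist yet. The following is a minimal sketch of that pattern. `ensure_index` is a hypothetical helper, not part of seqminer, and the default `.scIdx` suffix simply mirrors the file names used above. The indexing function is passed in as an argument so the same wrapper can serve both VCF and BCF files.

```r
# Hypothetical helper (not part of seqminer): build the index only if absent.
# `build_index` is any function taking (ref_file, idx_file), for example
# seqminer::createSingleChromosomeVCFIndex or
# seqminer::createSingleChromosomeBCFIndex.
ensure_index <- function(ref_file, build_index,
                         idx_file = paste0(ref_file, ".scIdx")) {
  if (!file.exists(ref_file)) {
    stop("Input file not found: ", ref_file)
  }
  if (!file.exists(idx_file)) {
    build_index(ref_file, idx_file)
  }
  invisible(idx_file)
}

# Example call (requires seqminer and a real input file):
# idx <- ensure_index("input.vcf.gz", seqminer::createSingleChromosomeVCFIndex)
```

Note that this sketch only checks whether the index file exists; it does not detect an index that is older than the data file.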

Query VCF/BCF files

Query VCF file:

vcf.ref.file <- "input.vcf.gz"
vcf.idx.file <- "input.vcf.gz.scIdx"
tabix.range <- "1:123-1234"
geno <- seqminer::readSingleChromosomeVCFToMatrixByRange(vcf.ref.file, tabix.range, vcf.idx.file)

Query BCF file:

bcf.ref.file <- "input.bcf"
bcf.idx.file <- "input.bcf.scIdx"
tabix.range <- "1:123-1234"
geno <- seqminer::readSingleChromosomeBCFToMatrixByRange(bcf.ref.file, tabix.range, bcf.idx.file)

Querying multiple regions is also possible: specify the regions in a single string, separated by commas, e.g. "1:123-124,1:1234-1235".
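When the regions come from a table rather than being typed by hand, the comma-separated string can be assembled in base R. This is a sketch under assumptions: the `regions` data frame, its column names, and the `make_range` helper are illustrative and not part of seqminer.

```r
# Hypothetical region table; the coordinates are illustrative only.
regions <- data.frame(
  chrom = c("1", "1"),
  start = c(123L, 1234L),
  end   = c(124L, 1235L),
  stringsAsFactors = FALSE
)

# Format each row as "chrom:start-end", then join the rows with commas.
make_range <- function(regions) {
  stopifnot(all(regions$start <= regions$end))
  # format(..., scientific = FALSE) keeps large positions in plain digits.
  fmt <- function(x) format(x, scientific = FALSE, trim = TRUE)
  paste0(regions$chrom, ":", fmt(regions$start), "-", fmt(regions$end),
         collapse = ",")
}

tabix.range <- make_range(regions)
print(tabix.range)
```

The explicit `format()` call is a precaution: R may render round positions such as 100000 in scientific notation when converting numbers to text, which would not be a valid range.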

In the output, columns represent variants and rows represent individuals.
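To make the orientation concrete, here is a toy matrix with the same layout and two per-variant summaries computed along its columns. The values, sample names, and variant names are made up, and the 0/1/2 genotype coding with NA for missing calls is an assumption for illustration; check the actual values returned for your file.

```r
# Toy genotype matrix standing in for a query result (values are made up).
# Rows are individuals and columns are variants.
# Assumption: genotypes are coded as 0/1/2 alternate-allele counts, NA = missing.
geno <- matrix(
  c(0, 1, 2,
    0, NA, 1),
  nrow = 3,
  dimnames = list(c("id1", "id2", "id3"), c("1:123", "1:456"))
)

# Per-variant fraction of non-missing calls.
call_rate <- colMeans(!is.na(geno))
# Per-variant alternate allele frequency (each individual carries two alleles).
alt_freq <- colMeans(geno, na.rm = TRUE) / 2

print(dim(geno))  # individuals x variants
print(call_rate)
print(alt_freq)
```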

Query BGEN/PLINK files

Query BGEN file:

bg.ref.file <- "input.bgen"
bg.range <- "1:123-1234"
geno.mat <- seqminer::readBGENToMatrixByRange(bg.ref.file, bg.range)
geno.list <- seqminer::readBGENToListByRange(bg.ref.file, bg.range)

Make sure that the BGEN file has an index file (*.bgi) in the same folder.
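A quick existence check before querying can turn a confusing read failure into a clear message. The sketch below assumes the index is named by appending `.bgi` to the BGEN file name (the convention used by bgenix); `has_bgi_index` is a hypothetical helper, not a seqminer function.

```r
# Hypothetical helper (not part of seqminer): look for "<file>.bgi" beside the
# BGEN file. The "<file>.bgen.bgi" naming is an assumption based on bgenix.
has_bgi_index <- function(bgen_file) {
  file.exists(bgen_file) && file.exists(paste0(bgen_file, ".bgi"))
}

bg.ref.file <- "input.bgen"
if (!has_bgi_index(bg.ref.file)) {
  message("Missing BGEN file or index; expected ", bg.ref.file, ".bgi")
}
```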

Query PLINK file:

plink.ref.file <- "input"
geno <- seqminer::readPlinkToMatrixByIndex(plink.ref.file, sampleIndex=1:20000, markerIndex=1:100)
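The PLINK input is given as a file prefix, so "input" stands for the binary file set input.bed, input.bim, and input.fam. Since the query selects samples and markers by index, it can help to know how many of each the file set holds before choosing index ranges. The sketch below counts lines in the .fam file (one per sample) and the .bim file (one per marker). `plink_dims` is a hypothetical helper, not part of seqminer.

```r
# Hypothetical helper (not part of seqminer): report sample and marker counts
# for a binary PLINK file set given its prefix (prefix.bed/.bim/.fam).
plink_dims <- function(prefix) {
  paths <- paste0(prefix, c(".bed", ".bim", ".fam"))
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing PLINK files: ", paste(missing, collapse = ", "))
  }
  # .fam has one line per sample; .bim has one line per marker.
  c(samples = length(readLines(paste0(prefix, ".fam"))),
    markers = length(readLines(paste0(prefix, ".bim"))))
}

# Example (requires seqminer and real files):
# d <- plink_dims("input")
# geno <- seqminer::readPlinkToMatrixByIndex(
#   "input",
#   sampleIndex = seq_len(min(20000, d[["samples"]])),
#   markerIndex = seq_len(min(100, d[["markers"]])))
```

Reading the whole .bim file with `readLines` is simple but can be memory-hungry for very large marker sets; a streaming line count would be a reasonable substitute there.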

Command line interface

We also developed a seqminer command line interface:

./queryVCFIndex.intel input.vcf.gz input.vcf.gz.scIdx 1:123-1234

Citation:

Yang, L., Jiang, S., Jiang, B., Liu, D. J., & Zhan, X. (2020). Seqminer2: An Efficient Tool to Query and Retrieve Genotypes for Statistical Genetics Analyses from Biobank Scale Sequence Dataset. Bioinformatics.

Zhan, X. and Liu, D. J. (2015), SEQMINER: An R-Package to Facilitate the Functional Interpretation of Sequence-Based Associations. Genet. Epidemiol., 39: 619–623. doi:10.1002/gepi.21918

Install

install.packages('seqminer')

Version: 9.7
License: GPL | file LICENSE
Maintainer: Xiaowei Zhan
Last Published: October 2nd, 2024
Monthly Downloads: 675

Functions in seqminer (9.7)

openPlink: Open binary PLINK files
readSingleChromosomeBCFToMatrixByRange: Read a range from BCF file and return a genotype matrix
isDirWritable: Test whether directory is writable
rvmeta.readScoreByRange: Read score test statistics by range from METAL-format files.
readBGENToListByRange: Read information from BGEN file in a given range and return a list
rvmeta.readSkewByRange: Read skew by range from METAL-format files.
readBGENToMatrixByGene: Read a gene from BGEN file and return a genotype matrix
rvmeta.writeScoreData: Write score-based association statistics files.
rvmeta.writeCovData: Write covariance association statistics files.
isTabixRange: Check if the inputs are valid tabix ranges such as chr1:2-300
readVCFToMatrixByGene: Read a gene from VCF file and return a genotype matrix
rvmeta.readDataByRange: Read association statistics by range from METAL-format files. Both score statistics and covariance statistics will be extracted.
readVCFToMatrixByRange: Read a range from VCF file and return a genotype matrix
readSingleChromosomeVCFToMatrixByRange: Read a range from VCF file and return a genotype matrix
[.PlinkFile: Read from binary PLINK file and return a genotype matrix
readPlinkToMatrixByIndex: Read from binary PLINK file and return a genotype matrix
validateAnnotationParameter: Check that the annotation parameter is valid
tabix.createIndex: Create tabix index file, similar to running tabix in command line.
tabix.read.table: Read tabix file, similar to running tabix in command line.
readBGENToListByGene: Read information from BGEN file for a given gene and return a list
rvmeta.readCovByRange: Read covariance by range from METAL-format files.
tabix.createIndex.meta: Create tabix index for bgzipped MetaScore/MetaCov file
rvmeta.readDataByGene: Read association statistics by gene from METAL-format files. Both score statistics and covariance statistics will be extracted.
verifyFilename: Validate that inVcf can be created and that outVcf can be written to. Stops if any error occurs.
writeWorkflow: Export workflow to Makefile
tabix.read.header: Read tabix file, similar to running tabix in command line.
rvmeta.readNullModel: Read null model statistics
tabix.read: Read tabix file, similar to running tabix in command line.
tabix.createIndex.vcf: Create tabix index for bgzipped VCF file
SEQMINER: Efficiently Read Sequencing Data (VCF format, METAL format) into R
getRefBase: Annotate a test variant
getCovPair: Extract pair of positions by ranges
createSingleChromosomeVCFIndex: Create a single chromosome index
annotateGene: Annotate a test variant
createSingleChromosomeBCFIndex: Create a single chromosome index
annotatePlain: Annotate a plain text file
makeAnnotationParameter: Construct a usable set of annotation parameters
annotateVcf: Annotate a VCF file
isURL: Check if the input is a URL, e.g. http:// or ftp://
download.annotation.resource: Download annotation resources to a directory
newJob: Create a new job
newWorkflow: Create a new workflow
addJob: Add a job to a workflow
isInRange: Test whether the positions in a vector are inside given ranges
hasIndex: Check whether the input file has a tabix index
readBGENToMatrixByRange: Read a range from BGEN file and return a genotype matrix
readVCFToListByGene: Read information from VCF file for a given gene and return a list
readVCFToListByRange: Read information from VCF file in a given range and return a list