biomartr
Genomic Data Retrieval with R
Motivation:
This package was born out of my own frustration when trying to automate the genomic data retrieval process to create computationally reproducible scripts for large-scale genomics studies. Since I couldn't find easy-to-use and fully reproducible software libraries that would allow others and me to write transparent and easily reproducible code, I sat down and implemented a framework that enables anyone to automate the genomic data retrieval process. Personally, I strongly support and believe in reproducible research, and I truly hope that this package will be useful to others as well and that it helps to promote reproducible research in genomics studies.
I happily welcome anyone who wishes to contribute to this project :) Just drop me an email.
Short package description:
The rapidly growing number of sequenced genomes allows us to perform a new type of biological research. Using a comparative approach, these genomes provide us with new insights into how biological information is encoded on the molecular level and how this information changes over evolutionary time.
The first step of any genome-based study, however, is to retrieve genomes from databases. To automate the retrieval process on a meta-genomic scale, the biomartr package provides useful interface functions for genomic sequence retrieval and functional annotation retrieval. The major aim of biomartr is to facilitate computational reproducibility and large-scale handling of genomic data for (meta-)genomic analyses.
In detail, biomartr aims to provide users with an easy-to-use framework to obtain genome, proteome, CDS, GFF (annotation), genome assembly quality, and metagenome project data. Furthermore, an interface to the Ensembl BioMart database allows users to retrieve functional annotation for genomic loci.
Users can download entire databases such as NCBI RefSeq, NCBI nr, NCBI nt, NCBI Genbank, etc., as well as ENSEMBL and ENSEMBLGENOMES, with only one command.
Hence, the biomartr package is designed to achieve the highest degree of computational reproducibility in genomics research.
Citation
Please cite the following paper when using biomartr for your own research. This will allow me to continue working on this software tool and will motivate me to extend its functionality and usability in the coming years. Many thanks in advance :)
Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821.
Platforms
Find biomartr also at OmicTools.
Frequently Asked Questions (FAQs)
Please find all FAQs here.
Discussions and Bug Reports
I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.
Furthermore, in case you find some bugs or need additional (more flexible) functionality of parts of this package, please let me know:
For Bug Reports: Please send me an issue.
Tutorials
Getting Started with biomartr:
- Introduction
- Database Retrieval
- Genomic Sequence Retrieval
- Meta-Genome Retrieval
- Functional Annotation
- BioMart Examples
Users can also read the tutorials within RStudio:
# load the biomartr package
library(biomartr)
# look for all tutorials (vignettes) available in the biomartr package
# this will open your web browser
browseVignettes("biomartr")
Installation
# install biomartr 0.5.1
source("http://bioconductor.org/biocLite.R")
biocLite('biomartr')
Install Developer Version
Some bug fixes or new functionality will not be available on CRAN yet, but only in the developer version here on GitHub. To download and install the most recent version of biomartr, run:
# install the current version of biomartr on your system
source("http://bioconductor.org/biocLite.R")
biocLite("HajkD/biomartr")
NEWS
The current status of the package as well as a detailed history of the functionality of each version of biomartr can be found in the NEWS section.
Genomic Data Retrieval
Meta-Genome Retrieval
- meta.retrieval(): Perform Meta-Genome Retrieval from NCBI of species belonging to the same kingdom of life or to the same taxonomic subgroup
- meta.retrieval.all(): Perform Meta-Genome Retrieval from NCBI of the entire kingdom of life
- getMetaGenomes(): Retrieve metagenomes from NCBI Genbank
- getMetaGenomeAnnotations(): Retrieve annotation *.gff files for metagenomes from NCBI Genbank
- listMetaGenomes(): List available metagenomes on NCBI Genbank
- getMetaGenomeSummary(): Helper function to retrieve the assembly_summary.txt file from NCBI Genbank metagenomes
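Because a kingdom-wide retrieval can take a long time and a lot of disk space, it can help to inspect the request before starting it. The following is a minimal sketch, not part of biomartr: `retrieve_kingdom()` and its `dry_run` switch are hypothetical, and the argument names `kingdom`, `db` and `type` as well as the example values are assumptions that should be checked against `?meta.retrieval` for your installed version.

```r
# Hypothetical helper (not part of biomartr): collect the arguments for
# meta.retrieval() and only start the download when dry_run = FALSE.
retrieve_kingdom <- function(kingdom, db = "refseq", type = "genome",
                             dry_run = TRUE) {
  args <- list(kingdom = kingdom, db = db, type = type)
  if (dry_run) {
    return(args)  # inspect the plan without contacting NCBI
  }
  # Assumed argument names; verify with ?meta.retrieval
  do.call(biomartr::meta.retrieval, args)
}

# Dry run: returns the argument list and downloads nothing
plan <- retrieve_kingdom("bacteria")
str(plan)
```

Calling `retrieve_kingdom("bacteria", dry_run = FALSE)` would then hand the same arguments to `meta.retrieval()`.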
Genome Retrieval
- listGenomes(): List all genomes available on NCBI and ENSEMBL servers
- listKingdoms(): List the number of available species per kingdom of life on NCBI and ENSEMBL servers
- listGroups(): List the number of available species per group on NCBI and ENSEMBL servers
- getKingdoms(): Retrieve available kingdoms of life
- getGroups(): Retrieve available groups for a kingdom of life
- is.genome.available(): Check Genome Availability on NCBI and ENSEMBL servers
- getGenome(): Download a specific genome stored on NCBI and ENSEMBL servers
- getProteome(): Download a specific proteome stored on NCBI and ENSEMBL servers
- getCDS(): Download a specific CDS file (genome) stored on NCBI and ENSEMBL servers
- getRNA(): Download a specific RNA file stored on NCBI and ENSEMBL servers
- getGFF(): Genome Annotation Retrieval from NCBI (*.gff) and ENSEMBL (*.gff3) servers
- getGTF(): Genome Annotation Retrieval (*.gtf) from ENSEMBL servers
- getRepeatMasker(): Repeat Masker TE Annotation Retrieval
- getAssemblyStats(): Genome Assembly Stats Retrieval from NCBI
- getKingdomAssemblySummary(): Helper function to retrieve the assembly_summary.txt files from NCBI for all kingdoms
- getMetaGenomeSummary(): Helper function to retrieve the assembly_summary.txt files from NCBI Genbank metagenomes
- getSummaryFile(): Helper function to retrieve the assembly_summary.txt file from NCBI for a specific kingdom
- getENSEMBLInfo(): Retrieve ENSEMBL info file
- getGENOMEREPORT(): Retrieve GENOME_REPORTS file from NCBI
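A typical use of the functions above is to check availability first and download only if the genome exists. The sketch below shows one way this could look. `fetch_genome()` is a hypothetical wrapper, and the argument names (`db`, `organism`, `path`), the default download folder and the assumption that `is.genome.available()` returns a single TRUE/FALSE are taken to match the package documentation; please confirm them with `?getGenome` and `?is.genome.available`.

```r
# Hypothetical wrapper (not part of biomartr): download a genome only
# after checking that it is available in the chosen database.
fetch_genome <- function(organism, db = "refseq",
                         path = file.path("_ncbi_downloads", "genomes")) {
  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Package 'biomartr' is required for this sketch.")
  }
  # Assumed to return a single TRUE/FALSE value
  available <- biomartr::is.genome.available(db = db, organism = organism)
  if (!isTRUE(available)) {
    stop("No genome found for '", organism, "' in '", db, "'.")
  }
  # Assumed to return the path of the downloaded file
  biomartr::getGenome(db = db, organism = organism, path = path)
}

# Not run here because it needs network access:
# genome_file <- fetch_genome("Homo sapiens")
```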
Import Downloaded Files
- read_genome(): Import genomes as Biostrings or data.table object
- read_proteome(): Import proteome as Biostrings or data.table object
- read_cds(): Import CDS as Biostrings or data.table object
- read_gff(): Import GFF file
- read_rna(): Import RNA file
- read_rm(): Import Repeat Masker output file
- read_assemblystats(): Import Genome Assembly Stats File
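To illustrate the import step without a real download, the sketch below writes a tiny FASTA file to a temporary location and reads it back with `read_genome()`. The sequence is made up for illustration, and the `file` argument name and the Biostrings default return type are assumptions to verify with `?read_genome`. The import is skipped if biomartr is not installed.

```r
# Write a tiny, made-up FASTA file so that no download is needed
fasta_file <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ATGCGTACGTTAG"), fasta_file)

if (requireNamespace("biomartr", quietly = TRUE)) {
  # Assumed signature; by default this should return a Biostrings object
  genome <- biomartr::read_genome(file = fasta_file)
  print(genome)
}
```

In a real workflow, the file path returned by a retrieval function such as getGenome() would take the place of `fasta_file`.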
Database Retrieval
- listNCBIDatabases(): Retrieve a List of Available NCBI Databases for Download
- download.database(): Download an NCBI database to your local hard drive
- download.database.all(): Download a complete NCBI Database such as NCBI nr to your local hard drive
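Large NCBI databases are split into many archive files, so it can be useful to select a subset before downloading. The sketch below assumes that `listNCBIDatabases(db = "nr")` returns a character vector of file names and that `download.database()` accepts one file name via `db` plus a target folder via `path`; both are assumptions to check against the help pages. The file names and the `download_chunks()` helper are purely illustrative.

```r
# Illustrative file names; in practice one would use something like
#   chunks <- biomartr::listNCBIDatabases(db = "nr")
chunks <- c("nr.00.tar.gz", "nr.01.tar.gz", "nr.02.tar.gz")

# Keep only the first two files for this example
wanted <- chunks[1:2]

# Hypothetical helper (not part of biomartr): download each selected file
download_chunks <- function(files, path) {
  for (f in files) {
    biomartr::download.database(db = f, path = path)
  }
  invisible(files)
}

# Not run here because it needs network access and disk space:
# download_chunks(wanted, path = "nr")
```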
BioMart Queries
- biomart(): Main function to query the BioMart database
- getMarts(): Retrieve All Available BioMart Databases
- getDatasets(): Retrieve All Available Datasets for a BioMart Database
- getAttributes(): Retrieve All Available Attributes for a Specific Dataset
- getFilters(): Retrieve All Available Filters for a Specific Dataset
- organismBM(): Function for organism specific retrieval of available BioMart marts and datasets
- organismAttributes(): Function for organism specific retrieval of available BioMart attributes
- organismFilters(): Function for organism specific retrieval of available BioMart filters
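A BioMart query combines a mart, a dataset, a set of attributes to return and a filter that says how the input genes are identified. The sketch below shows one possible way to bundle these choices. `query_gene_positions()` is a hypothetical wrapper, and the mart, dataset, attribute and filter names are assumptions based on common Ensembl naming; they can be checked with getMarts(), getDatasets(), getAttributes() and getFilters().

```r
# Example Ensembl gene IDs (BRCA2 and TP53)
gene_ids <- c("ENSG00000139618", "ENSG00000141510")

# Hypothetical wrapper (not part of biomartr): fetch positions and
# descriptions for a set of Ensembl gene IDs.
query_gene_positions <- function(genes,
                                 mart = "ENSEMBL_MART_ENSEMBL",
                                 dataset = "hsapiens_gene_ensembl") {
  # Assumed argument names; verify with ?biomart
  biomartr::biomart(
    genes      = genes,
    mart       = mart,
    dataset    = dataset,
    attributes = c("start_position", "end_position", "description"),
    filters    = "ensembl_gene_id"
  )
}

# Not run here because it needs network access:
# result <- query_gene_positions(gene_ids)
```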
Performing Gene Ontology queries
Gene Ontology
- getGO(): Function to retrieve GO terms for a given set of genes
Download Developer Version On Windows Systems
# On Windows, this won't work - see ?build_github_devtools
install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
# When working with Windows, first install Rtools. Note that Rtools is a
# separate toolchain installer distributed via CRAN, not an R package,
# so it cannot be installed with install.packages().
# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:
devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
# and then load it from the library (adjust lib.loc to your R installation)
library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")
Troubleshooting on Windows Machines
- Install biomartr on a Win 8 laptop: solution (thanks to Andres Romanowski)
Code of conduct
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.