Locally query GenBank

NOTE: Starting with v2.0.0, the database backend changed from MonetDBLite to duckdb. Because of this change, restez v2.0.0 or higher is not compatible with databases built with previous versions of restez.

Download parts of NCBI’s GenBank to a local folder and build a simple SQL-like database from them. Use the ‘get’ tools to query the database by accession ID. Wrappers around rentrez functions are also available, so sequences that are not available locally can be searched for online through Entrez.

See the detailed tutorials for more information.

Introduction

Vous entrez, vous rentrez et, maintenant, vous … restez! (French: "You enter, you re-enter and, now, you … stay!")

Downloading sequences and sequence information from GenBank and related NCBI taxonomic databases is often performed via the NCBI API, Entrez. Entrez, however, limits the number of requests, and downloading large amounts of sequence data this way can be inefficient. For programmatic workflows that make many Entrez calls, downloading may take days, weeks or even months.
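To put the request limit in perspective, here is a back-of-the-envelope estimate in R. It assumes NCBI's documented E-utilities limits of 3 requests per second without an API key and 10 with one. The helper `estimate_entrez_hours` and its one-record-per-request default are illustrative assumptions, not part of restez or rentrez, and the result ignores network latency and server response time.

```r
# Rough lower bound on the time needed to fetch records through Entrez.
# Hypothetical helper for illustration only; not part of restez or rentrez.
estimate_entrez_hours <- function(n_records,
                                  requests_per_sec = 3,
                                  records_per_request = 1) {
  n_requests <- ceiling(n_records / records_per_request)
  n_requests / requests_per_sec / 3600
}

# One million records, one request each, without an API key
hours_no_key <- estimate_entrez_hours(1e6)
# The same job with an API key (10 requests per second)
hours_with_key <- estimate_entrez_hours(1e6, requests_per_sec = 10)
round(c(no_key = hours_no_key, with_key = hours_with_key), 1)
```

Under these assumptions, one million single-record requests need roughly 93 hours without an API key and 28 hours with one, before any latency is counted.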

This package aims to make sequence retrieval more efficient by allowing a user to download large sections of the GenBank database to their local machine and query this local database through either package-specific functions or Entrez wrappers. This process is more efficient because GenBank downloads are made via NCBI’s FTP using compressed sequence files. With a good internet connection and a middle-of-the-road computer, a database comprising 20 GB of sequence information can be generated in less than 10 minutes.

Installation

Install from CRAN:

install.packages("restez")

Or install the development version from r-universe:

install.packages("restez", repos = "https://ropensci.r-universe.dev")

Or install the development version from GitHub (requires installing the remotes package first):

# install.packages("remotes")
remotes::install_github("ropensci/restez")

Quick Examples

For more information on the package’s functions and step-by-step guides on downloading, constructing and querying a database, see the detailed tutorials.

Setup

# Warning: running these examples may take a few minutes
library(restez)
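# choose a location to store GenBank files
# (rstz_pth is a placeholder; a temporary directory is used here as an example,
# use a persistent folder to keep the database across sessions)
rstz_pth <- tempdir()
restez_path_set(rstz_pth)
# Run the download function (prompts interactively for the GenBank sections to download)
db_download()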
# after download, create the local database
db_create()
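To try the package without downloading anything, a small mock database can be built instead. The sketch below assumes that `demo_db_create()` (listed in the function index as "Create demo database") builds a database of mock records and needs no internet connection. The folder name `restez_demo` is arbitrary.

```r
library(restez)

# Use a throwaway folder so the demo does not touch a real database
demo_path <- file.path(tempdir(), "restez_demo")
dir.create(demo_path, showWarnings = FALSE)
restez_path_set(demo_path)

# Build a mock database of 5 records (no download required)
demo_db_create(n = 5)

# Confirm the database is usable and see how many IDs it holds
ready <- restez_ready()
n_ids <- count_db_ids()
print(ready)
print(n_ids)

# Remove the demo database and the restez folder when finished
db_delete(everything = TRUE)
```

Keeping the demo in its own folder means the final `db_delete(everything = TRUE)` cannot remove a database built from real GenBank files.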

Query

# for reproducibility
set.seed(12345)
# get a random accession ID from the database
id <- sample(list_db_ids(), 1)
#> Warning in list_db_ids(): Number of ids returned was limited to [100].
#> Set `n=NULL` to return all ids.
# you can extract:
# sequences
seq <- gb_sequence_get(id)[[1]]
str(seq)
#>  chr "ACCGTTTTGACAGGTAACGTGAAAGCTCTTGGCAACGGGTCTTGATACCGAGTCGGGATCGGTAGTTGTTGCTTTGTTCGTTCACGATTTAAGGTCAACCTTAGCCTTGAGTTTTTCCAAGTAGT"
# definitions
def <- gb_definition_get(id)[[1]]
print(def)
#> [1] "Unidentified RNA clone M33.7"
# organisms
org <- gb_organism_get(id)[[1]]
print(org)
#> [1] "unidentified"
# or whole records
rec <- gb_record_get(id)[[1]]
cat(rec)
#> LOCUS       AF040767                 125 bp    RNA     linear   UNA 06-MAR-1998
#> DEFINITION  Unidentified RNA clone M33.7.
#> ACCESSION   AF040767
#> VERSION     AF040767.1
#> KEYWORDS    .
#> SOURCE      unidentified
#>   ORGANISM  unidentified
#>             unclassified sequences.
#> REFERENCE   1  (bases 1 to 125)
#>   AUTHORS   Pan,W.S., Ji,X.Y., Wang,H.T., Tian,K.G. and Yu,X.L.
#>   TITLE     RNA from plasma of Rhesus monkey(NO.33) which was infected by a
#>             certain patient's serum
#>   JOURNAL   Unpublished
#> REFERENCE   2  (bases 1 to 125)
#>   AUTHORS   Pan,W.S., Ji,X.Y., Wang,H.T., Tian,K.G. and Yu,X.L.
#>   TITLE     Direct Submission
#>   JOURNAL   Submitted (31-DEC-1997) Department of Applied Molecular Biology,
#>             Microbiology & Epidemiology Institution, 20 Dongdajie Street,
#>             Fengtai, Beijing 100071, China
#> FEATURES             Location/Qualifiers
#>      source          1..125
#>                      /organism="unidentified"
#>                      /mol_type="genomic RNA"
#>                      /db_xref="taxon:32644"
#>                      /clone="M33.7"
#>                      /note="from the plasma of Rhesus monkey which was infected
#>                      by plasma of a human patient"
#> ORIGIN      
#>         1 accgttttga caggtaacgt gaaagctctt ggcaacgggt cttgataccg agtcgggatc
#>        61 ggtagttgtt gctttgttcg ttcacgattt aaggtcaacc ttagccttga gtttttccaa
#>       121 gtagt
#> //
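The warning in the example above reflects that `list_db_ids()` returns at most 100 IDs by default. Below is a minimal sketch of a few related checks, run against the mock database from `demo_db_create()` so that no download is needed. The demo IDs (`demo_1`, `demo_2`, ...) and the folder name are assumptions for illustration.

```r
library(restez)

demo_path <- file.path(tempdir(), "restez_query_demo")
dir.create(demo_path, showWarnings = FALSE)
restez_path_set(demo_path)
demo_db_create(n = 5)

# n = NULL lifts the default cap of 100 IDs
all_ids <- list_db_ids(n = NULL)

# Check which accessions are available locally before querying
in_db <- is_in_db(id = c("demo_1", "not_an_id"))
print(in_db)

# FASTA-formatted text for one accession
fasta <- gb_fasta_get(id = "demo_1")
cat(fasta)

# Pull a single element out of a whole record
rec <- gb_record_get(id = "demo_1")
org <- gb_extract(record = rec, what = "organism")
print(org)

db_delete(everything = TRUE)
```

Checking membership with `is_in_db()` first makes it explicit which IDs would need an online lookup.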

Entrez wrappers

# use the entrez_* wrappers to access GB data
res <- entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
#> >AF040767.1 Unidentified RNA clone M33.7
#> ACCGTTTTGACAGGTAACGTGAAAGCTCTTGGCAACGGGTCTTGATACCGAGTCGGGATCGGTAGTTGTT
#> GCTTTGTTCGTTCACGATTTAAGGTCAACCTTAGCCTTGAGTTTTTCCAAGTAGT
# if the id is not in the local database
# these wrappers will search online via the rentrez package
res <- entrez_fetch(db = 'nucleotide', id = c('S71333.1', id),
                    rettype = 'fasta')
#> [1] id(s) are unavailable locally, searching online.
cat(res)
#> >AF040767.1 Unidentified RNA clone M33.7
#> ACCGTTTTGACAGGTAACGTGAAAGCTCTTGGCAACGGGTCTTGATACCGAGTCGGGATCGGTAGTTGTT
#> GCTTTGTTCGTTCACGATTTAAGGTCAACCTTAGCCTTGAGTTTTTCCAAGTAGT
#> 
#> >S71333.1 alpha 1,3 galactosyltransferase [New World monkeys, mermoset lymphoid cell line B95.8, mRNA Partial, 1131 nt]
#> ATGAATGTCAAAGGAAAAGTAATTCTGTCGATGCTGGTTGTCTCAACTGTGATTGTTGTGTTTTGGGAAT
#> ATATCAACAGCCCAGAAGGCTCTTTCTTGTGGATATATCACTCAAAGAACCCAGAAGTTGATGACAGCAG
#> TGCTCAGAAGGACTGGTGGTTTCCTGGCTGGTTTAACAATGGGATCCACAATTATCAACAAGAGGAAGAA
#> GACACAGACAAAGAAAAAGGAAGAGAGGAGGAACAAAAAAAGGAAGATGACACAACAGAGCTTCGGCTAT
#> GGGACTGGTTTAATCCAAAGAAACGCCCAGAGGTTATGACAGTGACCCAATGGAAGGCGCCGGTTGTGTG
#> GGAAGGCACTTACAACAAAGCCATCCTAGAAAATTATTATGCCAAACAGAAAATTACCGTGGGGTTGACG
#> GTTTTTGCTATTGGAAGATATATTGAGCATTACTTGGAGGAGTTCGTAACATCTGCTAATAGGTACTTCA
#> TGGTCGGCCACAAAGTCATATTTTATGTCATGGTGGATGATGTCTCCAAGGCGCCGTTTATAGAGCTGGG
#> TCCTCTGCGTTCCTTCAAAGTGTTTGAGGTCAAGCCAGAGAAGAGGTGGCAAGACATCAGCATGATGCGT
#> ATGAAGACCATCGGGGAGCACATCTTGGCCCACATCCAACACGAGGTTGACTTCCTCTTCTGCATGGATG
#> TGGACCAGGTCTTCCAAGACCATTTTGGGGTAGAGACCCTGGGCCAGTCGGTGGCTCAGCTACAGGCCTG
#> GTGGTACAAGGCAGATCCTGATGACTTTACCTATGAGAGGCGGAAAGAGTCGGCAGCATATATTCCATTT
#> GGCCAGGGGGATTTTTATTACCATGCAGCCATTTTTGGAGGAACACCGATTCAGGTTCTCAACATCACCC
#> AGGAGTGCTTTAAGGGAATCCTCCTGGACAAGAAAAATGACATAGAAGCCGAGTGGCATGATGAAAGCCA
#> CCTAAACAAGTATTTCCTTCTCAACAAACCCTCTAAAATCTTATCTCCAGAATACTGCTGGGATTATCAT
#> ATAGGCCTGCCTTCAGATATTAAAACTGTCAAGCTATCATGGCAAACAAAAGAGTATAATTTGGTTAGAA
#> AGAATGTCTGA
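The wrapper can also be asked for whole GenBank records rather than FASTA text. The sketch below assumes that `rettype = "gb"` is supported (the function index lists `entrez_gb_get` alongside `entrez_fasta_get`). It uses the mock database from `demo_db_create()`, so both IDs should be found locally and no online search is expected. The folder name is arbitrary.

```r
library(restez)

demo_path <- file.path(tempdir(), "restez_entrez_demo")
dir.create(demo_path, showWarnings = FALSE)
restez_path_set(demo_path)
demo_db_create(n = 5)

# Whole records in GenBank text format for two local IDs
gb_res <- entrez_fetch(db = "nucleotide", id = c("demo_1", "demo_2"),
                       rettype = "gb")
cat(gb_res)

db_delete(everything = TRUE)
```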

Contributing

Want to contribute? Check the contributing page.

Licence

MIT

Citation

Bennett et al. (2018). restez: Create and Query a Local Copy of GenBank in R. Journal of Open Source Software, 3(31), 1102. https://doi.org/10.21105/joss.01102

References

Benson, D. A., Karsch-Mizrachi, I., Clark, K., Lipman, D. J., Ostell, J., & Sayers, E. W. (2012). GenBank. Nucleic Acids Research, 40(Database issue), D48–D53. https://doi.org/10.1093/nar/gkr1202

Winter, D. J. (2017). rentrez: An R package for the NCBI eUtils API. PeerJ Preprints, 5, e3179v2. https://doi.org/10.7287/peerj.preprints.3179v2

Maintainer

Joel Nitta

This package was previously developed and maintained by Dom Bennett.


Version

2.1.5

License

MIT + file LICENSE

Last Published

March 7th, 2025

Functions in restez (2.1.5)

dwnld_path_get - Get dwnld path
extract_inforecpart - Extract the information record part
extract_keywords - Extract keywords
dir_size - Calculate the size of a directory
demo_db_create - Create demo database
extract_accession - Extract accession
entrez_gb_get - Get Entrez GenBank record
extract_by_patterns - Extract by keyword
extract_clean_sequence - Extract clean sequence from sequence part
db_sqlngths_log - Log the min and max sequence lengths
extract_definition - Extract definition
dwnld_rcrd_log - Log a downloaded file in the restez path
extract_features - Extract features
extract_version - Extract version
extract_organism - Extract organism
filename_log - Write filenames to log files
flatfile_read - Read flatfile sequence records
extract_locus - Extract locus
gbrelease_get - Get the GenBank release number in the restez path
gbrelease_check - Check if the last GenBank release number is the latest
gb_build - Read and add .seq files to database
gb_definition_get - Get definition from GenBank
extract_seqrecpart - Extract the sequence record part
gb_df_create - Create GenBank data.frame
gb_df_generate - Generate GenBank records data.frame
file_download - Download a file
last_entry_get - Return the last entry
gb_extract - Extract elements of a GenBank record
gb_fasta_get - Get fasta from GenBank
last_dwnld_get - Return date and time of the last download
last_add_get - Return date and time of the last added sequence
gb_sequence_get - Get sequence from GenBank
gb_sql_add - Add to GenBank SQL database
gbrelease_log - Log the GenBank release number in the restez path
extract_sequence - Extract sequence
gb_sql_query - Query the GenBank SQL
latest_genbank_release - Retrieve latest GenBank release number
ncbi_acc_get - Get accession numbers by querying NCBI GenBank
mock_seq - Mock seq
mock_rec - Mock rec
is_in_db - Is in db
identify_downloadable_files - Identify downloadable files
gb_version_get - Get version from GenBank
gb_record_get - Get record from GenBank
latest_genbank_release_notes - Download the latest GenBank Release Notes
has_data - Does the connected database have data?
list_db_ids - List database IDs
gb_organism_get - Get organism from GenBank
message_missing - Produce message of missing IDs
restez_connect - Connect to the restez database
restez_path_set - Set restez path
restez_path_unset - Unset restez path
restez_disconnect - Disconnect from restez database
sql_path_get - Get SQL path
stat - Print blue
mock_gb_df_generate - Generate mock GenBank records data.frame
mock_org - Mock org
restez_ready - Is restez ready?
restez_rl - Restez readline
seshinfo_log - Log the system session information in restez path
setup - Set up test common test data
mock_def - Mock def
restez_path_check - Check restez filepath
restez_path_get - Get restez path
predict_datasizes - Print file size predictions to screen
slctn_log - Log the GenBank selection made by a user
slctn_get - Retrieve GenBank selections made by user
print.status - Print method for status class
readme_log - Create README in restez_path
restez - restez: Create and Query a Local Copy of GenBank in R
restez_status - Check restez status
record - Example GenBank record
search_gz - Scan a gzipped file for text
status_class - Generate a list class for storing status information
testdatadir_get - Get test data directory
db_delete - Delete database
cleanup - Clean up test data
count_db_ids - Return the number of ids
connection_get - Retrieve restez connection
connected - Is restez connected?
char - Print green
db_create - Create new NCBI database
add_rcrd_log - Log files added to the SQL database in the restez path
check_connection - Helper function to test if a stable internet connection can be established.
cat_line - Cat lines
entrez_fetch - Entrez fetch
db_sqlngths_get - Return the minimum and maximum sequence lengths in db
db_download - Download database
entrez_fasta_get - Get Entrez fasta
db_download_intern - Download database (internal version)