getGenome: Genome Retrieval

Description

This function retrieves a FASTA file containing the genome of an organism of interest and stores it, by default, in the folder '_ncbi_downloads/genomes'.

Usage

getGenome(db = "refseq", kingdom, organism, path = file.path("_ncbi_downloads", "genomes"))

Arguments

db

a character string specifying the database from which the genome shall be retrieved: refseq or genbank. Right now only the refseq database is included. Later versions of biomartr will also allow sequence retrieval from additional databases.

kingdom

a character string specifying the kingdom of the organism of interest, e.g. "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa", "vertebrate_mammalian", or "vertebrate_other".

organism

a character string specifying the scientific name of the organism of interest, e.g. 'Arabidopsis thaliana'.

path

a character string specifying the location (a folder) in which the corresponding genome shall be stored. Default is path = file.path("_ncbi_downloads","genomes").
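The argument descriptions above can be turned into a small input check before any download is attempted. The following is a minimal sketch in base R. The helper name check_genome_args is hypothetical and not part of biomartr, and treating the listed kingdoms as the complete set of accepted values is an assumption, since the description only gives them as examples.

```r
# Values taken from the argument descriptions above. Treating them as the
# complete set of accepted values is an assumption made for this sketch.
valid_dbs <- c("refseq", "genbank")
valid_kingdoms <- c("archaea", "bacteria", "fungi", "invertebrate", "plant",
                    "protozoa", "vertebrate_mammalian", "vertebrate_other")

# Hypothetical helper (not part of biomartr): validate the arguments and
# return them as a list. match.arg() stops with an error for unknown values
# and accepts unambiguous partial matches.
check_genome_args <- function(db = "refseq", kingdom, organism) {
  db <- match.arg(db, valid_dbs)
  kingdom <- match.arg(kingdom, valid_kingdoms)
  stopifnot(is.character(organism), length(organism) == 1L, nzchar(organism))
  list(db = db, kingdom = kingdom, organism = organism)
}

checked <- check_genome_args(db = "refseq",
                             kingdom = "plant",
                             organism = "Arabidopsis thaliana")
```

Checking the values up front gives a clear error message for a misspelled kingdom before any network access happens.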

Value

A data.table storing the gene IDs in the first column and the DNA sequence in the second column.

Details

Internally this function loads the overview.txt file from NCBI:

refseq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/

It then creates a directory '_ncbi_downloads/genomes' to store the genome of interest as a fasta file for future processing. If the corresponding fasta file already exists within the '_ncbi_downloads/genomes' folder and is accessible within the workspace, no download is performed.
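The skip-if-present behaviour described above can be illustrated with a minimal base R sketch. It is not the biomartr implementation: the helpers genome_file_path and fetch_if_missing are hypothetical, the file naming pattern is inferred from the example file name 'Arabidopsis_thaliana_genomic.fna.gz', and a stand-in function replaces the real download so the sketch runs offline.

```r
# Hypothetical helper (not part of biomartr): build the local file name
# following the pattern seen in the examples, e.g.
# "Arabidopsis thaliana" -> "Arabidopsis_thaliana_genomic.fna.gz".
genome_file_path <- function(organism,
                             path = file.path("_ncbi_downloads", "genomes")) {
  file_name <- paste0(gsub(" ", "_", organism, fixed = TRUE), "_genomic.fna.gz")
  file.path(path, file_name)
}

# Hypothetical helper: call download_fun(target) only when the target file
# is not already present. Returns TRUE if a download was triggered.
fetch_if_missing <- function(organism, path, download_fun) {
  target <- genome_file_path(organism, path)
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  if (file.exists(target)) {
    return(FALSE)
  }
  download_fun(target)
  TRUE
}

# Offline demonstration in a fresh temporary folder, using a stand-in
# downloader that only writes a placeholder line to the target file.
demo_dir <- file.path(tempfile("genome_demo_"), "_ncbi_downloads", "genomes")
fake_download <- function(target) writeLines(">placeholder", target)

first_call  <- fetch_if_missing("Arabidopsis thaliana", demo_dir, fake_download)
second_call <- fetch_if_missing("Arabidopsis thaliana", demo_dir, fake_download)
# first_call is TRUE (file was created), second_call is FALSE (file reused)
```

Passing the downloader as a function keeps the existence check separate from the network code, so the caching logic can be tried without contacting NCBI.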

References

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq

ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank

http://www.ncbi.nlm.nih.gov/refseq/about/

Examples

## Not run: 
# 
# # download the genome of Arabidopsis thaliana from refseq
# # and store the corresponding genome file in '_ncbi_downloads/genomes'
# getGenome( db       = "refseq", 
#            kingdom  = "plant", 
#            organism = "Arabidopsis thaliana", 
#            path = file.path("_ncbi_downloads","genomes"))
# 
# file_path <- file.path("_ncbi_downloads","genomes","Arabidopsis_thaliana_genomic.fna.gz")
# Ath_genome <- read_genome(file_path, format = "fasta")
# 
# 
# # download the genome of Arabidopsis thaliana from genbank
# # and store the corresponding genome file in '_ncbi_downloads/genomes'
# getGenome( db       = "genbank", 
#            kingdom  = "plant", 
#            organism = "Arabidopsis thaliana", 
#            path = file.path("_ncbi_downloads","genomes"))
# 
# file_path <- file.path("_ncbi_downloads","genomes","Arabidopsis_thaliana_genomic.fna.gz")
# Ath_genome <- read_genome(file_path, format = "fasta")
# ## End(Not run)
