getCDS: Coding Sequence Retrieval

Description

This function retrieves a FASTA file storing the coding sequences (CDS) of the genome of an organism of interest and stores this file in the folder '_ncbi_downloads/CDS'.

Usage

getCDS(db = "refseq", kingdom, organism, path = file.path("_ncbi_downloads", "CDS"), delete_corrupt = FALSE)

Arguments

db

a character string specifying the database from which the CDS file shall be retrieved: "refseq". Right now only the RefSeq database is included. Later versions of biomartr will also allow sequence retrieval from additional databases.

kingdom

a character string specifying the kingdom of the organism of interest, e.g. "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa", "vertebrate_mammalian", or "vertebrate_other".

organism

a character string specifying the scientific name of the organism of interest, e.g. 'Arabidopsis thaliana'.

path

a character string specifying the location (a folder) in which the corresponding CDS file shall be stored. Default is path = file.path("_ncbi_downloads","CDS").

delete_corrupt

a logical value specifying whether potentially corrupt CDS sequences, i.e. sequences whose length is not divisible by 3, shall be excluded from the dataset. Default is delete_corrupt = FALSE.
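To illustrate what the divisible-by-3 check means, here is a minimal sketch in base R. It is not biomartr's actual implementation; the vector `cds` and its gene names are made up for illustration.

```r
# Hypothetical toy sequences; names and contents are made up for illustration.
cds <- c(gene1 = "ATGGCCTAA", gene2 = "ATGGCCTA", gene3 = "ATGTAA")

# A sequence counts as potentially corrupt when its length
# is not a multiple of 3, i.e. it does not consist of complete codons.
is_corrupt <- nchar(cds) %% 3 != 0

# Keep only the sequences made up of complete codons.
cds_clean <- cds[!is_corrupt]
print(names(cds_clean))
```

In this toy example only gene2 (8 nucleotides) would be dropped.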

Value

A data.table storing the gene ids in the first column and the DNA sequence in the second column.

Details

Internally this function loads the overview.txt file from NCBI:

refseq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

and creates a directory '_ncbi_downloads/CDS' to store the CDS of the genome of interest as a FASTA file for future processing. If the corresponding FASTA file already exists within the '_ncbi_downloads/CDS' folder and is accessible within the workspace, no download is performed. To download the CDS file again, delete the folder first.
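The "download only if the file is missing" behaviour described above can be sketched in base R as follows. This is an assumption-laden illustration, not the package's internal code: `fetch_if_missing` and `download_fun` are hypothetical names, and the "download" here just writes a placeholder file into a temporary directory.

```r
# Hypothetical helper illustrating the "download only if missing" pattern.
# `download_fun` stands in for whatever performs the real download.
fetch_if_missing <- function(file, download_fun) {
  dir <- dirname(file)
  # Create the target folder on first use.
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  # A cached copy exists: skip the download.
  if (file.exists(file)) {
    return(invisible(FALSE))
  }
  download_fun(file)
  invisible(TRUE)
}

# Usage with a dummy "download" that writes a tiny FASTA placeholder.
target <- file.path(tempdir(), "_ncbi_downloads", "CDS", "example.fna")
unlink(target)  # make sure the sketch starts without a cached file
dummy  <- function(f) writeLines(c(">gene1", "ATGTAA"), f)
first  <- fetch_if_missing(target, dummy)  # file is written
second <- fetch_if_missing(target, dummy)  # cached copy is reused
```

Deleting `target` (or its folder) before the next call would trigger the download step again, mirroring the behaviour described for '_ncbi_downloads/CDS'.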

References

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq

http://www.ncbi.nlm.nih.gov/refseq/about/

Examples


## Not run: 
# 
# # download the CDS of Arabidopsis thaliana from refseq
# # and store the corresponding CDS file in '_ncbi_downloads/CDS'
# getCDS( db       = "refseq", 
#         kingdom  = "plant", 
#         organism = "Arabidopsis thaliana", 
#         path     = file.path("_ncbi_downloads","CDS"))
# 
# 
# file_path <- file.path("_ncbi_downloads","CDS","Arabidopsis_thaliana_rna.fna.gz")
# Ath_CDS <- read_cds(file_path, format = "fasta")
# 
# ## End(Not run)
