getCDS: Coding Sequence Retrieval

Description

This function downloads a FASTA file containing the coding sequences (CDS) of the genome of an organism of interest and stores it in the folder '_ncbi_downloads/CDS'.

Usage

getCDS(db = "refseq", kingdom, organism, path = file.path("_ncbi_downloads",
  "CDS"), delete_corrupt = FALSE)

Arguments

db

a character string specifying the database from which the CDS file shall be retrieved: 'refseq'. Currently only the RefSeq database is supported. Later versions of biomartr will also allow sequence retrieval from additional databases.

kingdom

a character string specifying the kingdom of the organism of interest, e.g. "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa", "vertebrate_mammalian", or "vertebrate_other".

organism

a character string specifying the scientific name of the organism of interest, e.g. 'Arabidopsis thaliana'.

path

a character string specifying the location (a folder) in which the corresponding CDS file shall be stored. Default is path = file.path("_ncbi_downloads","CDS").

delete_corrupt

a logical value specifying whether CDS sequences whose length is not divisible by 3 shall be excluded from the dataset. Default is delete_corrupt = FALSE.
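
The divisibility check behind delete_corrupt can be illustrated in base R. This is only a minimal sketch on made-up toy data, not the code biomartr runs internally; the object names (cds, is_complete, cds_clean) and the column name seqs are assumptions chosen for illustration.

```r
# Hypothetical toy data (not real gene IDs): two CDS made of whole codons
# and one whose length (8) is not a multiple of 3.
cds <- data.frame(
  geneids = c("gene1", "gene2", "gene3"),
  seqs    = c("ATGAAATAA", "ATGAAATA", "ATGTAG"),
  stringsAsFactors = FALSE
)

# A coding sequence should consist of whole codons,
# so its length must be divisible by 3.
is_complete <- nchar(cds$seqs) %% 3 == 0

# Keep only the sequences that pass the check.
cds_clean <- cds[is_complete, ]
print(cds_clean$geneids)
```

Here the second sequence is dropped because it has 8 nucleotides; the other two are kept.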

Value

A data.table storing the gene IDs in the first column and the DNA sequence in the second column.

Details

Internally, this function loads the overview.txt file from NCBI:

refseq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

and creates a directory '_ncbi_downloads/CDS' to store the CDS of the genome of interest as a fasta file for future processing. If the corresponding fasta file already exists within the '_ncbi_downloads/CDS' folder and is accessible within the workspace, no download is performed. To download the corresponding CDS file again, delete the folder first.
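
The download-only-if-missing behaviour described above can be sketched in base R as follows. This is an assumption-laden illustration, not biomartr's implementation: fetch_once and fake_download are hypothetical helpers, the URL is a placeholder, and a stand-in downloader is used so the sketch runs without network access.

```r
# Hypothetical helper (not part of biomartr): download a file only
# if it is not already present at the destination.
fetch_once <- function(url, dest_file,
                       download_fun = function(url, destfile) {
                         utils::download.file(url, destfile, mode = "wb")
                       }) {
  # Create the target folder if needed, e.g. '_ncbi_downloads/CDS'.
  dir.create(dirname(dest_file), recursive = TRUE, showWarnings = FALSE)
  if (file.exists(dest_file)) {
    message("File already present, skipping download: ", dest_file)
    return(invisible(dest_file))
  }
  download_fun(url, dest_file)
  invisible(dest_file)
}

# Stand-in downloader that writes a tiny fasta file and counts its calls.
calls <- 0
fake_download <- function(url, destfile) {
  calls <<- calls + 1
  writeLines(c(">gene1", "ATGTAA"), destfile)
}

dest <- file.path(tempdir(), "_ncbi_downloads", "CDS", "example_cds.fna")
unlink(dest)  # start from a clean state

fetch_once("ftp://example.invalid/example_cds.fna", dest, fake_download)
fetch_once("ftp://example.invalid/example_cds.fna", dest, fake_download)
print(calls)  # the second call finds the file and does not download again
```

Removing the stored file (for example with unlink(dest)) makes the next call download it again, which mirrors the advice to delete the folder before re-downloading.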

References

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq

http://www.ncbi.nlm.nih.gov/refseq/about/

Examples

# download the CDS of the Arabidopsis thaliana genome from refseq
# and store the corresponding CDS file in '_ncbi_downloads/CDS'
getCDS( db       = "refseq",
        kingdom  = "plant",
        organism = "Arabidopsis thaliana",
        path     = file.path("_ncbi_downloads","CDS"))


file_path <- file.path("_ncbi_downloads","CDS","Arabidopsis_thaliana_rna.fna.gz")
Ath_CDS <- read_cds(file_path, format = "fasta")
