Main proteome retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest, the corresponding fasta files storing their proteomes will be downloaded and stored locally. Proteome files can be retrieved from several databases.
getProteomeSet(db = "refseq", organisms, reference = FALSE,
release = NULL, clean_retrieval = TRUE, gunzip = TRUE,
update = FALSE, path = "set_proteomes")
a character string specifying the database from which the proteome shall be retrieved:
db = "refseq"
db = "genbank"
db = "ensembl"
a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism:
by scientific name: e.g. organisms = "Homo sapiens"
by database specific accession identifier: e.g. organisms = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)
by taxonomic identifier from NCBI Taxonomy: e.g. organisms = "9606" (= taxid of Homo sapiens)
a logical value indicating whether or not a proteome shall be downloaded if it isn't marked in the database as either a reference proteome or a representative proteome.
the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used.
a logical value indicating whether or not downloaded files shall be renamed for more convenient downstream data analysis.
a logical value indicating whether or not files should be unzipped.
a logical value indicating whether files that were already downloaded and are still present in the output folder shall be updated and re-loaded (update = TRUE) or whether the existing files shall be retained (update = FALSE, the default).
a character string specifying the location (a folder) in which the corresponding proteomes shall be stored. Default is path = "set_proteomes".
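The three ways of characterizing an organism listed for the organisms argument differ in shape, so they can be told apart with simple pattern checks. The sketch below is illustrative only: classify_organism_id() is a hypothetical helper that is not part of the package, and its regular expressions are assumptions about NCBI-style identifiers (assembly accessions prefixed with GCF_ for RefSeq or GCA_ for GenBank, taxids made of digits only). It does not describe how the function resolves identifiers internally.

```r
# Hypothetical helper (not part of the package): guess which of the three
# identifier styles each entry of an `organisms` vector uses.
classify_organism_id <- function(organisms) {
  stopifnot(is.character(organisms))
  # Anything not matched below is treated as a scientific name.
  kind <- rep("scientific name", length(organisms))
  # Assumption: taxonomic identifiers consist of digits only, e.g. "9606".
  kind[grepl("^[0-9]+$", organisms)] <- "taxid"
  # Assumption: NCBI assembly accessions start with GCF_ (RefSeq) or
  # GCA_ (GenBank), followed by digits and a version suffix.
  kind[grepl("^GC[AF]_[0-9]+\\.[0-9]+$", organisms)] <- "accession"
  names(kind) <- organisms
  kind
}

ids <- c("Homo sapiens", "GCF_000001405.37", "9606")
classify_organism_id(ids)
```

A check like this can be handy for validating a mixed organisms vector before starting a long download, though the function accepts all three styles directly.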
File path to downloaded proteomes.
Internally this function loads the overview.txt file from NCBI:
refseq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/
and creates a directory 'set_proteomes' to store the proteomes of interest as fasta files for future processing. In case the corresponding fasta file already exists within the 'set_proteomes' folder and is accessible within the workspace, no download process will be performed.
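The caching behaviour described above, combined with the update argument, amounts to a simple check before each download. The following is a minimal sketch under stated assumptions: needs_download() is a hypothetical helper, not a package function, and the file name is made up for illustration, since the naming scheme of the stored files is not specified here.

```r
# Hypothetical helper (not part of the package): decide whether a proteome
# file has to be fetched, mirroring the documented `update` semantics.
needs_download <- function(file_name, path = "set_proteomes", update = FALSE) {
  target <- file.path(path, file_name)
  # Download when the file is missing, or when the caller asks for a refresh.
  update || !file.exists(target)
}

# Demonstration in a fresh temporary folder so the workspace stays untouched.
demo_dir <- tempfile("set_proteomes_")
dir.create(demo_dir)
demo_file <- "Example_organism_protein.faa"  # made-up name for illustration

needs_download(demo_file, path = demo_dir)                 # file not present yet
file.create(file.path(demo_dir, demo_file))
needs_download(demo_file, path = demo_dir)                 # file is now cached
needs_download(demo_file, path = demo_dir, update = TRUE)  # forced refresh
```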
getGenomeSet, getCDSSet, getRNASet, getGFFSet, getCDS, getGFF, getRNA, meta.retrieval, read_proteome
# NOT RUN {
getProteomeSet("refseq", organisms = c("Arabidopsis thaliana",
                                       "Arabidopsis lyrata",
                                       "Capsella rubella"))
# }
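A variant of the example above that spells out some of the arguments from the usage section may be easier to adapt. This is a sketch rather than a tested recipe: it assumes the function is provided by the biomartr package, and the run_download switch is a hypothetical variable added here so that sourcing the snippet does not contact the NCBI servers by accident.

```r
# Organisms from the example above, plus explicit arguments taken from the
# usage section.
organisms <- c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella")
out_dir <- file.path(tempdir(), "set_proteomes")

# Hypothetical switch: set to TRUE to actually download from NCBI RefSeq.
run_download <- FALSE

if (run_download && requireNamespace("biomartr", quietly = TRUE)) {
  proteome_files <- biomartr::getProteomeSet(
    db = "refseq",
    organisms = organisms,
    gunzip = TRUE,
    update = FALSE,
    path = out_dir
  )
  # Per the return value described above, this holds the stored file paths.
  print(proteome_files)
}
```

Writing to a folder under tempdir() keeps the demonstration from creating a set_proteomes directory in the working directory; for real analyses a persistent path is the better choice, since it lets later calls reuse the files.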