build.taxon.metabolome: build.taxon.metabolome

Description

Uses downloaded and properly formatted local PubChem data, created by the 'get.pubchem.ftp' function, to filter a dataset created by the 'build.pubchem.bio' function.

Usage

build.taxon.metabolome(
  pc.directory = NULL,
  taxid = c(),
  get.properties = FALSE,
  full.scored = TRUE,
  keep.scored.only = FALSE,
  aggregation.function = max,
  threads = 8,
  db.name = "custom.metabolome",
  rcdk.desc = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"),
  pubchem.bio.object = NULL,
  cid.lca.object = NULL,
  taxid.hierarchy.object = NULL,
  output.directory = NULL
)

Value

A data frame containing the PubChem CID ('cid') and the lowest common ancestor ('lca') as an integer NCBI taxonomy ID. The result is also saved to pc.directory as an .Rdata file.
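
Because the result is saved as an .Rdata file, reading it back restores the object under whatever name it was saved with. The following is a minimal sketch of that round trip using base R only. The file name, the object name 'toy.db', and the 'lca' values are invented for illustration; the name of the object that build.taxon.metabolome actually stores in its file is not stated here.

```r
# Stand-in for a saved metabolome; the columns follow the Value description.
# The lca values are invented for illustration, not package output.
toy.db <- data.frame(cid = c(962L, 5793L), lca = c(1L, 131567L))

# Hypothetical file name, written to a temporary directory.
toy.path <- file.path(tempdir(), "custom.metabolome.Rdata")
save(toy.db, file = toy.path)

# load() restores objects under their saved names and returns those names.
# Loading into a fresh environment avoids overwriting workspace objects.
env <- new.env()
loaded.names <- load(toy.path, envir = env)
restored <- get(loaded.names[1], envir = env)
```

Inspecting 'loaded.names' is a simple way to discover what a given .Rdata file contains before using it.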

Arguments

pc.directory: directory from which to load pubchem .Rdata files
taxid: integer vector of NCBI taxonomy IDs, e.g. c(9606, 1425170) for Homo sapiens and Homo heidelbergensis.
get.properties: logical. If TRUE, returns rcdk-calculated properties for the descriptors given in 'rcdk.desc'. Per the Usage section, the default descriptors are XLogP, acidic group count, basic group count, and TPSA.
full.scored: logical. default = TRUE. When FALSE, only metabolites which map to the taxid(s) are returned. When TRUE, all metabolites are returned, with scores assigned based on the distance of non-mapped metabolites to the root node. That is, specialized metabolites from distantly related species will score at or near zero, specialized metabolites of more similar species will score higher, and more conserved metabolites will score higher than more specialized ones.
keep.scored.only: logical. If TRUE, biological metabolites with NA for the taxonomy score are removed before returning.
aggregation.function: function. default = max. Can be mean, median, min, etc., or a custom function. Defines how the aggregate score is calculated when multiple taxids are used.
threads: integer. Number of threads to use when calculating rcdk properties. Parallel processing uses the doParallel and foreach packages.
db.name: character. File name for the saved version of this database. default = 'custom.metabolome', but could be 'taxid.4071' or 'Streptomyces', etc. Saved as an .Rdata file in the 'pc.directory' location.
rcdk.desc: character vector of valid rcdk descriptors. default = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"). To see descriptor categories: 'dc <- rcdk::get.desc.categories(); dc'. To see the descriptors within one category: 'dn <- rcdk::get.desc.names(dc[4]); dn'. Note that the four default descriptors are relatively fast to calculate; some descriptors take a very long time. You can calculate as many as you wish, but processing time increases as more descriptors are added.
pubchem.bio.object: R data.table, generally produced by build.pubchem.bio. Defining pc.directory instead is preferred.
cid.lca.object: R data.table, generally produced by build.cid.lca. Defining pc.directory instead is preferred.
taxid.hierarchy.object: R data.table, generally produced by get.pubchem.ftp. Defining pc.directory instead is preferred.
output.directory: directory to which the pubchem.bio database is saved. If NULL, will try to save in pc.directory (if provided), else not saved.
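
The 'aggregation.function' argument accepts any function that reduces several scores to one. The following is a toy sketch in base R of how different choices behave when a metabolite has one score per taxid. The score matrix, the helper 'mean.na.rm', and the row-wise application are assumptions made for illustration; they are not taken from the package internals.

```r
# Toy per-taxid scores for three metabolites (rows) against two taxids (columns).
# These values are invented for illustration; they are not package output.
scores <- matrix(
  c(1.0, 0.2,
    0.5, 0.5,
    NA,  0.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cid.a", "cid.b", "cid.c"), c("taxid.1", "taxid.2"))
)

# Built-in aggregators such as max return NA when any input is NA.
agg.max <- apply(scores, 1, max)

# Hypothetical custom aggregator that ignores missing scores.
mean.na.rm <- function(x) mean(x, na.rm = TRUE)
agg.mean <- apply(scores, 1, mean.na.rm)
```

A custom function like 'mean.na.rm' could be passed as 'aggregation.function = mean.na.rm' if missing per-taxid scores should not propagate to the aggregate; whether the package passes scores containing NA to the function is not confirmed here.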

Author

Corey Broeckling

Details

Uses downloaded and properly formatted local PubChem data created by the 'get.pubchem.ftp' function.

Examples

library(pubchem.bio)
# Load the example datasets shipped with the package
data('cid.lca', package = "pubchem.bio")
data('pubchem.bio', package = "pubchem.bio")
data('taxid.hierarchy', package = "pubchem.bio")
# Build a metabolome for taxid 1 from the in-memory objects
my.taxon.db <- build.taxon.metabolome(
  pubchem.bio.object = pubchem.bio,
  cid.lca.object = cid.lca,
  taxid.hierarchy.object = taxid.hierarchy,
  get.properties = FALSE,
  threads = 1,
  taxid = c(1)
)
head(my.taxon.db)