build.primary.metabolome: build.primary.metabolome

Description

Uses downloaded and properly formatted local PubChem data, created by the 'get.pubchem.ftp' function, to filter a dataset created by the 'build.pubchem.bio' function down to primary metabolites.

Usage

build.primary.metabolome(
  pc.directory = NULL,
  get.properties = FALSE,
  threads = 8,
  db.name = "primary.metabolome",
  rcdk.desc = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"),
  pubchem.bio.object = NULL,
  output.directory = NULL,
  keep.primary.only = TRUE,
  min.tax.ct = 3
)

Value

A data frame containing the PubChem CID ('cid') and the lowest common ancestor ('lca') as an integer NCBI taxonomy ID. The result is also saved as an .Rdata file; see 'output.directory' for the save location.
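
Because the result is also written to disk as an .Rdata file, it can be reloaded in a later session. The sketch below is a base R round trip under stated assumptions: the file name is assumed to be 'db.name' plus the '.Rdata' extension, the stored object name is assumed to match, and the two-row data frame is placeholder data rather than real output.

```r
# Placeholder stand-in for the returned data frame (values are not real data).
primary.metabolome <- data.frame(cid = c(1L, 2L), lca = c(1L, 2L))

# Assumed naming scheme: <directory>/<db.name>.Rdata
out.dir <- tempdir()
db.name <- "primary.metabolome"
db.file <- file.path(out.dir, paste0(db.name, ".Rdata"))

# Write the object, drop it from the session, then restore it from disk.
save(primary.metabolome, file = db.file)
rm(primary.metabolome)
loaded.names <- load(db.file)  # load() returns the names of restored objects

# Inspect whatever object the file contained.
str(get(loaded.names[1]))
```

Capturing the return value of load() avoids guessing the name of the stored object.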

Arguments

pc.directory: directory from which to load pubchem .Rdata files
get.properties: logical. If TRUE, will also return the rcdk calculated properties named in 'rcdk.desc' (by default XLogP, acidic group count, basic group count and TPSA).
threads: integer. Number of threads to use when calculating rcdk properties. Parallel processing is done via the doParallel and foreach packages.
db.name: character. File name for the saved version of this database. Default = 'primary.metabolome'. Saved as an .Rdata file in the 'pc.directory' location (see also 'output.directory').
rcdk.desc: character vector of valid rcdk descriptors. Default = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"). To see descriptor categories: 'dc <- rcdk::get.desc.categories(); dc'. To see the descriptors within one category: 'dn <- rcdk::get.desc.names(dc[4]); dn'. Note that the four default descriptors are relatively fast to calculate; some descriptors take a very long time. You can calculate as many as you wish, but processing time increases with each descriptor added.
pubchem.bio.object: R data.table, generally produced by build.pubchem.bio. Defining pc.directory instead is preferred.
output.directory: directory to which the pubchem.bio database is saved. If NULL, will try to save in pc.directory (if provided), else not saved.
keep.primary.only: logical. If TRUE, only biological metabolites scored as 'primary' are returned. If FALSE, full dataset of metabolites is returned, with new logical column, 'primary'
min.tax.ct: integer. If assigned an integer value, only those metabolites with at least min.tax.ct unique taxonomy assignments are considered 'primary'. Default = 3.
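
The interaction of 'min.tax.ct' and 'keep.primary.only' can be illustrated with a toy table in base R. This is only a sketch of the threshold rule described above, not the package's implementation: the column names 'cid' and 'taxid' and the helper objects are hypothetical, and the real scoring may use additional criteria.

```r
# Toy CID-to-taxonomy table; column names are illustrative assumptions,
# not the package's internal schema.
cid.taxid <- data.frame(
  cid   = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  taxid = c(9606, 10090, 4932, 9606, 9606, 9606, 562, 3702, 7227)
)

min.tax.ct <- 3
keep.primary.only <- TRUE

# Count unique taxonomy IDs per CID (repeated IDs are counted once).
tax.ct <- tapply(cid.taxid$taxid, cid.taxid$cid, function(x) length(unique(x)))

# Flag a metabolite as 'primary' when it reaches the threshold.
scored <- data.frame(
  cid     = as.numeric(names(tax.ct)),
  tax.ct  = as.integer(tax.ct),
  primary = as.integer(tax.ct) >= min.tax.ct
)

# keep.primary.only = TRUE keeps only the flagged rows;
# FALSE keeps every row along with the logical 'primary' column.
result <- if (keep.primary.only) scored[scored$primary, ] else scored
print(result)
```

In this toy table CID 2 is linked to a single unique taxonomy ID, so it falls below the threshold of 3 and is dropped when keep.primary.only is TRUE.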

Author

Corey Broeckling

Examples

data('pubchem.bio', package = "pubchem.bio")
my.primary.db <- build.primary.metabolome(
  pubchem.bio.object = pubchem.bio,
  get.properties = FALSE,
  threads = 1
)
head(my.primary.db)

Details

Uses downloaded and properly formatted local PubChem data created by the 'get.pubchem.ftp' function.