build.pubchem.bio: build.pubchem.bio

Description

Builds the pubchem.bio database from downloaded and properly formatted local PubChem data created by the 'get.pubchem.ftp' function.

Usage

build.pubchem.bio(
  pc.directory = NULL,
  use.bio.sources = TRUE,
  bio.sources = c("Metabolomics Workbench", "Human Metabolome Database (HMDB)", "ChEBI",
    "LIPID MAPS", "MassBank of North America (MoNA)"),
  use.pathways = TRUE,
  pathway.sources = NULL,
  use.taxid = TRUE,
  taxonomy.sources = NULL,
  use.parent.cid = TRUE,
  use.parent.when.charged = FALSE,
  remove.salts = TRUE,
  remove.inorganics = FALSE,
  mw.range = c(50, 2000),
  get.properties = TRUE,
  threads = 8,
  rcdk.desc = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"),
  cid.lca.object = NULL,
  cid.sid.object = NULL,
  cid.pwid.object = NULL,
  cid.parent.object = NULL,
  cid.taxid.object = NULL,
  cid.formula.object = NULL,
  cid.smiles.object = NULL,
  cid.inchikey.object = NULL,
  cid.inchi.object = NULL,
  cid.monoisotopic.mass.object = NULL,
  cid.title.object = NULL,
  cid.cas.object = NULL,
  cid.pmid.ct.object = NULL,
  output.directory = NULL
)

Value

a data frame containing PubChem CID, title, formula, monoisotopic molecular weight, InChIKey, SMILES, CAS, and, optionally, rcdk properties

Arguments

pc.directory: directory from which to load PubChem .Rdata files. Alternatively, provide R data.tables for ALL cid.property.object options defined below.
use.bio.sources: logical. If TRUE (default), use the bio.sources vector of sources, incorporating all CIDs from those bio databases.
bio.sources: vector of source names from which to extract PubChem CIDs. All available sources are listed at https://pubchem.ncbi.nlm.nih.gov/sources/; "PubChemLite" can additionally be used as a data source. defaults to c("Metabolomics Workbench", "Human Metabolome Database (HMDB)", "ChEBI", "LIPID MAPS", "MassBank of North America (MoNA)")
use.pathways: logical. should all CIDs from any biological pathway data be incorporated into database?
pathway.sources: character. vector of sources to be used when adding metabolites to pubchem bio database. default = NULL, using all pathway sources.
use.taxid: logical. should all CIDs associated with a taxonomic identifier (taxid) be used?
taxonomy.sources: character. vector of sources to be used when adding taxonomically related metabolites to database. Default = NULL, using all sources.
use.parent.cid: logical. should CIDs be replaced with parent CIDs? default = TRUE.
use.parent.when.charged: logical. default = FALSE. Only applies when use.parent.cid = TRUE. If TRUE, the parent CID is always chosen. If FALSE, the neutral molecule is used, even if that is the child molecule. See CID 1 and CID 2 for an example.
remove.salts: logical. should salts be removed from dataset? default = TRUE. Salts are recognized by the presence of '.' in the SMILES string. Performed after 'use.parent.cid'.
remove.inorganics: logical. should inorganic molecules (those with no carbon) be removed? default = FALSE.
mw.range: vector. numeric vector of length 2 giving the minimum and maximum molecular weight to retain. default = c(50, 2000).
get.properties: logical. if TRUE (default), will return the rcdk calculated properties named in rcdk.desc (by default XLogP, acidic group count, basic group count, and TPSA).
threads: integer. how many threads to use when calculating rcdk properties. Parallel processing is done via the doParallel and foreach packages.
rcdk.desc: vector. character vector of valid rcdk descriptors. default = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"). To see descriptor categories: 'dc <- rcdk::get.desc.categories(); dc'. To see the descriptors within one category: 'dn <- rcdk::get.desc.names(dc[4]); dn'. Note that the four default descriptors are relatively fast to calculate; some descriptors take a very long time. You can calculate as many as you wish, but processing time increases with each descriptor added.
cid.lca.object: R data.table, generally produced by build.cid.lca; preferably, define pc.directory
cid.sid.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.pwid.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.parent.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.taxid.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.formula.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.smiles.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.inchikey.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.inchi.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.monoisotopic.mass.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.title.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.cas.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
cid.pmid.ct.object: R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory
output.directory: directory to which the pubchem.bio database is saved. If NULL, will try to save in pc.directory (if provided), else not saved.
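
The filtering arguments above (remove.salts, remove.inorganics, mw.range) can be pictured with a small base R sketch. This is not the package's internal code: the toy table, its column names, the made-up CIDs, the carbon-matching regular expression, and the inclusive treatment of the mw.range bounds are all assumptions made for illustration. Only the "'.' in the SMILES string" rule for salts and the "no carbon" rule for inorganics come from the argument descriptions.

```r
# Toy compound table. Column names and CIDs are hypothetical, chosen only
# for this sketch; masses are approximate monoisotopic values.
toy <- data.frame(
  cid     = c(1L, 2L, 3L, 4L, 5L),
  smiles  = c("CCO", "CC(=O)[O-].[Na+]", "O", "OCC(O)CO", "[Cl-].[Ca+2].[Cl-]"),
  formula = c("C2H6O", "C2H3NaO2", "H2O", "C3H8O3", "CaCl2"),
  mass    = c(46.0419, 82.0031, 18.0106, 92.0473, 109.9003),
  stringsAsFactors = FALSE
)

# remove.salts: a salt is flagged when the SMILES string contains '.'
is.salt <- grepl(".", toy$smiles, fixed = TRUE)

# remove.inorganics: keep molecules whose formula contains carbon.
# Assumed approach: match 'C' not followed by a lowercase letter, so that
# Ca, Cl, Cu, etc. are not mistaken for carbon.
has.carbon <- grepl("C(?![a-z])", toy$formula, perl = TRUE)

# mw.range: keep masses within the range (bounds assumed inclusive here)
mw.range <- c(50, 2000)
in.range <- toy$mass >= mw.range[1] & toy$mass <= mw.range[2]

# Combine the filters, as if remove.salts = TRUE and remove.inorganics = TRUE
keep <- !is.salt & has.carbon & in.range
toy[keep, ]
```

In this toy table only the glycerol row passes all three filters. Note that remove.inorganics defaults to FALSE in build.pubchem.bio, so the carbon filter is applied here only to show what enabling it would do.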

Author

Corey Broeckling

Examples

data('cid.sid', package = "pubchem.bio")
data('cid.pwid', package = "pubchem.bio")
data('cid.parent', package = "pubchem.bio")
data('cid.taxid', package = "pubchem.bio")
data('cid.formula', package = "pubchem.bio")
data('cid.smiles', package = "pubchem.bio")
data('cid.inchikey', package = "pubchem.bio")
data('cid.inchi', package = "pubchem.bio")
data('cid.monoisotopic.mass', package = "pubchem.bio")
data('cid.title', package = "pubchem.bio")
data('cid.cas', package = "pubchem.bio")
data('cid.pmid.ct', package = "pubchem.bio")
data('cid.lca', package = "pubchem.bio")
pc.bio.out <- build.pubchem.bio(
  use.pathways = FALSE, use.parent.cid = FALSE,
  get.properties = FALSE, threads = 1,
  cid.sid.object = cid.sid, cid.pwid.object = cid.pwid,
  cid.parent.object = cid.parent, cid.taxid.object = cid.taxid,
  cid.formula.object = cid.formula, cid.smiles.object = cid.smiles,
  cid.inchikey.object = cid.inchikey, cid.inchi.object = cid.inchi,
  cid.monoisotopic.mass.object = cid.monoisotopic.mass,
  cid.title.object = cid.title, cid.cas.object = cid.cas,
  cid.pmid.ct.object = cid.pmid.ct, cid.lca.object = cid.lca
)
head(pc.bio.out)
