pubchem.bio (version 1.0.5)

build.pubchem.bio: build.pubchem.bio

Description

Builds the pubchem.bio database from local PubChem data that has been downloaded and properly formatted by the 'get.pubchem.ftp' function.

Usage

build.pubchem.bio(
  pc.directory = NULL,
  use.bio.sources = TRUE,
  bio.sources = c("Metabolomics Workbench", "Human Metabolome Database (HMDB)", "ChEBI",
    "LIPID MAPS", "MassBank of North America (MoNA)"),
  use.pathways = TRUE,
  pathway.sources = NULL,
  use.taxid = TRUE,
  taxonomy.sources = NULL,
  use.parent.cid = TRUE,
  use.parent.when.charged = FALSE,
  remove.salts = TRUE,
  remove.inorganics = FALSE,
  mw.range = c(50, 2000),
  get.properties = TRUE,
  threads = 8,
  rcdk.desc = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor",
    "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"),
  cid.lca.object = NULL,
  cid.sid.object = NULL,
  cid.pwid.object = NULL,
  cid.parent.object = NULL,
  cid.taxid.object = NULL,
  cid.formula.object = NULL,
  cid.smiles.object = NULL,
  cid.inchikey.object = NULL,
  cid.inchi.object = NULL,
  cid.monoisotopic.mass.object = NULL,
  cid.title.object = NULL,
  cid.cas.object = NULL,
  cid.pmid.ct.object = NULL,
  output.directory = NULL
)

Value

a data frame containing the PubChem CID, title, formula, monoisotopic molecular weight, InChIKey, SMILES, CAS number, and, optionally, rcdk properties.

Arguments

pc.directory

directory from which to load pubchem .Rdata files. alternatively, provide R data.tables for ALL cid.property.object options defined below.

use.bio.sources

logical. If TRUE (default), use the bio.sources vector of sources, incorporating all CIDs from those bio databases.

bio.sources

vector of source names from which to extract PubChem CIDs. All available source names are listed at https://pubchem.ncbi.nlm.nih.gov/sources/; "PubChemLite" can additionally be used as a data source. defaults to c("Metabolomics Workbench", "Human Metabolome Database (HMDB)", "ChEBI", "LIPID MAPS", "MassBank of North America (MoNA)")

use.pathways

logical. should all CIDs from any biological pathway data be incorporated into database?

pathway.sources

character. vector of sources to be used when adding metabolites to pubchem bio database. default = NULL, using all pathway sources.

use.taxid

logical. should all CIDs associated with a taxonomic identifier (taxid) be used?

taxonomy.sources

character. vector of sources to be used when adding taxonomically related metabolites to database. Default = NULL, using all sources.

use.parent.cid

logical. should CIDs be replaced with parent CIDs? default = TRUE.

use.parent.when.charged

logical. default = FALSE. Only has an effect when use.parent.cid = TRUE. If TRUE, the parent CID is always chosen. If FALSE, the neutral molecule is used, even if that is the child molecule. See CID 1 and CID 2 for an example.

remove.salts

logical. should salts be removed from dataset? default = TRUE. salts recognized as '.' in smiles string. performed after 'use.parent.cid'.

remove.inorganics

logical. should inorganic molecules (those with no carbon) be removed? default = FALSE.

mw.range

vector. numerical vector of length = 2. default = c(50, 2000).

get.properties

logical. if TRUE, will return rcdk calculated properties: XLogP, TPSA, HBondDonorCount and HBondAcceptorCount.

threads

integer. how many threads to use when calculating rcdk properties. Parallel processing is done via the doParallel and foreach packages.

rcdk.desc

vector. character vector of valid rcdk descriptors. default = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"). To see descriptor categories: 'dc <- rcdk::get.desc.categories(); dc'. To see the descriptors within one category: 'dn <- rcdk::get.desc.names(dc[4]); dn'. Note that the four default descriptors are relatively fast to calculate, while some descriptors take a very long time. You can calculate as many as you wish, but processing time increases with each descriptor added.

cid.lca.object

R data.table, generally produced by build.cid.lca; preferably, define pc.directory

cid.sid.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.pwid.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.parent.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.taxid.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.formula.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.smiles.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.inchikey.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.inchi.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.monoisotopic.mass.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.title.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.cas.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

cid.pmid.ct.object

R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory

output.directory

directory to which the pubchem.bio database is saved. If NULL, will try to save in pc.directory (if provided), else not saved.
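
The filtering arguments remove.salts, remove.inorganics and mw.range can be illustrated on a small toy table. The following is a minimal sketch in base R, not the package's implementation. The helper names (is.salt, has.carbon, in.mw.range, filter.toy), the toy column names, and the inclusive treatment of the mw.range bounds are all assumptions made for illustration. Only the '.'-in-SMILES salt rule and the "no carbon" definition of inorganic come from the argument descriptions above.

```r
# Toy compound table. Column names are hypothetical, chosen for this sketch only;
# masses are approximate monoisotopic masses.
toy <- data.frame(
  name = c("ethanol", "glucose", "sodium chloride",
           "sodium acetate", "water", "phosphoric acid"),
  smiles = c("CCO", "OCC(O)C(O)C(O)C(O)C=O", "[Na+].[Cl-]",
             "CC(=O)[O-].[Na+]", "O", "OP(=O)(O)O"),
  formula = c("C2H6O", "C6H12O6", "ClNa", "C2H3NaO2", "H2O", "H3O4P"),
  monoisotopic.mass = c(46.0419, 180.0634, 57.9586, 82.0031, 18.0106, 97.9769),
  stringsAsFactors = FALSE
)

# Salts are recognized by a '.' in the SMILES string (as documented above).
is.salt <- function(smiles) grepl(".", smiles, fixed = TRUE)

# Assumed carbon test: a 'C' that is not followed by a lowercase letter,
# so that Cl, Ca, Cu, etc. are not mistaken for carbon.
has.carbon <- function(formula) grepl("C(?![a-z])", formula, perl = TRUE)

# Assumed inclusive mass window.
in.mw.range <- function(mass, mw.range) {
  mass >= min(mw.range) & mass <= max(mw.range)
}

# Hypothetical helper that applies the three filters with the documented defaults.
filter.toy <- function(d, remove.salts = TRUE, remove.inorganics = FALSE,
                       mw.range = c(50, 2000)) {
  keep <- in.mw.range(d$monoisotopic.mass, mw.range)
  if (remove.salts) keep <- keep & !is.salt(d$smiles)
  if (remove.inorganics) keep <- keep & has.carbon(d$formula)
  d[keep, , drop = FALSE]
}

filter.toy(toy)                            # defaults
filter.toy(toy, remove.inorganics = TRUE)  # additionally drop carbon-free rows
```

With the default settings, phosphoric acid passes the toy filter because remove.inorganics defaults to FALSE; setting it to TRUE drops that row as well.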

Author

Corey Broeckling

Examples

data('cid.sid', package = "pubchem.bio")
data('cid.pwid', package = "pubchem.bio")
data('cid.parent', package = "pubchem.bio")
data('cid.taxid', package = "pubchem.bio")
data('cid.formula', package = "pubchem.bio")
data('cid.smiles', package = "pubchem.bio")
data('cid.inchikey', package = "pubchem.bio")
data('cid.inchi', package = "pubchem.bio")
data('cid.monoisotopic.mass', package = "pubchem.bio")
data('cid.title', package = "pubchem.bio")
data('cid.cas', package = "pubchem.bio")
data('cid.pmid.ct', package = "pubchem.bio")
data('cid.lca', package = "pubchem.bio")
pc.bio.out <- build.pubchem.bio(
  use.pathways = FALSE, use.parent.cid = FALSE,
  get.properties = FALSE, threads = 1,
  cid.sid.object = cid.sid, cid.pwid.object = cid.pwid,
  cid.parent.object = cid.parent, cid.taxid.object = cid.taxid,
  cid.formula.object = cid.formula, cid.smiles.object = cid.smiles,
  cid.inchikey.object = cid.inchikey, cid.inchi.object = cid.inchi,
  cid.monoisotopic.mass.object = cid.monoisotopic.mass,
  cid.title.object = cid.title, cid.cas.object = cid.cas,
  cid.pmid.ct.object = cid.pmid.ct, cid.lca.object = cid.lca
)
head(pc.bio.out)
