QSPRpred-class: QSPRpred class

Description

Quantitative Structure-Property Relationship (QSPR) model construction. This class provides the functions needed to train linear and non-linear models, to produce bootstrap datasets for variance estimation, and to predict a matrix or vector of studied properties.

Arguments

smis

is a list of vectors of SMILES from which a regression model will be trained, or for which targeted properties will be predicted.

prop

is a list of vectors/matrices of available targeted physico-chemical properties for the training dataset.

v_filterfunc

defines the function (NULL by default) used to compute the properties on which filtering is applied.

v_filtermin

is a vector representing the expected minimal value for each filtered property.

v_filtermax

is a vector representing the expected maximal value for each filtered property.

v_fnames

is a vector, or a list of vectors, of fingerprint and/or physical descriptor types used as features for each regression model (see get_descriptor for an exhaustive list of available descriptors).

v_scale

enables (FALSE by default) scaling of the physical descriptors only (i.e. continuous features) to mean = 0 and standard deviation = 1.
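As a rough illustration of what v_scale = TRUE implies, the base-R sketch below standardizes only the continuous columns of a small feature matrix and leaves binary fingerprint bits untouched. The matrix, its column names, and the choice of which columns count as continuous are invented for the example; this is not the package's internal code.

```r
# Hypothetical feature matrix: two fingerprint bits and two continuous descriptors
X <- cbind(fp1  = c(0, 1, 1, 0),
           fp2  = c(1, 1, 0, 0),
           logp = c(1.2, 3.4, 2.2, 0.8),
           mw   = c(120, 250, 180, 95))
continuous <- c("logp", "mw")  # assumption: columns treated as physical descriptors

# Column means and standard deviations (analogous to the mdesc and sddesc fields)
mdesc  <- colMeans(X[, continuous])
sddesc <- apply(X[, continuous], 2, sd)

# Standardize the continuous columns only; fingerprint bits are left as they are
X_scaled <- X
X_scaled[, continuous] <- sweep(sweep(X[, continuous], 2, mdesc, "-"), 2, sddesc, "/")

X_scaled
```

Keeping the means and standard deviations from the training data matters because the same values would have to be reused when scaling new molecules at prediction time.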

v_func

defines the analytic function (NULL by default), or a list of analytic functions, to use in the computation of a subsequent property, or properties respectively. A given function returns a new property computed analytically from a list of known properties in prop. This is particularly useful when data and regression models are available for some properties (e.g. A and B), but not for a targeted property of interest (e.g. A+B, A/B, etc.) for which constraints are defined via the set_target method.

v_func_args

is a vector, or a list of vectors, of integers that identifies which properties in prop are used for the computation of a subsequent property. For example, take v_func=list(func1,func2), where func1 and func2 are previously defined functions, and prop=list(V1,M23), where V1 is a numerical vector and M23 is a two-column matrix, giving three properties in total. With v_func_args=list(c(1,3),c(2)), the function func1 uses the 1st and 3rd properties located in prop, and func2 uses the 2nd only. In this way, each defined function knows where to find its inputs.
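The indexing described above can be sketched in plain R. Here V1, M23, func1, and func2 are placeholder data and functions, and the assumption that each function receives its selected property columns as a matrix is made only for this illustration; the package may pass arguments differently.

```r
# Placeholder properties: one vector and one two-column matrix (three properties)
V1  <- c(1, 2, 3)
M23 <- cbind(c(10, 20, 30), c(100, 200, 300))
prop <- list(V1, M23)

# Placeholder analytic functions (assumed to take a matrix of selected columns)
func1 <- function(p) p[, 1] + p[, 2]  # e.g. A + B
func2 <- function(p) p[, 1]^2         # e.g. A^2
v_func      <- list(func1, func2)
v_func_args <- list(c(1, 3), c(2))

# Flatten prop into one n x 3 matrix so that the integer tags address columns
Y <- do.call(cbind, prop)

# Apply each function to the columns named by its tag vector
derived <- mapply(function(f, idx) f(Y[, idx, drop = FALSE]),
                  v_func, v_func_args, SIMPLIFY = FALSE)
derived
```

In this toy case func1 combines V1 with the second column of M23 (properties 1 and 3), while func2 only sees the first column of M23 (property 2).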

kekulise

enables (FALSE by default) electron checking and allows for parsing of incorrect SMILES (see parse.smiles).

model

is the name of a regression model to be used (see get_Models for an exhaustive list).

params

is a list of parameters to submit to a given regression model (see get_Model_params for examples).

n_boot

is the number of requested bootstrap datasets in the training process (10 by default, per the model_training signature). The bootstrap models are used to estimate the means and standard deviations of subsequent non-Bayesian predictions. A higher number of bootstrap datasets gives a more accurate estimate, at the cost of a longer computation time, so the user has to choose a suitable trade-off. To ease the bootstrap analysis, a parallelization capability is implemented.

s_boot

is the proportion of input data (0.85 by default), in the interval (0, 1], used to construct each bootstrap dataset.

r_boot

allows (FALSE by default) the sampling in a bootstrap analysis to be performed with replacement.
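The roles of n_boot, s_boot, and r_boot can be sketched with base R. This is a minimal illustration on simulated data, with lm standing in for whichever regression model is selected; the data, the variable names, and the resampling code are assumptions and do not reproduce the package's implementation.

```r
set.seed(1)  # fixed seed so the illustration is reproducible

# Toy data standing in for one feature (x) and one property (y)
n <- 200
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.1)
dat    <- data.frame(x = x, y = y)
newdat <- data.frame(x = c(0.25, 0.75))

n_boot <- 10     # number of bootstrap datasets
s_boot <- 0.85   # proportion of the data drawn for each dataset
r_boot <- FALSE  # draw without replacement

# One row of predictions per bootstrap model
preds <- t(sapply(seq_len(n_boot), function(b) {
  idx <- sample(n, size = floor(s_boot * n), replace = r_boot)
  fit <- lm(y ~ x, data = dat[idx, ])  # lm is a stand-in for the chosen model
  predict(fit, newdata = newdat)
}))

# Mean and standard deviation of the predictions across bootstrap models
boot_mean <- colMeans(preds)
boot_sd   <- apply(preds, 2, sd)
```

With only a handful of bootstrap models the standard deviation is itself a noisy estimate, which is the accuracy versus computation time trade-off mentioned for n_boot.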

parallelize

enables (FALSE by default) parallel execution of the bootstrap analysis. When enabled, N-1 cores are used, where N is the total number of cores available on the machine.
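The N-1 rule can be reproduced with the parallel package that ships with R. Whether iqspr computes its worker count exactly this way is an assumption; the snippet only shows how such a count could be derived safely.

```r
library(parallel)

# detectCores() may return NA on some platforms, so fall back to a single worker
n_total   <- detectCores()
n_workers <- if (is.na(n_total)) 1L else max(1L, n_total - 1L)
n_workers
```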

v_propmin

is a vector representing the expected minimal value for each targeted property.

v_propmax

is a vector representing the expected maximal value for each targeted property.

temp

is a vector/matrix of numerical values which sets the initial temperatures in the annealing process for the sequential Monte-Carlo sampler (see vignette("tutorial", package = "iqspr") for details).

Fields

propndim

is the number of properties received as input data.

propmin

is a vector representing the expected minimal value for each targeted property.

propmax

is a vector representing the expected maximal value for each targeted property.

filtermin

is a vector representing the expected minimal value for each filtered property.

filtermax

is a vector representing the expected maximal value for each filtered property.

filterfunc

is a function to compute the properties to filter.

X

is the nxd matrix, with d features for n input SMILES, returned by get_descriptor.

Y

is a nxp matrix of p properties for n input SMILES.

fnames

is a list of vectors of fingerprint and/or physical descriptor types used as features in each regression model by get_descriptor.

mdesc

is a scalar or vector of means used for physical descriptors scaling, returned by get_descriptor.

sddesc

is a scalar or vector of standard deviations used for physical descriptors scaling, returned by get_descriptor.

scale

indicates (TRUE or FALSE) whether the physical descriptors only (i.e. continuous features) are scaled to mean = 0 and standard deviation = 1.

func

defines the analytic function to use in the computation of a subsequent property.

func_args

is a vector of integers that identifies which columns of the property array prop are used for the computation of a subsequent property.

trmodel

is the name of the used regression model for training and predictions.

trnboot

is the number of bootstrap datasets used for the training.

trndf

is the number of input SMILES, i.e. the number of degrees of freedom, available in the training of the regression process.

Methods

get_features()

returns a list of nxd matrix X with d features for n input SMILES

get_props()

returns a list of nxp matrix Y of p properties for n input SMILES

init_env(smis = NULL, prop = matrix(0), v_filterfunc = NULL,
  v_filtermin = NULL, v_filtermax = NULL, v_fnames = NULL,
  v_scale = FALSE, v_func = NULL, v_func_args = NULL, kekulise = F)

initializes the QSPR predictor; called implicitly via the QSPRpred$new() method

iqspr_predict(smis = NULL, temp = c(1, 1))

predicts properties for input SMILES from a given regression model and evaluates the probability of reaching the targeted property space

model_training(model = "linear_Bayes", params = NA, n_boot = 10,
  s_boot = 0.85, r_boot = F, parallelize = F)

trains regression models with the given parameters, optionally using a bootstrap approach and CPU parallelization

qspr_predict(smis = NULL)

predicts properties for input SMILES from a given regression model

set_target(v_propmin, v_propmax)

sets the targeted properties space in vectors propmin and propmax
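To make the idea of a probability of reaching a targeted property space concrete, the sketch below assumes that each property prediction is Gaussian and independent of the others, with a mean and standard deviation (for instance estimated from bootstrap models), and computes the probability of falling inside the target box. All numbers are invented, and the Gaussian, independent form is an assumption for illustration rather than a description of what iqspr_predict does internally; the temp argument is not modeled here.

```r
# Target box, in the same form as set_target(v_propmin, v_propmax)
v_propmin <- c(8, 100)
v_propmax <- c(9, 200)

# Invented predictive means and standard deviations for one molecule
pred_mean <- c(8.5, 150)
pred_sd   <- c(0.5, 50)

# Per-property probability of landing between the lower and upper bounds
p_each <- pnorm(v_propmax, pred_mean, pred_sd) -
          pnorm(v_propmin, pred_mean, pred_sd)

# Joint probability under the independence assumption
p_joint <- prod(p_each)
```

Both intervals here span one standard deviation on each side of the mean, so each per-property probability is about 0.68.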

Examples

# NOT RUN {
# Load the package and its pre-existing data
library(iqspr)
data(qspr.data)
# Define input SMILES
smis <- paste(qspr.data[,1])
# Define associated properties
prop <- qspr.data[,c(2,5)]
# Define training set
trainidx <- sample(1:nrow(qspr.data), 5000)
# Initialize the prediction environment
# and compute fingerprints/descriptors associated to input SMILES
qsprpred_env <- QSPRpred()
qsprpred_env$init_env(smis=smis[trainidx], prop=as.matrix(prop[trainidx,]), v_fnames="graph")
# Train a regression model with associated parameters
# on 10 bootstrapped datasets, without CPU parallelization
qsprpred_env$model_training(model="elasticnet",params=list("alpha" = 0.5),n_boot=10,parallelize=FALSE)

# Predict properties for a test set
predictions <- qsprpred_env$qspr_predict(smis[-trainidx])
# Plot the results
par(mfrow=c(1,2))
plot(predictions[[1]][1,], prop[-trainidx,1], xlab="prediction", ylab="true")
segments(-100,-100,1000,1000,col=2,lwd=2)
plot(predictions[[1]][2,], prop[-trainidx,2], xlab="prediction", ylab="true")
segments(-100,-100,1000,1000,col=2,lwd=2)

# Set a targeted properties space
qsprpred_env$set_target(c(8,100),c(9,200))
# Predict properties for any input SMILES
# and their probability of being close to the targeted property space
inv_pred <- qsprpred_env$iqspr_predict(smis = smis[-trainidx], temp=c(3,3))

# See vignette("tutorial", package = "iqspr") for further options and details.

# }
