refine: Refine estimates iteratively

Description

refine is a generic function with methods for objects of the classes produced by MSL. In the up-to-date workflow, it can automatically (1) define new parameter points, (2) add simulations to the reference table for these points, (3) optionally recompute projections, (4) update the inference of the likelihood surface, and (5) provide new point estimates, confidence intervals, and other results of an MSL call. It can repeat these steps iteratively as controlled by its workflow_design. Although it has many control arguments, only a few are likely to be needed in any given application. In particular, when given only the current fit object as input, it is designed to use reasonable default controls for the number of iterations, the number of points added in each iteration, and whether to update projections.

reproject and recluster are wrappers for refine(..., ntot=0L). They update the object after either recomputing the projections (reproject) or only re-performing the multivariate Gaussian mixture clustering (recluster).

Usage

# S3 method for SLik
refine(object, method=NULL, ...)


# S3 method for default
refine(
    object, 
    ##       reference table simulations  
    Simulate = attr(surfaceData,"Simulate"),
    control.Simulate = attr(surfaceData,"control.Simulate"),
    newsimuls = NULL,
    ##       CIs
    CIs = workflow_design$reftable_sizes[useCI], 
    useCI = prod(dim(object$logLs))

Value

refine returns an updated SLik or SLik_j object, unless both newsimuls and Simulate arguments are NULL, in which case a data frame of parameter points is returned.

Arguments

object: an SLik or SLik_j object

## reference table simulations

Simulate

Character string: name of the function used to simulate samples. As it is typically stored in the object this argument does not need to be explicitly given; otherwise this should be the same function provided to add_reftable, whose documentation details the design requirements. The only meaningful non-default value is NULL, in which case refine may return (if newsimuls is also NULL) a data frame of parameter points on which to run a simulation function.

control.Simulate

A list of arguments of the Simulate function (see add_simulation). The default value should be used unless you understand enough of its structure to modify it wisely (e.g., it may contain the path of an executable on one machine and a different path may be specified to refine a fit on another machine).

newsimuls

For the SLik_j method, a matrix or data frame, with the same parameters and summary statistics as the data of the original infer_SLik_joint call.

For other methods, a list of simulated distributions of summary statistics, in the same format as for add_simulation. If no such list is provided (i.e., if newsimuls remains NULL), the function extracted by get_from(object,"Simulate") is used (it is inherited from the Simulate argument of add_simulation through the initial sequence of calls of functions add_simulation, infer_logLs or infer_tailp, and infer_surface). If no such function is available, then this function returns the parameter points for which new distributions should be provided by the user.

## CIs

CIs

Boolean, or boolean vector, or numeric (preferably integer) vector: controls whether to infer bounds of (one-dimensional, profile) confidence intervals. The numeric vector form specifies the reference table size(s) for which CIs should be computed when these sizes are first reached. TRUE or FALSE will force or inhibit computation in all iterations. Finally (and probably less useful), a boolean vector such as CIs=c(TRUE,FALSE,TRUE) requests computation of CIs when the number of points cumulatively added reaches the target number of points for the first, third, and any subsequent iterations up to maxit (in certain cases these may differ from the actual first, third, and so on, iterations: see Details).

The default for refine is described in the Details. The default for reproject is to update the CIs if there are computed ones within the input object.

useCI

whether to perform RMSE computations for inferred confidence interval points.

level

Intended coverage of confidence intervals

## workflow design

workflow_design

A list structured as the return value of get_workflow_design. The default value makes reference to elements of the input object's colTypes element.

maxit

Maximum number of iterative refinements (see also precision argument).

ntot

NULL or numeric: control of the total number of simulated samples (one for each new parameter point) to be added to the reference table over the maxit iterations. See Details for the rules used to determine the number of points added in each iteration. Reasonable default values are defined for ntot and maxit (see Details), so that beginners (and ideally, even more advanced users) do not have to find good values.

ntot=0L may be used to re-generate the projectors or the clustering without augmenting the reference table.

n

NULL or numeric, for a number of parameter points (excluding replicates and confidence interval points in the primitive workflow), whose likelihood should be computed in each iteration (see n argument of sample_volume). This is a slightly less intuitive alternative to the ntot specification, as there is at least one iteration where the actual number of added points is not the nominal n (see Details). n=0L will have the same effect as ntot=0L.

## termination conditions

precision: Requested local precision of surface estimation, in terms of prediction standard errors (RMSEs) of both the maximum summary log-likelihood and the likelihood ratio at any CI bound available. Iterations stop when maxit is reached, or when the RMSEs have been computed for the object (see eval_RMSEs argument) and this precision is reached for the RMSEs. A given precision on the CI bounds themselves might seem more interesting, but is not well specified by a single precision parameter if the parameters are on widely different scales.
eval_RMSEs: Same usage as for CIs; controls the eval_RMSEs argument of MSL in each iteration. See Details for the default. The default for reproject is to update the RMSEs if there are computed ones within the input object.
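
The termination rule described for precision can be summarised by the following minimal sketch. The helper should_stop is hypothetical and not part of the package; it assumes that "precision is reached" means every available RMSE is at most precision, and it ignores the case (see Details) where extra iterations are run because fewer than ntot points have been added.

```r
# Hypothetical helper (not a package function): mirrors the documented stopping rule.
should_stop <- function(it, maxit, precision, RMSEs = NULL) {
  if (it >= maxit) return(TRUE)       # iteration budget exhausted
  if (is.null(RMSEs)) return(FALSE)   # RMSEs not evaluated: precision cannot be checked
  all(RMSEs <= precision)             # assumption: all available RMSEs must meet the precision
}

# RMSEs for the maximum logL and for two CI bounds (made-up values, for illustration only)
should_stop(it = 2, maxit = 5, precision = 0.1, RMSEs = c(0.05, 0.08, 0.09))
# No RMSEs were computed in this iteration, so only maxit can stop the loop
should_stop(it = 2, maxit = 5, precision = 0.1, RMSEs = NULL)
```

This illustrates why eval_RMSEs matters for early termination: without evaluated RMSEs, only maxit can end the iterations.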

## verbosity

verbose: A list as shown by the default, or simply a vector of booleans. verbose$most controls whether to display information about progress and results, except plots; verbose$final controls whether to plot() the final object to show the final likelihood surface (the default is to plot it only in an interactive session and if fewer than three parameters are estimated); verbose$movie controls whether to plot() the updated object in each iteration; verbose$proj controls the verbose argument of project.character; verbose$rparam controls (cryptic) information about generation of new parameter points; verbose$progress_bars controls display of some progress bars. If verbose is an unnamed vector of booleans, its values are interpreted as the first elements of the verbose vector, in the order shown by the default.

## projection controls

update_projectors: Same usage as for CIs; this controls in which iterations the projectors are updated. The default NULL value is strongly recommended. See Details for further explanations.
methodArgs: A list of arguments for the projection method. By default the methodArgs of the original project.character calls are reused over iterations, but elements of the new methodArgs list will be used to update the original methodArgs. Note that the updated list becomes the new default for further iterations.

## Likelihood surface modeling

using: Passed to infer_SLik_joint: a character string used to control the joint-density estimation method, as documented for that function (see method instead for equivalent control in the primitive workflow). The default is to use the same method as in the first iteration, but this argument allows a change of method.
nbCluster: Passed to infer_SLik_joint. The data in the expression for the default value refers to the data argument of the latter function.

## parallelisation

cluster_args: A list of arguments for makeCluster, in addition to makeCluster's spec argument which is in most cases best specified by the nb_cores argument. Cluster arguments allow independent control of parallel computations for the different steps of a refine iteration (see Details; as a rough but effective summary, use only nb_cores when the simulations support it, and see the methodArgs argument if independent control of parallelisation of the projection procedure is needed).
nb_cores: Integer: shortcut for specifying cluster_args$spec for sample simulation.
packages: NULL or a list with possible elements add_simulation and logL_method (the latter for the primitive workflow). These elements should be formatted as the packages arguments of add_simulation and infer_logLs, respectively, wherein they are the additional packages to be loaded on child processes. The effect of the default value of this argument is to pass over successive refine calls the value stored in the input fit object (itself determined by the latest use of the packages argument in, e.g., add_simulation or in previous refines).
env: An environment, passed as the env argument to add_simulation. The default value keeps the pre-refine value over iterations.
cl_seed: NULL or integer, passed to add_simulation. The default code uses an internal function, .update_seed, to update it from a previous iteration.

## others

target_LR: Likelihood ratio threshold used to control the sampling of new points and the selection of points for projections. Do not change it unless you know what you are doing.
method: For the primitive workflow: (a vector of) suggested method(s) for estimation of smoothing parameters (see method argument of infer_surface). The ith element of the vector is used in the ith iteration, if available; otherwise the last element is used. This argument is not always heeded, in that REML may be used if the suggested method is GCV but it appears to perform poorly. The defaults for SLik and SLikp objects are "REML" and "PQL", respectively.
trypoints: A data frame of parameters on which the simulation function get_from(object,"Simulate") should be called to extend the reference table. Only for programming by expert users, because poorly thought-out trypoints could severely affect the inferences.
useEI: for the primitive workflow only: cf this argument in rparam.
surfaceData: for the primitive workflow only: a data.frame with attributes, usually taken from the object and thus not specified by user, usable as input for infer_surface.
rparamFn: Function used to sample new parameter values.
...: further arguments passed to or from other methods. refine passes these arguments to the plot method suitable for the object.

Details

* Controls of exploration of parameter space: New parameter points are sampled so as to fill the space of parameters contained in the confidence regions defined by the level argument, and to surround it by a region sampled proportionally to likelihood.

Each refine call performs several iterations, which stop when ntot points have been added to the simulation table. The target number of points potentially added in each iteration is controlled by the ntot and maxit arguments as described below. However, fewer points may actually be added in an iteration, and more than maxit iterations may be needed to add the ntot points, if in a given iteration too few “good” candidate points are generated according to the internal rules for sampling the parameter region with high likelihood. In that case, the next iteration tries to make up for the missing points by adding more points than the target number, and if not enough points have been added after maxit iterations, further iterations are run.

CIs and RMSEs may be computed in any iteration but the default values of eval_RMSEs and CIs are chosen so as to avoid performing these computations too often, particularly when they are expected to be slow. The default implies that the RMSE for the maximum logL will be computed at the end of each block of iterations that defines a refine (itself defined to reach the reference table sizes specified by the workflow_design and its default value). If the reference table is not too large (see default value of useCI for the precise condition), RMSEs of the logL are also computed at the inferred bounds of profile-based confidence intervals for each parameter.

Although the update_projectors argument allows similar control of the iterations where projections are updated, it is advised to keep it NULL (default value), so that whether projectors are updated in a given iteration is controlled by default internal rules. Setting it to TRUE would induce updating whenever any of the target reference table sizes implied by the workflow_design$subblock_sizes is reached. The default NULL has the same effect, subject to additional conditions: updating may not be performed when the training set is considered too similar to the one used to compute pre-existing projections, or when the training set includes more samples than the limit defined by the global package option upd_proj_subrows_thr.

Default values of ntot and maxit are controlled by the value of the workflow_design, which itself has the shown default value, and are distinct for the first vs. subsequent refines. The target number of points in each iteration is also controlled differently for the first vs. subsequent refines. This design is motivated by the fact that the likelihood surface is typically poorly inferred in the first refine, so that the parameter points sampled then tend to be less relevant than those that can be sampled in later iterations. In the first refine call, the target number of points increases roughly as powers of two over iterations, to reach ntot cumulatively after maxit iterations. The default ntot is twice the size of the initial reference table, and the default maxit is 5. The example_reftable Example illustrates this: the initial reference table holds 200 simulations, and the default target numbers of points to be added in the 5 iterations of the first refine call are 25, 25, 50, 100 and 200. In later refine calls, the target number is ntot/maxit in each iteration.
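
The arithmetic behind these default targets can be checked with a short sketch. The two helper functions below are hypothetical names introduced only for illustration; they reproduce the documented numbers and are not the package's internal code. Reusing the same ntot for the later refine is also only an assumption made to keep the example short.

```r
# Hypothetical helpers (not package functions): reproduce the documented schedules.
first_refine_targets <- function(ntot, maxit) {
  stopifnot(maxit >= 2)
  # First two targets are ntot/2^(maxit-1); the target then doubles in each iteration.
  ntot / 2^c(maxit - 1, seq(maxit - 1, 1))
}
later_refine_targets <- function(ntot, maxit) rep(ntot / maxit, maxit)

init_size <- 200        # size of the initial reference table, as in the example above
ntot <- 2 * init_size   # documented default: twice the initial size

first_targets <- first_refine_targets(ntot, maxit = 5)
first_targets                                       # 25 25 50 100 200
sum(first_targets)                                  # 400, i.e. ntot is reached cumulatively
later_targets <- later_refine_targets(ntot, maxit = 5)
later_targets                                       # 80 80 80 80 80
```

The first refine thus spends few points while the surface is still poorly inferred and most of them in the last iterations, whereas later refines spread the points evenly.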

* Independent control of parallelisation may be needed in the different steps, e.g. if the simulations are not easily parallelised whereas the projection method natively handles parallelisation. In the up-to-date workflow with default ranger projection method, distinct parallelisation controls may be passed to add_reftable for sample simulations, to project methods when projections are updated, and to MSL for RMSE computations (alternatively for the primitive workflow, add_simulation, infer_logLs and MSL are called). The most explicit way of specifying distinct controls is by a list structured as


cluster_args=list(reftable=list(<makeCluster arguments>),
                  RMSEs=list(<makeCluster arguments>))

A project=list(num.threads=<.>) element can be added to this list, providing control of the num.threads argument of ranger functions. However, this is retained mainly for backward compatibility, as the methodArgs argument can now be used to specify num.threads.

Simpler arguments may be used and will be interpreted as follows: nb_cores, if given and not overridden by a spec argument in cluster_args (or in sublists of it), will control simulation and projection steps (but not RMSE computation): that is, nb_cores then gives the number of parallel processes for sample simulation, with additional makeCluster arguments taken from cluster_args, but RMSE computations are performed serially. On the other hand, a spec argument in cluster_args=list(spec=<.>, <other makeCluster arguments>) will instead apply the same arguments to both reference table and RMSE computation, overriding the default effect of nb_cores in both of them.
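
As a rough illustration of these interpretation rules, the sketch below resolves the number of processes used for the reference table and RMSE steps. The function resolve_spec is hypothetical and not part of the package. The projection step is left out, and the precedence of a sublist spec over a top-level spec is an assumption, since the text above does not state it.

```r
# Hypothetical helper (not a package function): which 'spec' applies to each step?
resolve_spec <- function(nb_cores = NULL, cluster_args = list()) {
  top <- cluster_args[["spec"]]
  pick <- function(step, fallback) {
    sub <- cluster_args[[step]][["spec"]]   # NULL when there is no sublist for this step
    if (!is.null(sub)) sub else if (!is.null(top)) top else fallback
  }
  serial <- 1L
  list(
    reftable = pick("reftable", if (is.null(nb_cores)) serial else nb_cores),
    RMSEs    = pick("RMSEs", serial)  # nb_cores alone does not parallelise RMSE computation
  )
}

# nb_cores only: parallel simulation, serial RMSE computation
str(resolve_spec(nb_cores = 4L))
# top-level spec: same setting for both steps, nb_cores is overridden
str(resolve_spec(nb_cores = 4L, cluster_args = list(spec = 2L)))
# fully explicit form, one sublist per step
str(resolve_spec(cluster_args = list(reftable = list(spec = 3L), RMSEs = list(spec = 2L))))
```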

Examples

  ## see Note for links to examples.
