goftest: Assessing goodness of fit of inference using simulation

Description

A goodness-of-fit test is performed when projected statistics have been used for inference. Otherwise, some plots of limited interest are produced.

The summary and print methods for results of goftest call str to display the structure of the result.

Usage

goftest(object, nsim = 99L, method = "", stats=NULL, plot. = TRUE, nb_cores = NULL, 
        Simulate = get_from(object,"Simulate"), 
        control.Simulate=get_from(object,"control.Simulate"),
        packages = get_from(object,"packages"), 
        env = get_from(object,"env"), verbose = interactive(),
        cl_seed=.update_seed(object), get_gof_stats=.get_gof_stats)

Value

An object of class goftest, which is a list with element(s)

pval: The p-value of the test (NULL if the test is not feasible).
plotframe: The data frame which is (by default) plotted by the function. Its last row contains the residuals u for the analyzed data, and the other rows contain the bootstrap replicates.

Arguments

object: an SLik or SLik_j object.
nsim: Number of draws of summary statistics.
method: For development purposes, not documented.
stats: Character vector, or NULL: the set of summary statistics to be used to construct the test. If NULL, the union, across all projections, of the raw summary statistics used for projections is potentially used for goodness of fit; however, if this set is too large for Gaussian mixture modelling, a subset of variables may be selected. How they are selected is not yet fully settled (see Details).
plot.: Controls diagnostic plots. plot. can be of logical, character or numeric type. If plot. is FALSE, no plot is produced. If plot. is TRUE (the default), a data frame of up to eight goodness-of-fit statistics (the statistics denoted u in Details) is plotted. If more than eight raw summary statistics (denoted s in Details) were used, then only the first eight u are retained (see Details for the ordering of the u statistics here). If plot. is a numeric vector, then u[plot.] are retained (possibly more than eight statistics, as in the next case). If plot. is a character vector, then it is used to match the names of the u statistics (not of s) to be retained in the plot; the names of u are built from the names of s by wrapping the latter within "Res(" and ")" (see the axis labels of default plots for examples of valid names).
nb_cores, Simulate, packages, env, verbose: See same-named add_simulation arguments.
control.Simulate: A list of arguments of the Simulate function (see add_simulation). The default value should generally be used, unless e.g. it contains the path of an executable on one machine and a different path must be specified on another machine.
cl_seed: NULL or integer (see refine for Details).
get_gof_stats: function for selecting raw statistics (see Details).
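
The naming and selection rules described for plot. can be illustrated with a small base-R sketch. It does not call goftest: the names of the raw statistics and the data frame of residuals are hypothetical stand-ins, used only to show how numeric and character selections relate to the "Res(...)" names.

```r
# Hypothetical names of raw summary statistics s (not taken from any real fit)
s_names <- c("mean", "var", "skew")

# Names of the u statistics: names of s wrapped within "Res(" and ")"
u_names <- paste0("Res(", s_names, ")")

# Stand-in data frame of residuals u (random numbers, for illustration only)
set.seed(1)
u <- as.data.frame(matrix(rnorm(30), ncol = 3))
names(u) <- u_names

# A numeric plot. would retain u[plot.], i.e. selection by position
sel_num <- u[c(1, 3)]

# A character plot. is matched against the names of u, not of s
sel_chr <- u[intersect(c("Res(mean)", "Res(skew)"), names(u))]

names(sel_num)
names(sel_chr)
```

Both selections retain the same two columns here; a character selection is simply less sensitive to the ordering of the u statistics.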

Details

Testing goodness-of-fit: The test is somewhat heuristic but appears to give reasonable results (the Examples show how this can be verified). It assumes that all summary statistics are reduced to projections predicting all model parameters. It then proceeds as if any projection p predicting a parameter were a sufficient statistic for this parameter, given the information contained in the summary statistics s (this is certainly the ideal objective of machine-learning regression methods). A statistic u independent (under the fitted model) of all projections should then be a suitable statistic for testing goodness of fit: if the model is correctly specified, the quantile of the observed u, in the distribution of u under the fitted model, should be uniformly distributed over repeated sampling under the data-generating process.

The procedure constructs statistics uncorrelated with all p (over repeated sampling under the fitted model) and proceeds as if they were independent of p (rather than simply uncorrelated). A number (depending on the size of the reference table) of statistics u uncorrelated with p are thus defined. Each such statistic is obtained as the residual of the regression of a given raw summary statistic on all projections, where the regression input is a simulation table of nsim replicates of s under the fitted model, and of their projections p (using the “projectors” constructed from the full reference table). This regression involves one more, small-nsim, approximation (as it is the sample correlation that is zeroed), but using the residuals is crucially better than using the original summary statistics (as some ABC software may do).

An additional feature of the procedure is that it constructs a single test statistic t from the joint residuals u, by estimating their joint distribution (using Gaussian mixture modelling) and letting t be the density of u in this distribution.
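
The construction of the residuals u and of the statistic t can be sketched in base R. This is a toy illustration under stated assumptions, not the code used by goftest: the fitted model is a normal sample, the "projections" p are stand-ins (the sample mean and variance rather than predictions by projectors), linear regression is used, a single multivariate Gaussian replaces the Gaussian mixture model, and the rank-based p-value at the end is one plausible choice rather than a documented one.

```r
set.seed(42)
nsim <- 99L

# Toy "fitted model" (assumption): samples of size 40 from N(mean = 4, sd = 1)
simulate_s <- function() {
  y <- rnorm(40, mean = 4, sd = 1)
  c(mean = mean(y), var = var(y), median = median(y), mad = mad(y))
}

# Simulation table: nsim replicates under the fitted model,
# followed by the "observed" statistics as last row
s_table <- as.data.frame(t(replicate(nsim + 1L, simulate_s())))
sim_rows <- seq_len(nsim)
obs_row <- nsim + 1L

# Stand-in projections p (assumption: not produced by actual projectors)
p <- s_table[, c("mean", "var")]

# Residuals u: regress each remaining raw statistic on all projections,
# fitting on the simulated rows only, then take residuals for all rows
raw <- c("median", "mad")
u <- sapply(raw, function(st) {
  d <- data.frame(y = s_table[[st]], p)
  fit <- lm(y ~ mean + var, data = d[sim_rows, ])
  d$y - predict(fit, newdata = d)
})
colnames(u) <- paste0("Res(", raw, ")")

# The sample correlation between u and p is zeroed on the simulated rows
max_abs_cor <- max(abs(cor(u[sim_rows, ], as.matrix(p[sim_rows, ]))))

# Test statistic t: (log-)density of u in its estimated joint distribution.
# A single Gaussian is fitted here, so the log-density equals
# -0.5 * squared Mahalanobis distance up to an additive constant.
log_dens <- -0.5 * mahalanobis(u,
                               center = colMeans(u[sim_rows, ]),
                               cov = cov(u[sim_rows, ]))

# Rank-based p-value (assumption): small when the observed u lies
# in a low-density region of the simulated distribution
pval <- (1 + sum(log_dens[sim_rows] <= log_dens[obs_row])) / (nsim + 1)
pval
```

Because the "observed" row is drawn from the same toy model as the simulated rows, repeating this sketch with different seeds should give approximately uniform p-values.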

Selection of raw summary statistics: See the code of the Infusion:::.get_gof_stats function for the method used. It requires that ranger has been used to produce the projectors, and that the latter include variable importance statistics (by default, Infusion calls ranger with argument importance="permutation"). .get_gof_stats then selects the raw summary statistics with least importance over projections (this may not be optimal, and in particular appears redundant with the procedure described above to construct goodness-of-fit statistics from raw summary statistics; so this might change in a later version), and returns a vector of names of raw statistics, sorted by increasing least-importance. The number of summary statistics can be controlled by the global package option gof_nstats_fn, a function with arguments nr and nstats for, respectively, the number of simulations of the process (as controlled by goftest(.,nsim)) and the total number of raw summary statistics used in the projections.
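
As an illustration of the signature expected for gof_nstats_fn, here is a minimal sketch of a function with arguments nr and nstats. The rule it implements is invented for this example and is not the package default; the name my_gof_nstats_fn is hypothetical, and registering the function as the global package option is not shown.

```r
# Hypothetical rule, NOT the package default: keep at most one statistic
# per 20 simulated replicates, never more than are available, and at least one
my_gof_nstats_fn <- function(nr, nstats) {
  max(1L, min(as.integer(nstats), as.integer(nr %/% 20L)))
}

n_small <- my_gof_nstats_fn(nr = 99L, nstats = 12L)    # limited by nr
n_large <- my_gof_nstats_fn(nr = 1000L, nstats = 12L)  # limited by nstats
```

Tying the number of retained statistics to nr reflects the concern, stated in the description of the stats argument, that too many statistics make Gaussian mixture modelling impractical.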

The diagnostic plot will show a data frame of residuals u of the summary statistics identified as the first elements of the vector returned by Infusion:::.get_gof_stats, i.e. again a set of raw statistics with least importance over projectors.

Examples


### See end of example("example_reftable") for minimal example.

if (FALSE) {
### Performance of GoF test over replicate draws from data-generating process

# First, run 
example("example_reftable") 
# (at least up to the final 'slik_j' object), then

# as a shortcut, the same projections will be used in all replicates:
dprojectors <- slik_j$projectors 

set.seed(123)
gof_draws <- replicate(200, {
  cat(" ")
  dSobs <- blurred(mu=4,s2=1,sample.size=40) 
  ## ----Inference workflow-----------------------------------------------
  dprojSobs <- project(dSobs,projectors=dprojectors)
  dslik <- infer_SLik_joint(dprojSimuls,stat.obs=dprojSobs,verbose=FALSE)
  dslik <- MSL(dslik, verbose=FALSE, eval_RMSEs=FALSE)
  ## ----GoF test-----------------------------------------------
  gof <- goftest(dslik,nb_cores = 1L, plot.=FALSE,verbose=FALSE) 
  cat(unlist(gof))
  gof
})
# ~ uniform distribution under correctly-specified model: 
plot(ecdf(unlist(gof_draws)))
}
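
The visual check of uniformity above can be complemented by a numerical one. The following sketch uses base R only and a stand-in vector of p-values drawn with runif, not actual goftest output; with real results, pvals would be replaced by the collected p-values.

```r
set.seed(2024)

# Stand-in for the collected p-values: 200 uniform draws (assumption)
pvals <- runif(200)

# Kolmogorov-Smirnov test against the uniform distribution on [0, 1]
ks <- ks.test(pvals, "punif")
ks$p.value

# Under a correctly specified model, a nominal 5% test should reject
# in roughly 5% of the replicates
rejection_rate <- mean(pvals <= 0.05)
rejection_rate
```

If the p-values take only a few distinct values, ks.test will warn about ties, in which case the rejection rate at a few nominal levels may be the more informative summary.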
