Testing goodness-of-fit: The test is somewhat heuristic but appears to give reasonable results (the Example shows how this can be verified). It assumes that all summary statistics are reduced to projections predicting all model parameters. It is then conceived as if any projection p predicting a parameter were a sufficient statistic for this parameter, given the information contained in the summary statistics s (this is certainly the ideal objective of machine-learning regression methods). Then a statistic u independent (under the fitted model) from all projections should be a suitable statistic for testing goodness of fit: if the model is correctly specified, the quantile of observed u, in the distribution of u under the fitted model, should be uniformly distributed over repeated sampling under the data-generating process. The procedure constructs statistics uncorrelated to all p (over repeated sampling under the fitted model) and proceeds as if they were independent from p (rather than simply uncorrelated). A number (depending on the size of the reference table) of statistics u uncorrelated to p are then defined. Each such statistic is obtained as the residual of the regression of a given raw summary statistic to all projections, where the regression input is a simulation table of nsim
replicates of s under the fitted model, and of their projections p (using the “projectors” constructed from the full reference table). The latter regression involves one more, small-nsim
, approximation (as it is the sample correlation that is zeroed) but using the residuals is crucially better than using the original summary statistics (as some ABC software may do). An additional feature of the procedure is to construct a single test statistic t from joint residuals u, by estimating their joint distribution (using Gaussian mixture modelling) and letting t be the density of u in this distribution.
Selection of raw summary statistics: See the code of the Infusion:::..get_gof_stats
function for the method used. It requires that ranger
has been used to produce the projectors, and that the latter include variable importance statistics (by default, Infusion calls ranger
with argument importance="permutation"
). .get_gof_stats
then selects the raw summary statistics with least importance over projections (this may not be optimal, and in particular appears redundant with the procedure described below to construct goodness-of-fit statistics from raw summary statistics; so this might change in a later version), and returns a vector of names of raw statistics, sorted by increasing least-importance. The number of summary statistics can be controlled by the global package option gof_nstats_fn
, a function with arguments nr
and nstats
for, respectively, the number of simulations of the processus (as controlled by goftest(.,nsim)
) and the total number of raw summary statistics used in the projections.
The diagnostic plot will show a data frame of residuals u of the summary statistics identified as the first elements of the vector returned by Infusion:::..get_gof_stats
, i.e. again a set of raw statistics with least-importance over projectors.