lsmi_cv: Cross-validation to Select an Optimal Combination of n.seed and n.wave

Description

From the vector of specified n.seeds and possible waves 1:n.wave around each seed, the function selects a single number n.seed and an n.wave (optimal seed-wave combination) that produce a labeled snowball with multiple inclusions (LSMI) sample with desired bootstrap confidence intervals for a parameter of interest. Here, "desired" means that the interval (and the corresponding seed-wave combination) is selected as having the best coverage (closest to the specified level prob), based on a cross-validation procedure with proxy estimates of the parameter. See Algorithm 2 by Gel et al. (2017) and Details below.

Usage

lsmi_cv(
  net,
  n.seeds,
  n.wave,
  seeds = NULL,
  B = 100,
  prob = 0.95,
  cl = 1,
  param = c("mu"),
  method = c("percentile", "basic"),
  proxyRep = 19,
  proxySize = 30
)

Arguments

net

a network object that is a list containing:

degree: the degree sequence of the network, which is an integer vector of length $n$;
edges: the edgelist, which is a two-column matrix, where each row is an edge of the network;
n: the network order (i.e., number of nodes in the network).

The network object can be simulated by random_network, selected from the networks available in artificial_networks, converted from an igraph object with igraph_to_network, etc.
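As an illustration of the three components listed above, the following base-R sketch assembles such a list by hand from a small edgelist. The toy edgelist and the name toy_net are hypothetical, and it is an assumption (not confirmed here) that a hand-built list with 1-based node IDs is accepted everywhere a network object is expected; the helper functions named above remain the safer route.

```r
# Hypothetical 5-node toy network, given as a two-column edgelist
# (each row is one undirected edge between two node IDs).
edges <- matrix(c(1, 2,
                  1, 3,
                  2, 3,
                  3, 4,
                  4, 5), ncol = 2, byrow = TRUE)

# Network order (number of nodes).
n <- 5L

# Degree of each node = number of times its ID appears in the edgelist.
degree <- tabulate(as.vector(edges), nbins = n)

# Assemble the list with the three documented components.
toy_net <- list(degree = degree, edges = edges, n = n)
str(toy_net)
```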

n.seeds

an integer vector of numbers of seeds for snowball sampling (cf. a single integer n.seed in lsmi). Only n.seeds <= n are retained. If seeds is specified, only values n.seeds < length(unique(seeds)) are retained and automatically supplemented by length(unique(seeds)).
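The retention rule described above can be sketched in base R as follows. The function filter_n_seeds is a hypothetical helper written only to mirror the stated rule, not the package's internal code; the final sorting and de-duplication step is an added assumption.

```r
# Hypothetical helper mirroring the documented retention rule for n.seeds.
filter_n_seeds <- function(n.seeds, n, seeds = NULL) {
  # Only values not exceeding the network order are retained.
  n.seeds <- n.seeds[n.seeds <= n]
  if (!is.null(seeds)) {
    # With pre-specified seeds, keep values below the number of unique
    # seeds and supplement them with that number.
    n.max <- length(unique(seeds))
    n.seeds <- c(n.seeds[n.seeds < n.max], n.max)
  }
  # Assumption: return sorted, de-duplicated values.
  sort(unique(n.seeds))
}

filter_n_seeds(c(10, 20, 30), n = 25)
filter_n_seeds(c(10, 20, 30), n = 100, seeds = c(1:15, 15))
```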

n.wave

an integer defining the number of waves (order of the neighborhood) to be recorded around the seed in the LSMI. For example, n.wave = 1 corresponds to an LSMI with the seed and its first neighbors. Note that the algorithm allows for multiple inclusions.

seeds

a vector of numeric IDs of pre-specified seeds. If specified, LSMIs are constructed around each such seed.

B

a positive integer, the number of bootstrap replications to perform. Default is 100.

prob

confidence level for the intervals. Default is 0.95 (i.e., 95% confidence).

cl

parameter to specify computer cluster for bootstrapping, passed to the package parallel (default is 1, meaning no cluster is used). Possible values are:

cluster object (list) produced by makeCluster. In this case, a new cluster is neither started nor stopped;
NULL. In this case, the function will attempt to detect available cores (see detectCores) and, if there are multiple cores ($>1$), a cluster will be started with makeCluster. If started, the cluster will be stopped after computations are finished;
positive integer defining the number of cores to start a cluster. If cl = 1, no attempt to create a cluster will be made. If cl > 1, a cluster will be started (using makeCluster) and stopped afterwards (using stopCluster).
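A minimal sketch of the first option (a user-supplied cluster object) might look as follows. The worker count of 2 and the object names my_cl and res are arbitrary choices for illustration; the lsmi_cv call is guarded so that the sketch still runs when the snowboot package is not installed.

```r
library(parallel)

# Start a small local cluster; 2 workers is an arbitrary choice.
my_cl <- makeCluster(2)

if (requireNamespace("snowboot", quietly = TRUE)) {
  library(snowboot)
  net <- artificial_networks[[1]]
  # Pass the existing cluster; per the description above, the function
  # does not start or stop a cluster in this case.
  res <- lsmi_cv(net, n.seeds = c(10, 20, 30), n.wave = 5, B = 100, cl = my_cl)
}

# The caller created the cluster, so the caller stops it.
stopCluster(my_cl)
```

Reusing one cluster across several calls avoids repeated start-up costs, at the price of having to stop it manually.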

param

The parameter of interest for which to run a cross-validation and select optimal n.seed and n.wave. Currently, only one selection is possible: "mu" (the network mean degree).

method

method for calculating the bootstrap intervals. Default is "percentile" (see Details).

proxyRep

The number of times to repeat proxy sampling. Default is 19.

proxySize

The size of the proxy sample. Default is 30.

Value

A list consisting of:

bci

A numeric vector of length 2 with the bootstrap confidence interval (lower bound, upper bound) for the parameter of interest. This interval is obtained by bootstrapping node degrees in an LSMI with the optimal combination of n.seed and n.wave (the combination is reported in best_combination).

estimate

Point estimate of the parameter of interest (based on the LSMI with n.seed seeds and n.wave waves reported in the best_combination).

best_combination

An integer vector of length 2 containing the optimal n.seed and n.wave selected via cross-validation.

seeds

A vector of numeric IDs of the seeds that were used in the LSMI with the optimal combination of n.seed and n.wave.

Details

Currently, the bootstrap intervals can be calculated with two alternative methods: "percentile" or "basic". The "percentile" intervals correspond to Efron's $100\cdot$prob% intervals (see Efron 1979, also Equation 5.18 by Davison and Hinkley 1997 and Equation 3 by Gel et al. 2017; Chen et al. 2018): $$(\theta^*_{[B\alpha/2]}, \theta^*_{[B(1-\alpha/2)]}),$$ where $\theta^*_{[B\alpha/2]}$ and $\theta^*_{[B(1-\alpha/2)]}$ are empirical quantiles of the bootstrap distribution with B bootstrap replications for parameter $\theta$ ($\theta$ can be $f(k)$ or $\mu$), and $\alpha = 1 -$ prob.
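To make the percentile formula concrete, here is a base-R sketch for a mean. It uses a plain i.i.d. bootstrap of a simulated sample, which is a simplification and not the LSMI bootstrap performed by the package; the sample x and all object names are hypothetical. Note also that quantile() interpolates by default, so its result can differ slightly from the order statistics in the formula.

```r
set.seed(1)

# Hypothetical sample of node degrees (simulated, not from an LSMI).
x <- rpois(50, lambda = 4)

B <- 100          # number of bootstrap replications
prob <- 0.95      # confidence level
alpha <- 1 - prob

# Bootstrap distribution of the sample mean (plain i.i.d. resampling).
theta_star <- replicate(B, mean(sample(x, replace = TRUE)))

# Percentile interval: empirical alpha/2 and 1 - alpha/2 quantiles.
percentile_ci <- quantile(theta_star,
                          probs = c(alpha / 2, 1 - alpha / 2),
                          names = FALSE)
percentile_ci
```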

The "basic" method produces intervals (see Equation 5.6 by Davison and Hinkley 1997): $$(2\hat{\theta} - \theta^*_{[B(1-\alpha/2)]}, 2\hat{\theta} - \theta^*_{[B\alpha/2]}),$$ where $\hat{\theta}$ is the sample estimate of the parameter. Note that this method can lead to negative confidence bounds, especially when $\hat{\theta}$ is close to 0.
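The basic interval reflects the bootstrap quantiles around the sample estimate. The base-R sketch below shows this for a mean, again using a plain i.i.d. bootstrap of a simulated sample as a stand-in for the package's LSMI bootstrap; all object names are hypothetical.

```r
set.seed(1)

# Hypothetical sample of node degrees (simulated, not from an LSMI).
x <- rpois(50, lambda = 4)
theta_hat <- mean(x)   # sample estimate of the parameter

B <- 100
alpha <- 1 - 0.95

# Bootstrap distribution of the sample mean (plain i.i.d. resampling).
theta_star <- replicate(B, mean(sample(x, replace = TRUE)))
q <- quantile(theta_star, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)

# Basic interval: the upper quantile defines the lower bound and vice versa.
basic_ci <- c(2 * theta_hat - q[2], 2 * theta_hat - q[1])
basic_ci
```

The basic interval has the same width as the percentile interval; only its position relative to the estimate changes, which is why a lower bound below 0 is possible for small estimates.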

References

Examples

# NOT RUN {
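# Take the first of the example networks shipped with the package
net <- artificial_networks[[1]]
# Cross-validate over 10, 20, or 30 seeds and 1 to 5 waves
a <- lsmi_cv(net, n.seeds = c(10, 20, 30), n.wave = 5, B = 100)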

# }
