multistageoptimal.nlm: Function for optimizing n-stage selection with the NLM algorithm for a fixed correlation matrix

Description

This function uses the non-linear minimization function nlm in the R package stats to optimize n-stage selection.

Usage

multistageoptimal.nlm(N.upper, N.lower, corr, ini.value, 
Budget, CostC, CostTv, N.fs, iterlim, alg)

Arguments

N.upper

Vector with length n. It is the vector of upper limits of candidates X.

N.lower

Vector with length n. It is the vector of lower limits of candidates X.

corr

(n+1)-dimensional matrix. It is the correlation matrix $\bm{\Sigma}^{*}$ of the true value y and the selection indices X. For more detail, see multistagegain.each.

ini.value

Vector with length n. It stores the number of candidates in each stage with which the algorithm starts. By default, it uses $N=\{N_1,N_2,\ldots,N_n\}=\{a+1,\ldots,a+n\}$, where a is defined as Budget/(CostC+sum(CostTv)+1).

Budget

A double value. It contains the value of total budget.

CostC

A double value. It contains the costs of producing and identifying a candidate.

CostTv

Vector with length n. It contains the cost of evaluating a candidate in the tests performed at stage i, i=1,...,n. The cost may differ between stages.

N.fs

An integer value. It is the number of finally selected candidates, $N_{n+1}$.

iterlim

An integer value. It is the maximum number of iterations to be executed before the Newton algorithm is terminated. By default it is equal to 20. If the $\texttt{Budget}$ increases 10 times for making the selection, the value of $\texttt{iterlim}$ has to b

alg

An object used to switch between two algorithms. For more detail, see multistagegain.
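The default starting vector described under ini.value can be reproduced by hand. The following is a minimal sketch in base R that only evaluates the formula quoted above; the numeric inputs are hypothetical and chosen for illustration, and the code does not call the package.

```r
# Hypothetical inputs for a three-stage scheme (illustrative values only)
Budget <- 200
CostC  <- 0.5
CostTv <- c(1, 1, 1)
n      <- length(CostTv)

# a as defined in the description of ini.value
a <- Budget / (CostC + sum(CostTv) + 1)

# Default-style starting vector (a+1, ..., a+n)
ini.value <- a + seq_len(n)
print(ini.value)
```

Passing an explicit ini.value instead of relying on this default is one way to try several starting points for the optimizer.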

Value

The output of this function is a vector similar to that returned by multistageoptimal.grid with detail = FALSE. However, the optimal number of candidates in each stage determined by the NLM algorithm is generally not an integer, because the function uses a numerical algorithm that depends on derivatives.
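Because the returned stage sizes are usually not integers, they have to be rounded before use. The sketch below shows one cautious option, rounding down, under an assumed cost function. The allocation N.opt, the helper cost() and the cost structure (producing the first-stage candidates plus testing the candidates of every stage) are all assumptions made for illustration and are not taken from the package.

```r
# Hypothetical non-integer allocation, as a numerical optimizer might return it
N.opt <- c(95.4, 40.7, 15.6)

# Assumed cost function (not from the package): producing the N[1] initial
# candidates plus testing N[i] candidates at stage i
cost <- function(N, CostC, CostTv) N[1] * CostC + sum(N * CostTv)

Budget <- 200
CostC  <- 0.5
CostTv <- c(1, 1, 1)

# With non-negative costs, rounding down cannot increase this cost
N.int <- floor(N.opt)
within.budget <- cost(N.int, CostC, CostTv) <= Budget
print(N.int)
print(within.budget)
```

Rounding to the nearest integer may give a slightly higher gain, but the budget check then has to be repeated for the rounded allocation.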

Details

Suppose we start with $N_1$ candidates in stage one. Based on the evaluation of the $N_i$ candidates in stage $i$, the best $N_{i+1}$ candidates, i.e., those with $x_i \geq q_i$, are promoted to the next stage, where they are evaluated with even higher intensity to obtain more precise estimates of the true value $y$. The goal of the whole selection scheme is to select the best $N_{n+1}$ candidates after $n$ stages of selection.

In practice, the selection program has only a limited budget $B$ to cover all costs, such as (i) identifying or producing the initial $N_1$ candidates and (ii) evaluating the $N_i$ candidates in stage $i$. For a given testing scheme with $N_i$ candidates in the $i$-th stage of selection ($i=1,\ldots,n$), collected in $\textbf{N}=(N_1,\ldots,N_n)$, the costs may be given by the cost function $C(\omega)$. Thus, the set of admissible allocations $\Omega (B)$ of the candidates to the various stages of selection is given by $$\Omega (B):= \{ \omega =\textbf{N} \mid C(\omega)\leq B \}.$$ Hence, our goal is to find $\tilde{\omega} \in \Omega (B)$ with $$\Delta G(y|\textbf{S}_{\tilde{\omega}}, \bm{\Sigma}^{*}) = \max_{\omega \in \Omega (B)} \Delta G(y|\textbf{S}_{\omega},\bm{\Sigma}^{*}),$$ where $\textbf{S}_{\omega}$ refers to the truncation point $\textbf{Q}$ corresponding to $\textbf{A}=\{ \alpha_1,\ldots,\alpha_n \}$, with $\alpha_i=N_{i+1}/N_i$ for $i=1,\ldots,n$.

The matrix $\bm{\Sigma}^{*}$ is determined by the correlations among the test scores $x_i$ obtained in the $n$ stages of selection as well as their correlations to the target value $y$. Hence, for given but possibly different testing procedures in each stage, $\bm{\Sigma}^{*}$ is fixed, independent of the choice of $\textbf{N}$. In many applications in breeding and other fields, the choice of $\textbf{N}$ does not affect the correlation matrix $\bm{\Sigma}^{*}$ for the candidates.
Examples include different types of average in the various stages of selection, such as tests for evaluating disease symptoms (e.g., test of fusarium resistance by visual recording of disease symptoms, estimation of mycotoxin concentration by NIRS, ELISA or GC-MS), or genomic usages with different prediction accuracy and costs (marker arrays with different coverage of the genome, transcript or metabolic profiles). All these situations can be handled within the framework outlined above.

The simplest way to find the maximum is a full scan of the entire set $\Omega (B)$, which calculates $\Delta G(y|\textbf{S}_{\omega}, \bm{\Sigma}^{*})$ for all possible allocations $\omega \in \Omega (B)$ to determine the $\tilde{\omega}$ yielding the largest $\Delta G$. However, this is very time consuming. An alternative is grid search, which divides the whole set $\Omega (B)$ into several grids (Kim, 1997). Another way of finding the maximum is to use the optimization algorithm for Non-Linear Minimization (NLM) provided by the function nlm in package stats. It uses a Newton-type algorithm to search for the maximum of a multi-modal function. This algorithm depends heavily on the starting point, the maximum number of iterations and the numerical derivatives of $\Delta G$, and yields an accuracy of fewer than four digits. For maximizing the selection gain, the NLM algorithm will converge to the global maximum, provided that a proper initial value is chosen.
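The dependence on the starting point can be illustrated with nlm itself. The sketch below is not the package's objective: toy.gain is a hypothetical one-dimensional function with two maxima that merely stands in for $\Delta G$, and the starting values are chosen for illustration. Since nlm minimizes, the maximum is searched for by minimizing the negative of the function.

```r
# Toy bimodal objective standing in for the selection gain (illustration only)
toy.gain <- function(x) exp(-(x - 2)^2) + 2 * exp(-(x - 8)^2)

# nlm minimizes, so the gain is maximized by minimizing its negative
neg.gain <- function(x) -toy.gain(x)

# Two hypothetical starting values, one near each maximum
starts <- c(1.5, 8.5)
fits <- lapply(starts, function(s) nlm(neg.gain, p = s, iterlim = 100))

# Collect the end point and the attained gain for each starting value
res <- data.frame(
  start   = starts,
  optimum = vapply(fits, function(f) f$estimate, numeric(1)),
  gain    = vapply(fits, function(f) -f$minimum, numeric(1))
)
print(res)

# Keep the run with the largest gain
best <- res[which.max(res$gain), ]
print(best)
```

Each run ends at the maximum nearest to its starting value, so comparing several starts and keeping the best is a simple safeguard against stopping at a local maximum.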

Examples

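# Assumes the function is provided by the selectiongain package
library(selectiongain)

# Correlation matrix of the true value y and the selection indices X (n = 3 stages)
corr <- matrix(c(1,      0.3508, 0.3508, 0.4979,
                 0.3508, 1,      0.3016, 0.5630,
                 0.3508, 0.3016, 1,      0.5630,
                 0.4979, 0.5630, 0.5630, 1),
               nrow = 4)

multistageoptimal.nlm(N.upper = rep(100, 3), corr = corr, Budget = 200,
                      CostC = 0.5, N.fs = 5)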