splitSample: Select samples from along an environmental gradient

Description

Select samples from along an environmental gradient by splitting the gradient into discrete chunks and sampling within each chunk. This allows, for example, a test set to be selected that covers the environmental gradient of the training set.

Usage

splitSample(env, chunk = 10, take, nchunk,
            fill = c("head", "tail", "random"),
            maxit = 1000)

Arguments

env

numeric; vector of samples representing the gradient values.

chunk

numeric; number of chunks to split the gradient into.

take

numeric; how many samples to take from the gradient. Cannot be missing.

nchunk

numeric; number of samples per chunk. Must be a vector of length chunk, and sum(nchunk) must equal take. Can be missing (the default), in which case some simple heuristics are used to determine the number o

fill

character; the type of filling of chunks to perform. See Details.

maxit

numeric; maximum number of iterations in which to try to sample take observations. This is a safeguard that stops the sampling loop from running indefinitely.

Value

A numeric vector of indices of selected samples. This vector has attribute lengths which indicates how many samples were actually chosen from each chunk.
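To show the shape of the return value, the sketch below builds a mock index vector by hand and reads its lengths attribute with base R's attr(). The indices and counts are invented for illustration and are not output from splitSample.

```r
# Mock of a splitSample()-style result: the indices and counts below are
# made up for illustration and are not real output.
idx <- structure(c(4L, 9L, 15L, 21L, 30L), lengths = c(2, 1, 2))

# Number of samples chosen from each chunk
per_chunk <- attr(idx, "lengths")
per_chunk

# The per-chunk counts add up to the number of indices returned
sum(per_chunk) == length(idx)
```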

Details

The gradient is split into chunk sections and samples are selected from each chunk to result in a sample of length take. If take is divisible by chunk without remainder then there will be an equal number of samples selected from each chunk. Where take is not a multiple of chunk and nchunk is not specified, extra samples have to be allocated to some of the chunks to reach the required number of samples.
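The splitting step can be sketched in base R. This is a minimal illustration that assumes the chunks are equal-width intervals over the range of env, as produced by cut(); the function's internals may differ. The gradient values are simulated and hypothetical.

```r
# Assumption: chunks are equal-width intervals over the range of env,
# as produced by base R's cut(); splitSample() may differ internally.
set.seed(1)
env <- runif(50, min = 4, max = 8)   # hypothetical gradient values (e.g. pH)
chunk <- 10

# Assign each sample to one of `chunk` intervals
grp <- cut(env, breaks = chunk)

# Number of samples available in each chunk (empty chunks count as 0)
avail <- as.vector(table(grp))
avail

# Even split when take is divisible by chunk; otherwise a remainder is left
take <- 25
c(base = take %/% chunk, extra = take %% chunk)
```

With take = 25 and chunk = 10, each chunk gets a base allocation of 2 and 5 extra samples remain to be distributed.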

An additional complication is that some chunks of the gradient may contain fewer samples than their nchunk allocation, so more samples need to be selected from the remaining chunks until take samples are chosen.

If nchunk is supplied, it must be a vector stating exactly how many samples to select from each chunk. If nchunk is not supplied, then the number of samples per chunk is determined as follows:

An initial allocation of floor(take / chunk) is assigned to each chunk.
If any chunks have fewer samples than this initial allocation, these elements of nchunk are reset to the number of samples in those chunks.
Sequentially, an extra sample is allocated to each chunk with sufficient available samples until take samples are selected.
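The three steps above can be sketched as follows. The helper allocate() is hypothetical and not part of any package; it works on a vector of per-chunk sample counts (here made up), visits chunks from the low end to the high end only, and uses maxit as a cap on passes in case take exceeds the total number of available samples.

```r
# Hypothetical helper, not part of any package: allocate `take` samples
# across chunks given the number of samples available in each chunk.
allocate <- function(avail, take, maxit = 1000) {
  chunk <- length(avail)
  # Step 1: initial allocation of floor(take / chunk) per chunk
  nchunk <- rep(floor(take / chunk), chunk)
  # Step 2: cap the allocation at what each chunk can supply
  nchunk <- pmin(nchunk, avail)
  # Step 3: hand out one extra sample at a time, low end to high end,
  # skipping chunks that have no samples left
  it <- 0
  while (sum(nchunk) < take && it < maxit) {
    it <- it + 1
    for (i in seq_len(chunk)) {
      if (sum(nchunk) >= take) break
      if (nchunk[i] < avail[i]) nchunk[i] <- nchunk[i] + 1
    }
  }
  nchunk
}

avail <- c(1, 5, 8, 3, 0, 6)   # made-up counts of samples per chunk
res <- allocate(avail, take = 14)
res
# → 1 4 3 3 0 3
```

Here the initial allocation is 2 per chunk, capped to 1 and 0 in the first and fifth chunks. One full pass adds a sample to each chunk with spare capacity, and the second pass stops as soon as 14 samples are reached.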

Argument fill controls the order in which the chunks are filled. fill = "head" fills from the low to the high end of the gradient, whilst fill = "tail" fills in the opposite direction. Chunks are filled in random order if fill = "random". In all cases, no chunk receives more than one extra sample until every chunk that can supply an extra sample has received one. For fill = "head" or fill = "tail", this entails moving along the gradient from one end to the other, allocating an extra sample to each available chunk, before starting along the gradient again. For fill = "random", a random order of chunks to fill is determined once. If an extra sample has been allocated to each chunk in that order and take samples are still not selected, filling begins again using the same random ordering.
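The visiting order implied by each fill option can be illustrated with a small sketch. The helper fill_order() is hypothetical, and chunk 1 is assumed to sit at the low end of the gradient.

```r
# Hypothetical helper: order in which chunks are visited when handing out
# extra samples. Chunk 1 is assumed to be the low end of the gradient.
fill_order <- function(chunk, fill = c("head", "tail", "random")) {
  fill <- match.arg(fill)
  switch(fill,
         head = seq_len(chunk),          # low to high
         tail = rev(seq_len(chunk)),     # high to low
         random = sample.int(chunk))     # one random permutation
}

fill_order(5, "head")
fill_order(5, "tail")

set.seed(42)
ord <- fill_order(5, "random")
# The same ordering is reused on every pass along the gradient
rep(ord, times = 2)
```

Drawing the permutation once and repeating it mirrors the description above: the random order is chosen a single time rather than reshuffled on each pass.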

Examples

data(swappH)

## take a test set of 20 samples along the pH gradient
test1 <- splitSample(swappH, chunk = 10, take = 20)
test1
swappH[test1]

## take a larger sample where some chunks don't have many samples
## do random filling
set.seed(3)
test2 <- splitSample(swappH, chunk = 10, take = 70, fill = "random")
test2
swappH[test2]
