subsample: Serial subsampling wrapper function

Description

The function will take a desired function that has an occurrence dataset as an argument, and reruns it iteratively on the subsets of the dataset.

Usage

subsample(dat, q, tax = "genus", bin = "stg", FUN = divDyn,
  coll = NULL, ref = NULL, iter = 50, type = "cr", keep = NULL,
  rem = NULL, duplicates = TRUE, output = "arit",
  useFailed = FALSE, ...)

Arguments

dat

(data.frame): Occurrence dataset, with bin, tax and coll as column names.

(numeric): Subsampling level argument (mandatory). Depends on the subsampling function, it is the number of occurrences for "cr", and the number of desired occurrences to the power of x for O^x^W. It is also the quorum of the SQS method.

tax

(character): The name of the taxon variable.

bin

(character): The name of the subsetting variable (has to be integer). For time series, this is the time-slice variable. Rows with NA entries in this column will be omitted.

FUN

(function): The function to be iteratively executed on the results of the subsampling trials. If set to NULL, no function will be executed, and the subsampled datasets will be returned as a list. By default set to the divDyn function. The function must have an argument called dat, that represents the dataset resulting from a subsampling trial (or the entire dataset). Arguments of the subsample function call will be searched for potential arguments of this function, which means that already provided variables (e.g. bin and tax) will also be used. You can also provide additional arguments (similarly to the apply iterator). Functions that allow arguments to pass through (that have argument '...') are not allowed, as well as functions that have the same arguments as subsample but would require different values.

coll

(character): The variable name of the collection identifiers.

ref

(character): The name of the reference variable, optional - depending on the subsampling method.

iter

(numeric): The number of iterations to be executed.

type

(character): The type of subsampling to be implemented. By default this is classical rarefaction ("cr"). ("oxw") stands for occurrence weighted by-list subsampling. If set to ("sqs"), the program will execute the shareholder quorum subsampling algorithm as it was suggested by Alroy (2010). Setting the argument to "none" will invoke no subsamling, but the applied function will be iterated on the trials, nevertheless.

keep

(numeric): The bins, which will not be subsampled but will be added to the subsampling trials. NIf the number of occurrences does not reach the subsampling quota, by default it will not be represented in the subsampling trials. You can force their inclusion with the keep argument separetely (for all, see the useFailed argument).

rem

(numeric): The bins, which will be removed from the dataset before the subsampling trials.

duplicates

(logical ): Toggles whether multiple entries from the same taxon ("tax") and collection ("coll") variables should be omitted. Useful for omitting occurrences of multiple species-level occurrences of the same genus. By default these are allowed through analyses (duplicates=TRUE), setting this to FALSE will require you to provide a collection variable. (coll)

output

(character): If the function output are vectors or matrices, the "arit" and "geom" values will trigger simple averaging with arithmetic or geometric means. If the function output of a single trial is again a vector or a matrix, setting the output to "dist" will return the calculated results of every trial, organized in a list of independent variables (e.g. if the function output is value, the return will contain a single vector, if it is a vector, the output will be a list of vectors, if the function output is a data.frame, the output will be a list of matrix class objects). If output="list", the structure of the original function output will be retained, and the results of the individual trials will be concatenated to a list.

useFailed

(logical): If the bin does not reach the subsampling quota, should the bin be used?

...

arguments passed to FUN and the type-specific subsampling functions: subtrialCR, subtrialOXW, subtrialSQS

Details

The subsample function implements the iterative framework of the sampling standardization procedure. The function 1. takes the dataset dat, 2. runs function FUN on the dataset and creates a container for results of trials 3. runs one of the subsampling trial functions (e.g. subtrialCR) to get a subsampled 'trial dataset' 4. runs FUN on the trial dataset and 5. averages the results of the trials. For a detailed treatment on what the function does, please see the vignette ('Handout to the R package 'divDyn' v0.5.0 for diversity dynamics from fossil occurrence data'). Currently the Classical Rarefaction ("cr", Raup, 1975), the occurrence weighted by-list subsampling ("oxw", Alroy et al., 2001) and the Shareholder Quorum Subsampling methods are implemented ("sqs", Alroy, 2010).

References:

Alroy, J., Marshall, C. R., Bambach, R. K., Bezusko, K., Foote, M., F<U+00FC>rsich, F. T., <U+2026> Webber, A. (2001). Effects of sampling standardization on estimates of Phanerozoic marine diversification. Proceedings of the National Academy of Science, 98(11), 6261-6266.

Alroy, J. (2010). The Shifting Balance of Diversity Among Major Marine Animal Groups. Science, 329, 1191-1194. https://doi.org/10.1126/science.1189910

Raup, D. M. (1975). Taxonomic Diversity Estimation Using Rarefaction. Paleobiology, 1, 333-342. https: //doi.org/10.2307/2400135

Examples

Run this code

# NOT RUN {
data(corals)
data(stages)
# Example 1-calculate metrics of diversity dynamics
  dd <- divDyn(corals, tax="genus", bin="stg")
  rarefDD<-subsample(corals,iter=50, q=50,
  tax="genus", bin="stg", output="dist", keep=95)
	
# plotting
  tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), 
  ylab="range-through diversity (genera)", ylim=c(0,230))
  lines(stages$mid, dd$divRT, lwd=2)
  shades(stages$mid, rarefDD$divRT, col="blue")
  legend("topleft", legend=c("raw","rarefaction"),
    col=c("black", "blue"), lwd=c(2,2), bg="white")
  
  
  
# }
# NOT RUN {
# Example 2-SIB diversity 
# draft a simple function to calculate SIB diversity
sib<-function(dat, bin, tax){
  calc<-tapply(INDEX=dat[,bin], X=dat[,tax], function(y){
    length(levels(factor(y)))
  })
  return(calc[as.character(stages$stg)])
}
sibDiv<-sib(corals, bin="stg", tax="genus")

# calculate it with subsampling
rarefSIB<-subsample(corals,iter=50, q=50,
  tax="genus", bin="stg", output="arit", keep=95, FUN=sib)
rarefDD<-subsample(corals,iter=50, q=50,
  tax="genus", bin="stg", output="arit", keep=95)

# plot
tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), 
  ylab="SIB diversity (genera)", ylim=c(0,230))

lines(stages$mid, rarefDD$divSIB, lwd=2, col="black")
lines(stages$mid, rarefSIB, lwd=2, col="blue")


# Example 3 - different subsampling types with default function (divDyn)
# compare different subsampling types
  # classical rarefaction
  cr<-subsample(corals,iter=50, q=20,tax="genus", bin="stg", output="dist", keep=95)
  # by-list subsampling (unweighted) - 3 collections
  UW<-subsample(corals,iter=50, q=3,tax="genus", bin="stg", coll="collection_no", 
    output="dist", keep=95, type="oxw", x=0)
  # occurrence weighted by list subsampling
  OW<-subsample(corals,iter=50, q=20,tax="genus", bin="stg", coll="collection_no", 
    output="dist", keep=95, type="oxw", x=1)
 
  SQS<-subsample(corals,iter=50, q=0.4,tax="genus", bin="stg", output="dist", keep=95, type="sqs")

# plot
  tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), 
  ylab="range-through diversity (genera)", ylim=c(0,100))
  shades(stages$mid, cr$divRT, col="red")
  shades(stages$mid, UW$divRT, col="blue")
  shades(stages$mid, OW$divRT, col="green")
  shades(stages$mid, SQS$divRT, col="cyan")
  
  legend("topleft", bg="white", legend=c("CR (20)", "UW (3)", "OW (20)", "SQS (0.4)"), 
    col=c("red", "blue", "green", "cyan"), lty=c(1,1,1,1), lwd=c(2,2,2,2))
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab