Learn R Programming

SamplingBigData (version 1.0.0)

SamplingBigData-package: Sampling Methods for Big Data

Description

Select sampling methods for probability samples using large data sets. This includes spatially balanced sampling in multi-dimensional spaces with any prescribed inclusion probabilities. All implementations are written in C using efficient data structures such as k-d trees that easily scale to several million rows on a modern desktop computer.

Arguments

References

Deville, J.-C. and Till<U+00E9>, Y. (1998). Unequal probability sampling without replacement through a splitting method. Biometrika 85, 89-101.

Grafstr<U+00F6>m, A. (2012). Spatially correlated Poisson sampling. Journal of Statistical Planning and Inference, 142(1), 139-147.

Grafstr<U+00F6>m, A. and Lundstr<U+00F6>m, N.L.P. (2013). Why well spread probability samples are balanced. Open Journal of Statistics, 3(1).

Grafstr<U+00F6>m, A. and Schelin, L. (2014). How to select representative samples. Scandinavian Journal of Statistics.

Grafstr<U+00F6>m, A., Lundstr<U+00F6>m, N.L.P. and Schelin, L. (2012). Spatially balanced sampling through the Pivotal method. Biometrics 68(2), 514-520.

Grafstr<U+00F6>m, A. and Till<U+00E9>, Y. (2013). Doubly balanced spatial sampling with spreading and restitution of auxiliary totals. Environmetrics, 24(2), 120-131.

Lisic, L. and Cruze, N. (2016). Local Pivotal Methods for Large Surveys. In proceedings, ICES V, Geneva Switzerland 2016.

Examples

Run this code
# NOT RUN {
# *********************************************************
# check inclusion probabilities
# *********************************************************
set.seed(1234567);
p = c(0.2, 0.25, 0.35, 0.4, 0.5, 0.5, 0.55, 0.65, 0.7, 0.9);
N = length(p);
X = cbind(runif(N),runif(N));
p1 = p2 = p3 = p4 = rep(0,N);
nrs = 1000; # increase for more precision 
for(i in seq(nrs) ){

  # lpm2 kdtree
  s = lpm2_kdtree(p,X);
  p1[s]=p1[s]+1;
  
  # pivotal method 
  s = split_sample(p);
  p2[s]=p2[s]+1;
  
}
print(p);
print(p1/nrs);
print(p2/nrs);
# }

Run the code above in your browser using DataLab