dem_sample: Randomly sample documents from a dem

Description

Take a random sample of documents from a dem, with or without replacement, optionally grouped by a variable in dem@docvars. Note: dem_sample uses dplyr::sample_frac under the hood, so size refers to the fraction of total observations.

Usage

dem_sample(x, size = NULL, replace = FALSE, weight = NULL, by = NULL)

Value

A size x D (dem-class) document-embedding-matrix of the sampled ALC embeddings. Note that @features in the resulting object is the original @features; it is not subset to the sampled documents. To list the documents that were sampled, inspect the slot @Dimnames$docs.

Arguments

x: a (dem-class) document-embedding-matrix
size: <tidy-select> For sample_n(), the number of rows to select. For sample_frac(), the fraction of rows to select. If tbl is grouped (here, when by is supplied), size applies to each group.
replace: Sample with or without replacement?
weight: (numeric) Sampling weights. Vector of non-negative numbers of length nrow(x). Weights are automatically standardised to sum to 1 (see dplyr::sample_frac). May not be applied when by is used.
by: (character or factor vector) either a single string naming the grouping variable for sampling, or a vector of length nrow(x) giving the group of each document. A variable name must be quoted, e.g. "party", and must be a variable in dem@docvars.
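
The size, weight and grouping semantics described above come from dplyr. The following is a minimal sketch of how dplyr::sample_frac itself behaves on a toy data frame; it does not call dem_sample, and the data frame and its columns (doc, party, w) are hypothetical stand-ins for dem@docvars.

```r
library(dplyr)

set.seed(42)

# Hypothetical stand-in for dem@docvars: 10 documents, two parties.
# w gives zero sampling weight to documents 6-10.
docvars <- data.frame(
  doc   = paste0("doc", 1:10),
  party = rep(c("D", "R"), times = c(6, 4)),
  w     = rep(c(1, 0), each = 5)
)

# Ungrouped: size is a fraction of all rows, so 0.5 * 10 = 5 rows.
ungrouped <- sample_frac(docvars, size = 0.5)

# Grouped: size applies within each group, so 3 "D" rows and 2 "R" rows.
grouped <- docvars %>%
  group_by(party) %>%
  sample_frac(size = 0.5)

# Weighted: rows with zero weight are never drawn.
weighted <- sample_frac(docvars, size = 0.3, weight = w)

# A fraction above 1 is only possible when sampling with replacement.
oversampled <- sample_frac(docvars, size = 2, replace = TRUE)

nrow(ungrouped)
table(grouped$party)
weighted$doc
nrow(oversampled)
```

With grouping, the sample keeps the group proportions of the input, which is the usual reason to sample by a variable such as party.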

Examples

library(quanteda)
library(conText)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform,
                 verbose = FALSE)

# draw a random sample, grouped by the docvar "party"
immig_wv_party <- dem_sample(immig_dem, size = 10,
                             replace = TRUE, by = "party")

# also works: pass the grouping vector itself (length nrow(x))
immig_wv_party <- dem_sample(immig_dem, size = 10,
                             replace = TRUE, by = immig_dem@docvars$party)
