# Four short example "documents" with deliberately contrasting topics
# (cats vs. cars), used below to build a document-term matrix.
# Each element is one document; the "Oh year?" typo is part of the
# original example text and is kept as-is.
text <- c(
  "Hey, I like kittens. I think all kinds of cats really are just the best pet ever.",
  "Oh year? Well I really like cars. All the wheels and the turbos... I think that's the best ever.",
  "You know what? Poo on you. Cats, dogs, rabbits -- you know, living creatures... to think you'd care about anything else!",
  "You can stick to your opinion. You can be wrong if you want. You know what life's about? Supercharging, diesel guzzling, exhaust spewing, piston moving ignitions."
)
# build a document-term matrix from the example documents
# NOTE(review): lma_dtm is from the lingmatch package; presumably rows are
# documents and columns are terms -- confirm against its documentation
dtm <- lma_dtm(text)
# calculate a latent semantic space from the example text
lss <- lma_lspace(dtm)
# show that document similarities between the truncated and full space are the same
# keep.dim = TRUE presumably retains all dimensions ("full" space), while
# passing lss as the second argument maps dtm into the truncated space --
# verify both in the lma_lspace help page
spaces <- list(
full = lma_lspace(dtm, keep.dim = TRUE),
truncated = lma_lspace(dtm, lss)
)
# cosine similarities between documents should match across the two spaces
sapply(spaces, lma_simets, metric = "cosine")
# The following demonstrates use of pretrained spaces; it is wrapped in
# if (FALSE) so it is never run automatically (it requires downloaded
# space files on disk).
if (FALSE) {
# specify a directory containing spaces,
# or where you would like to download spaces
space_dir <- "~/Latent Semantic Spaces"
# map to a pretrained space
# NOTE(review): "100k" presumably names a downloadable pretrained space;
# confirm available space names in the lingmatch documentation
ddm <- lma_lspace(dtm, "100k", dir = space_dir)
# load the matching subset of the space
# without mapping
lss_100k_part <- lma_lspace(colnames(dtm), "100k", dir = space_dir)
## or
# map.space = FALSE apparently loads the subset without projecting dtm --
# verify against the lma_lspace argument documentation
lss_100k_part <- lma_lspace(dtm, "100k", map.space = FALSE, dir = space_dir)
# load the full space
lss_100k <- lma_lspace("100k", dir = space_dir)
## or
lss_100k <- lma_lspace(space = "100k", dir = space_dir)
}
# Run the code above in your browser using DataLab