projectKNNs: Project a distance matrix into a lower-dimensional space.

Description

Takes as input a sparse matrix of the edge weights connecting each node to its nearest neighbors, and outputs a matrix of coordinates embedding the inputs in a lower-dimensional space.

Usage

projectKNNs(
  wij,
  dim = 2,
  sgd_batches = NULL,
  M = 5,
  gamma = 7,
  alpha = 1,
  rho = 1,
  coords = NULL,
  useDegree = FALSE,
  momentum = NULL,
  seed = NULL,
  threads = NULL,
  verbose = getOption("verbose", TRUE)
)

Value

A dense [N,D] matrix of the coordinates projecting the w_ij matrix into the lower-dimensional space.

Arguments

wij: A symmetric sparse matrix of edge weights, in C-compressed format, as created with the Matrix package.
dim: numeric Number of dimensions for the projection space (default=2).
sgd_batches: The number of edges to process during SGD (default=NULL). Defaults to a value set based on the size of the dataset. If the parameter given is between 0 and 1, the default value will be multiplied by the parameter.
M: numeric Number of negative edges to sample for each positive edge (default=5).
gamma: numeric Strength of the force pushing non-neighbor nodes apart (default=7).
alpha: numeric Hyperparameter used in the default distance function, $1 / (1 + \alpha \dot ||y_i - y_j||^2)$ (default=1). The function relates the distance between points in the low-dimensional projection to the likelihood that the two points are nearest neighbors. Increasing $\alpha$ tends to push nodes and their neighbors closer together; decreasing $\alpha$ produces a broader distribution. Setting $\alpha$ to zero enables the alternative distance function. $\alpha$ below zero is meaningless.
rho: numeric Initial learning rate (default=1)
coords: An initialized coordinate matrix (default=NULL).
useDegree: boolean Whether to use vertex degree to determine weights (default=FALSE). If TRUE, weights determined in negative sampling; if FALSE, weights determined by the sum of the vertex's edges. See Notes.
momentum: If not NULL (the default), SGD with momentum is used, with this multiplier, which must be between 0 and 1. Note that momentum can drastically speed-up training time, at the cost of additional memory consumed.
seed: numeric Random seed to be passed to the C++ functions (default=NULL). If NULL, sampled from hardware entropy pool. Note that if the seed is not NULL (the default), the maximum number of threads will be set to 1 in phases of the algorithm that would otherwise be non-deterministic.
threads: numeric The maximum number of threads to spawn (default=NULL). Determined automatically if NULL.
verbose: boolean Verbosity (default=getOption("verbose", TRUE))

Details

The algorithm attempts to estimate a dim-dimensional embedding using stochastic gradient descent and negative sampling.

The objective function is: $$ O = \sum_{(i,j)\in E} w_{ij} (\log f(||p(e_{ij} = 1||) + \sum_{k=1}^{M} E_{jk~P_{n}(j)} \gamma \log(1 - f(||p(e_{ij_k} - 1||)))$$ where $f()$ is a probabilistic function relating the distance between two points in the low-dimensional projection space, and the probability that they are nearest neighbors.

The default probabilistic function is $1 / (1 + \alpha \dot ||x||^2)$. If $\alpha$ is set to zero, an alternative probabilistic function, $1 / (1 + \exp(x^2))$ will be used instead.

Note that the input matrix should be symmetric. If any columns in the matrix are empty, the function will fail.

Examples

Run this code

if (FALSE) {
data(CO2)
CO2$Plant <- as.integer(CO2$Plant)
CO2$Type <- as.integer(CO2$Type)
CO2$Treatment <- as.integer(CO2$Treatment)
co <- scale(as.matrix(CO2))
# Very small datasets often produce a warning regarding the alias table.  This is safely ignored.
suppressWarnings(vis <- largeVis(t(co), K = 20, sgd_batches = 1, threads = 2))
suppressWarnings(coords <- projectKNNs(vis$wij, threads = 2))
plot(t(coords))
}

Run the code above in your browser using DataLab