conos (version 1.5.2)

projectKNNs: Project a distance matrix into a lower-dimensional space.

Description

Takes as input a sparse matrix of the edge weights connecting each node to its nearest neighbors, and outputs a matrix of coordinates embedding the inputs in a lower-dimensional space.

Usage

projectKNNs(
  wij,
  dim = 2,
  sgd_batches = NULL,
  M = 5,
  gamma = 7,
  alpha = 1,
  rho = 1,
  coords = NULL,
  useDegree = FALSE,
  momentum = NULL,
  seed = NULL,
  threads = NULL,
  verbose = getOption("verbose", TRUE)
)

Value

A dense [N,D] matrix of the coordinates projecting the w_ij matrix into the lower-dimensional space.

Arguments

wij

A symmetric sparse matrix of edge weights, in compressed column (C-compressed) format, as created with the Matrix package.

dim

numeric Number of dimensions for the projection space (default=2).

sgd_batches

The number of edges to process during SGD (default=NULL). Defaults to a value set based on the size of the dataset. If the parameter given is between 0 and 1, the default value will be multiplied by the parameter.

M

numeric Number of negative edges to sample for each positive edge (default=5).

gamma

numeric Strength of the force pushing non-neighbor nodes apart (default=7).

alpha

numeric Hyperparameter used in the default distance function, \(1 / (1 + \alpha \cdot ||y_i - y_j||^2)\) (default=1). The function relates the distance between points in the low-dimensional projection to the likelihood that the two points are nearest neighbors. Increasing \(\alpha\) tends to push nodes and their neighbors closer together; decreasing \(\alpha\) produces a broader distribution. Setting \(\alpha\) to zero enables the alternative distance function. \(\alpha\) below zero is meaningless.

rho

numeric Initial learning rate (default=1).

coords

An initialized coordinate matrix (default=NULL).

useDegree

boolean Whether to use vertex degree to determine weights in negative sampling (default=FALSE). If TRUE, the weights are determined by vertex degree; if FALSE, by the sum of the weights of the vertex's edges. See Notes.

momentum

Momentum multiplier for SGD (default=NULL). If not NULL, SGD with momentum is used with this multiplier, which must be between 0 and 1. Momentum can drastically reduce training time, at the cost of additional memory.
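
To make the memory cost concrete, here is a minimal base-R sketch of a classical momentum update on a toy quadratic objective. It is a generic textbook illustration, not the package's C++ implementation, which may differ in detail; `sgd_momentum` and all of its arguments are hypothetical names chosen for this example. The extra memory is the velocity vector `v`, which holds one value per coordinate.

```r
# Generic gradient descent with classical momentum (illustration only;
# sgd_momentum is a hypothetical helper, not part of conos).
sgd_momentum <- function(grad, x0, rho = 0.1, momentum = 0.9, steps = 200) {
  x <- x0
  v <- numeric(length(x0))            # velocity: one extra value per coordinate
  for (s in seq_len(steps)) {
    v <- momentum * v - rho * grad(x) # blend previous step with the new gradient
    x <- x + v
  }
  x
}

# Toy objective: sum((x - target)^2), whose gradient is 2 * (x - target).
target <- c(1, -2)
x_hat <- sgd_momentum(function(x) 2 * (x - target), x0 = c(0, 0))
```

Here `rho` plays the role of the learning rate and `momentum` the role of the multiplier, which is why the multiplier must stay below 1 for the velocity to decay.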

seed

numeric Random seed to be passed to the C++ functions (default=NULL). If NULL, the seed is sampled from the hardware entropy pool. If a seed is supplied, the maximum number of threads will be set to 1 in phases of the algorithm that would otherwise be non-deterministic.

threads

numeric The maximum number of threads to spawn (default=NULL). Determined automatically if NULL.

verbose

boolean Verbosity (default=getOption("verbose", TRUE)).

Details

The algorithm attempts to estimate a dim-dimensional embedding using stochastic gradient descent and negative sampling.

The objective function is: $$ O = \sum_{(i,j)\in E} w_{ij} \left( \log f(||y_i - y_j||) + \sum_{k=1}^{M} E_{j_k \sim P_n(j)} \gamma \log(1 - f(||y_i - y_{j_k}||)) \right) $$ where \(y_i\) is the position of node \(i\) in the projection, \(j_k\) is a negative sample drawn from the noise distribution \(P_n(j)\), and \(f()\) is a probabilistic function relating the distance between two points in the low-dimensional projection space to the probability that they are nearest neighbors.

The default probabilistic function is \(1 / (1 + \alpha \cdot ||x||^2)\). If \(\alpha\) is set to zero, an alternative probabilistic function, \(1 / (1 + \exp(x^2))\), will be used instead.
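
As a rough illustration of these two functions, the following base-R sketch evaluates them on a few distances. The names `f_default` and `f_alt` are hypothetical helpers written for this example and are not part of the package, which evaluates the functions internally in C++.

```r
# Hypothetical helpers mirroring the two formulas; not part of conos.
f_default <- function(d, alpha = 1) 1 / (1 + alpha * d^2)
f_alt     <- function(d) 1 / (1 + exp(d^2))

d <- c(0, 0.5, 1, 2)          # distances in the projection space
p1 <- f_default(d, alpha = 1)
p4 <- f_default(d, alpha = 4) # larger alpha: the value falls off faster with distance
pa <- f_alt(d)                # alternative form, selected by alpha = 0
```

Both functions decrease as the distance grows, so nearby points in the projection are treated as more likely to be nearest neighbors.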

Note that the input matrix should be symmetric. If any columns in the matrix are empty, the function will fail.
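
The two requirements on the input can be checked before calling the function. The sketch below builds a toy symmetric edge-weight matrix with the Matrix package and verifies it; `check_wij` is a hypothetical helper written for this example, not a conos function, and the edge weights are made up.

```r
library(Matrix)

# Toy 4-node graph: each row is one undirected edge (i, j, weight), with i < j.
edges <- rbind(c(1, 2, 0.9), c(2, 3, 0.5), c(3, 4, 0.7), c(1, 4, 0.2))

# Build the upper triangle, then mirror it to obtain a symmetric matrix.
upper <- sparseMatrix(i = edges[, 1], j = edges[, 2], x = edges[, 3],
                      dims = c(4, 4))
wij <- upper + t(upper)

# Hypothetical pre-flight check (not part of conos).
check_wij <- function(w) {
  stopifnot(inherits(w, "CsparseMatrix")) # compressed column storage
  stopifnot(isSymmetric(w))               # w[i, j] equals w[j, i]
  stopifnot(all(colSums(w != 0) > 0))     # no empty columns
  invisible(TRUE)
}
check_wij(wij)
```

A node with no edges produces an empty column, so isolated nodes would need to be removed or connected before projection.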

Examples

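if (FALSE) {
# Not run. largeVis() is assumed here to come from the largeVis package;
# it produces the sparse edge-weight matrix (vis$wij) that projectKNNs() consumes.
data(CO2)
# Convert the factor columns to integer codes so the data frame becomes a numeric matrix.
CO2$Plant <- as.integer(CO2$Plant)
CO2$Type <- as.integer(CO2$Type)
CO2$Treatment <- as.integer(CO2$Treatment)
co <- scale(as.matrix(CO2))
# Very small datasets often produce a warning regarding the alias table. This can be safely ignored.
vis <- suppressWarnings(largeVis(t(co), K = 20, sgd_batches = 1, threads = 2))
coords <- suppressWarnings(projectKNNs(vis$wij, threads = 2))
plot(t(coords))
}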