Takes as input a sparse matrix of the edge weights connecting each node to its nearest neighbors, and outputs a matrix of coordinates embedding the inputs in a lower-dimensional space.
projectKNNs(
wij,
dim = 2,
sgd_batches = NULL,
M = 5,
gamma = 7,
alpha = 1,
rho = 1,
coords = NULL,
useDegree = FALSE,
momentum = NULL,
seed = NULL,
threads = NULL,
verbose = getOption("verbose", TRUE)
)
A dense [N,D] matrix of the coordinates projecting the w_ij matrix into the lower-dimensional space.
A symmetric sparse matrix of edge weights, in column-compressed (C-compressed) format, as created with the Matrix package.
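As an illustration of the kind of input described here, the following is a minimal sketch that builds a small symmetric sparse weight matrix with the Matrix package. The toy edge list and the variable names (`upper`, `wij`) are hypothetical; in practice the matrix would normally come from an earlier step of the largeVis pipeline rather than being built by hand.

```r
library(Matrix)

# Hypothetical toy graph: 4 nodes, each edge listed once as (i, j, weight).
i <- c(1, 1, 2, 3)
j <- c(2, 3, 3, 4)
x <- c(0.9, 0.4, 0.7, 0.5)

# Build an upper-triangular sparse matrix, then add its transpose so that
# wij[i, j] == wij[j, i] for every edge.
upper <- sparseMatrix(i = i, j = j, x = x, dims = c(4, 4))
wij <- upper + t(upper)

inherits(wij, "CsparseMatrix")  # column-compressed storage
isSymmetric(wij)                # TRUE
```

Adding the transpose is one simple way to symmetrize an edge list in which each edge appears only once; it would double the weights if both directions were already present.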
numeric Number of dimensions for the projection space (default=2).
The number of edges to process during SGD (default=NULL). Defaults to a value set based on the size of the dataset. If the parameter given is between 0 and 1, the default value will be multiplied by the parameter.
numeric Number of negative edges to sample for each positive edge (default=5).
numeric Strength of the force pushing non-neighbor nodes apart (default=7).
numeric Hyperparameter used in the default distance function, \(1 / (1 + \alpha \cdot ||y_i - y_j||^2)\) (default=1). The function relates the distance between points in the low-dimensional projection to the likelihood that the two points are nearest neighbors. Increasing \(\alpha\) tends to push nodes and their neighbors closer together; decreasing \(\alpha\) produces a broader distribution. Setting \(\alpha\) to zero enables the alternative distance function. Values of \(\alpha\) below zero are meaningless.
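To make the effect of \(\alpha\) concrete, here is a small sketch that evaluates the default distance function at a few distances. The helper name `f_default` is hypothetical and is not part of the package; it only restates the formula given above.

```r
# Default distance function from the description above:
# f(d) = 1 / (1 + alpha * d^2), where d = ||y_i - y_j||.
f_default <- function(d, alpha = 1) {
  stopifnot(alpha > 0)
  1 / (1 + alpha * d^2)
}

d <- c(0, 0.5, 1, 2)
p_alpha1 <- f_default(d, alpha = 1)  # 1.0, 0.8, 0.5, 0.2
p_alpha4 <- f_default(d, alpha = 4)  # 1.0, 0.5, 0.2, 1/17
```

With the larger \(\alpha\), the probability falls off faster with distance, so two points must sit closer together to be assigned the same neighbor probability.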
numeric Initial learning rate (default=1).
An initialized coordinate matrix (default=NULL).
boolean Whether to use vertex degree to determine weights in negative sampling (default=FALSE). If TRUE, the weights are determined by vertex degree; if FALSE, by the sum of the weights of the vertex's edges. See Notes.
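The two quantities this option chooses between can be computed directly from the weight matrix. The sketch below uses a hypothetical toy matrix; it only shows how degree and weight sum differ, and makes no claim about how the sampler transforms either value internally.

```r
library(Matrix)

# Hypothetical symmetric weight matrix for 4 nodes.
upper <- sparseMatrix(i = c(1, 1, 2, 3), j = c(2, 3, 3, 4),
                      x = c(0.9, 0.4, 0.7, 0.5), dims = c(4, 4))
wij <- upper + t(upper)

# Vertex degree: the number of non-zero entries in each column.
vertex_degree <- Matrix::colSums(wij != 0)

# Sum of the weights of each vertex's edges.
weight_sum <- Matrix::colSums(wij)
```

In this example nodes 2 and 3 have the same weight sum (1.6) but different degrees (2 and 3), so the two settings would rank them differently.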
If not NULL (NULL is the default), SGD with momentum is used, with this value as the multiplier, which must be between 0 and 1. Note that momentum can drastically reduce training time, at the cost of additional memory.
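For readers unfamiliar with the technique, the following is a generic sketch of a classical momentum update. It is an illustration only and is not taken from the largeVis source; the function name `sgd_momentum_step`, the constant gradient, and the exact form of the update are assumptions.

```r
# Generic SGD-with-momentum step (an illustration, not largeVis internals).
sgd_momentum_step <- function(y, velocity, grad, rho = 1, momentum = 0.5) {
  stopifnot(momentum >= 0, momentum <= 1)
  # Blend the previous update with the current scaled gradient.
  velocity <- momentum * velocity + rho * grad
  # Move the coordinates along the accumulated direction.
  list(y = y + velocity, velocity = velocity)
}

state <- list(y = c(0, 0), velocity = c(0, 0))
grad <- c(1, -2)  # hypothetical constant gradient
for (step in 1:2) {
  state <- sgd_momentum_step(state$y, state$velocity, grad,
                             rho = 0.1, momentum = 0.5)
}
state$y  # 0.25 -0.50
```

A scheme like this has to keep one velocity value per coordinate between steps, which is the kind of extra state that a memory cost would come from.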
numeric Random seed to be passed to the C++ functions (default=NULL). If NULL, sampled from hardware entropy pool.
Note that if the seed is not NULL (NULL is the default), the maximum number of threads will be set to 1 in phases of the algorithm that would otherwise be non-deterministic.
numeric The maximum number of threads to spawn (default=NULL). Determined automatically if NULL.
boolean Verbosity (default=getOption("verbose", TRUE))
The algorithm attempts to estimate a dim-dimensional embedding using stochastic gradient descent and negative sampling.
The objective function is: $$ O = \sum_{(i,j)\in E} w_{ij} \left( \log f(||y_i - y_j||) + \sum_{k=1}^{M} E_{j_k \sim P_{n}(j)} \gamma \log(1 - f(||y_i - y_{j_k}||)) \right) $$ where \(f()\) is a probabilistic function relating the distance between two points in the low-dimensional projection space to the probability that they are nearest neighbors.
The default probabilistic function is \(1 / (1 + \alpha \cdot ||x||^2)\). If \(\alpha\) is set to zero, an alternative probabilistic function, \(1 / (1 + \exp(x^2))\), will be used instead.
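The contribution of a single positive edge to the objective can be sketched as follows. This is a hedged illustration under stated assumptions: the names `f` and `edge_objective` are hypothetical, the expectation over the noise distribution is replaced by a sum over M already-drawn negative samples, and the coordinates are made up.

```r
# f() as described above: the default form when alpha > 0,
# the alternative form when alpha == 0.
f <- function(d, alpha = 1) {
  if (alpha == 0) 1 / (1 + exp(d^2)) else 1 / (1 + alpha * d^2)
}

# Contribution of one positive edge (i, j) with weight w, given the
# coordinates of M sampled negative nodes (one per row of y_neg).
edge_objective <- function(y_i, y_j, y_neg, w, gamma = 7, alpha = 1) {
  d_pos <- sqrt(sum((y_i - y_j)^2))
  d_neg <- sqrt(rowSums(sweep(y_neg, 2, y_i)^2))
  w * (log(f(d_pos, alpha)) + gamma * sum(log(1 - f(d_neg, alpha))))
}

y_i <- c(0, 0)
y_j <- c(1, 0)
y_neg <- rbind(c(2, 0), c(0, 3))  # hypothetical negative samples (M = 2)
edge_value <- edge_objective(y_i, y_j, y_neg, w = 1)
```

The first term rewards placing neighbors close together, while the second term, scaled by `gamma`, rewards placing the sampled non-neighbors far from node i.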
Note that the input matrix should be symmetric. If any columns in the matrix are empty, the function will fail.
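Both conditions can be checked before calling the function. The helper below, `check_wij`, is hypothetical and not part of largeVis; it is a minimal sketch that uses only the Matrix package.

```r
library(Matrix)

# Hypothetical helper (not part of largeVis): stop early on invalid input.
check_wij <- function(wij) {
  if (!isSymmetric(wij)) stop("wij is not symmetric")
  empty <- which(Matrix::colSums(wij != 0) == 0)
  if (length(empty) > 0) stop("empty columns: ", paste(empty, collapse = ", "))
  invisible(TRUE)
}

upper <- sparseMatrix(i = c(1, 2), j = c(2, 3), x = c(0.9, 0.7), dims = c(4, 4))
bad <- upper + t(upper)  # node 4 has no edges, so column 4 is empty
result <- tryCatch(check_wij(bad), error = function(e) conditionMessage(e))
result  # "empty columns: 4"
```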
if (FALSE) {
library(largeVis)
data(CO2)
CO2$Plant <- as.integer(CO2$Plant)
CO2$Type <- as.integer(CO2$Type)
CO2$Treatment <- as.integer(CO2$Treatment)
co <- scale(as.matrix(CO2))
# Very small datasets often produce a warning regarding the alias table. This is safely ignored.
suppressWarnings(vis <- largeVis(t(co), K = 20, sgd_batches = 1, threads = 2))
suppressWarnings(coords <- projectKNNs(vis$wij, threads = 2))
plot(t(coords))
}