This is the main function of clugenr, and possibly the only function most users will need.
clugen(
  num_dims,
  num_clusters,
  num_points,
  direction,
  angle_disp,
  cluster_sep,
  llength,
  llength_disp,
  lateral_disp,
  allow_empty = FALSE,
  cluster_offset = NA,
  proj_dist_fn = "norm",
  point_dist_fn = "n-1",
  clusizes_fn = clusizes,
  clucenters_fn = clucenters,
  llengths_fn = llengths,
  angle_deltas_fn = angle_deltas,
  seed = NA
)
A named list with the following elements:
points: A num_points x num_dims matrix with the generated points for
all clusters.
clusters: A num_points factor vector indicating which cluster
each point in points belongs to.
projections: A num_points x num_dims matrix with the point
projections on the cluster-supporting lines.
sizes: A num_clusters x 1 vector with the number of points in
each cluster.
centers: A num_clusters x num_dims matrix with the
coordinates of the cluster centers.
directions: A num_clusters x num_dims matrix with the final
direction of each cluster-supporting line.
angles: A num_clusters x 1 vector with the angles between the
cluster-supporting lines and the main direction.
lengths: A num_clusters x 1 vector with the lengths of the
cluster-supporting lines.
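As a quick orientation to the returned list, the sketch below generates a small data set and inspects its elements. The parameter values are arbitrary and chosen only for illustration, and the call is guarded so that the snippet is skipped when clugenr is not installed.

```r
# Illustrative values only; skipped if clugenr is not installed
if (requireNamespace("clugenr", quietly = TRUE)) {
  res <- clugenr::clugen(
    num_dims = 2, num_clusters = 4, num_points = 200,
    direction = c(1, 0), angle_disp = 0.4, cluster_sep = c(10, 10),
    llength = 5, llength_disp = 1, lateral_disp = 0.5, seed = 123
  )

  str(res, max.level = 1)      # overview of the elements in the returned list
  print(dim(res$points))       # num_points x num_dims
  print(table(res$clusters))   # number of points assigned to each cluster
  print(dim(res$centers))      # num_clusters x num_dims
}
```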
Number of dimensions.
Number of clusters to generate.
Total number of points to generate.
Average direction of the cluster-supporting lines. Can be
a vector of length num_dims (same direction for all clusters) or a
matrix of size num_clusters x num_dims (one direction per cluster).
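To illustrate the matrix form of direction, the following sketch builds one direction per cluster, with rows as clusters and columns as dimensions. The specific directions and the remaining parameter values are made up for the example.

```r
# One direction per cluster: rows are clusters, columns are dimensions
dirs <- rbind(
  c(1, 0),  # cluster 1: horizontal
  c(0, 1),  # cluster 2: vertical
  c(1, 1)   # cluster 3: diagonal
)

# Skipped if clugenr is not installed
if (requireNamespace("clugenr", quietly = TRUE)) {
  res <- clugenr::clugen(
    num_dims = 2, num_clusters = 3, num_points = 300,
    direction = dirs, angle_disp = 0.1, cluster_sep = c(10, 10),
    llength = 6, llength_disp = 0.5, lateral_disp = 0.3, seed = 1
  )
  print(res$directions)  # final direction of each cluster-supporting line
}
```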
Angle dispersion of cluster-supporting lines (radians).
Average cluster separation in each dimension (vector of
length num_dims).
Average length of cluster-supporting lines.
Length dispersion of cluster-supporting lines.
Cluster lateral dispersion, i.e., dispersion of points from their projection on the cluster-supporting line.
Allow empty clusters? FALSE by default.
Offset to add to all cluster centers (vector of length
num_dims). By default there will be no offset.
Distribution of point projections along cluster-supporting lines, with three possible values:
"norm" (default): Distribute point projections along lines using a normal
distribution (μ = line_center, σ = llength/6).
"unif": Distribute points uniformly along the line.
User-defined function, which accepts two parameters, line length (double)
and number of points (integer), and returns a vector containing the
distance of each point projection to the center of the line. For example,
the "norm" option roughly corresponds to
function(l, n) stats::rnorm(n, sd = l / 6).
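A user-defined projection distribution could look like the sketch below. The function name proj_beta and the choice of a rescaled Beta(2, 2) distribution are hypothetical; the only requirement taken from the description above is that the function accepts a line length and a number of points and returns one distance per point.

```r
# Hypothetical custom projection distribution: a symmetric Beta(2, 2),
# rescaled so that all projections fall within [-len / 2, len / 2]
proj_beta <- function(len, n) {
  len * (stats::rbeta(n, shape1 = 2, shape2 = 2) - 0.5)
}

# Standalone check: five distances to the center of a line of length 10
set.seed(42)
d <- proj_beta(10, 5)

# Passing the function to clugen(); skipped if clugenr is not installed
if (requireNamespace("clugenr", quietly = TRUE)) {
  res <- clugenr::clugen(
    num_dims = 2, num_clusters = 3, num_points = 300,
    direction = c(1, 1), angle_disp = 0.3, cluster_sep = c(10, 10),
    llength = 6, llength_disp = 0.5, lateral_disp = 0.3,
    proj_dist_fn = proj_beta, seed = 1
  )
}
```

Unlike the "norm" option, this choice keeps every projection strictly within the extent of the line.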
Controls how the final points are created from their projections on the cluster-supporting lines, with three possible values:
"n-1" (default): Final points are placed on a hyperplane orthogonal to
the cluster-supporting line, centered at each point's projection, using the
normal distribution (μ = 0, σ = lateral_disp). This is done by the clupoints_n_1
function.
"n": Final points are placed around their projection on the
cluster-supporting line using the normal distribution (μ = 0,
σ = lateral_disp). This is done by the clupoints_n
function.
User-defined function: The user can specify a custom point placement strategy by passing a function with the same signature as clupoints_n_1 and clupoints_n.
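As a rough sketch of a custom point placement strategy, the function below adds uniform noise around each projection. The name clupoints_box is hypothetical, and the argument list (projs, lat_disp, line_len, clu_dir, clu_ctr) is an assumption about the signature of clupoints_n_1 and clupoints_n that should be checked against their documentation.

```r
# Hypothetical strategy: uniform noise in a box of half-width lat_disp
# around each projection. The argument names are ASSUMED to match
# clupoints_n_1 and clupoints_n.
clupoints_box <- function(projs, lat_disp, line_len, clu_dir, clu_ctr) {
  noise <- stats::runif(length(projs), min = -lat_disp, max = lat_disp)
  projs + matrix(noise, nrow = nrow(projs), ncol = ncol(projs))
}

# Standalone check with three made-up projections in 2D
projs <- rbind(c(0, 0), c(1, 1), c(2, 2))
set.seed(7)
pts <- clupoints_box(
  projs, lat_disp = 0.5, line_len = 3,
  clu_dir = c(1, 1) / sqrt(2), clu_ctr = c(1, 1)
)
```

If the assumed signature holds, the function would be passed as point_dist_fn = clupoints_box.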
Distribution of cluster sizes. By default, cluster sizes
are determined by the clusizes function, which uses the normal distribution
(μ = num_points/num_clusters, σ = μ/3),
and ensures that the final cluster sizes add up to num_points. This
parameter allows the user to specify a custom function for this purpose,
which must follow the signature of clusizes. Note that custom functions are not
required to strictly obey the num_points parameter. Alternatively, the user
can specify a vector of cluster sizes directly.
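A custom cluster size function might look like the following sketch, which splits the points as evenly as possible. The name equal_sizes is hypothetical, and the argument list (num_clusters, num_points, allow_empty) is an assumption about the signature of clusizes.

```r
# Hypothetical size function: near-equal cluster sizes.
# The argument names are ASSUMED to match clusizes; allow_empty is
# accepted for compatibility but not used here.
equal_sizes <- function(num_clusters, num_points, allow_empty) {
  base <- num_points %/% num_clusters   # points every cluster receives
  extra <- num_points %% num_clusters   # leftover points, one each to the first clusters
  base + as.integer(seq_len(num_clusters) <= extra)
}

sizes <- equal_sizes(4, 10, FALSE)  # 3 3 2 2

# Passing the function to clugen(); skipped if clugenr is not installed
if (requireNamespace("clugenr", quietly = TRUE)) {
  res <- clugenr::clugen(
    num_dims = 2, num_clusters = 4, num_points = 200,
    direction = c(1, 0), angle_disp = 0.4, cluster_sep = c(10, 10),
    llength = 5, llength_disp = 1, lateral_disp = 0.5,
    clusizes_fn = equal_sizes, seed = 123
  )
  print(res$sizes)
}
```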
Distribution of cluster centers. By default, cluster
centers are determined by the clucenters function, which uses the uniform
distribution, and takes into account the num_clusters and cluster_sep
parameters for generating well-distributed cluster centers. This parameter
allows the user to specify a custom function for this purpose, which must
follow the signature of clucenters. Alternatively, the user can specify a matrix
of size num_clusters x num_dims with the exact cluster centers.
Distribution of line lengths. By default, the lengths of
cluster-supporting lines are determined by the llengths function, which
uses the folded normal distribution (μ = llength,
σ = llength_disp). This parameter allows the user to
specify a custom function for this purpose, which must follow the signature
of llengths. Alternatively, the user can specify a vector of line lengths
directly.
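For intuition about the folded normal distribution mentioned above, the sketch below takes the absolute value of normal draws, which guarantees non-negative lengths. It is not necessarily how llengths is implemented; the name llengths_folded is hypothetical and the argument list (num_clusters, llength, llength_disp) is an assumption about the signature of llengths.

```r
# Rough sketch of a folded normal: absolute value of a normal draw.
# The argument names are ASSUMED to match llengths.
llengths_folded <- function(num_clusters, llength, llength_disp) {
  abs(stats::rnorm(num_clusters, mean = llength, sd = llength_disp))
}

# Five line lengths centered around 8 with dispersion 1.5
set.seed(3)
lens <- llengths_folded(5, llength = 8, llength_disp = 1.5)
```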
Distribution of line angle differences with respect to
direction. By default, the angles between the main direction of each
cluster and the final directions of their cluster-supporting lines are
determined by the angle_deltas function, which uses the wrapped normal
distribution (μ = 0, σ = angle_disp) with
support in the interval [-π/2, π/2]. This
parameter allows the user to specify a custom function for this purpose,
which must follow the signature of angle_deltas. Alternatively, the user can
specify a vector of angle deltas directly.
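As one possible alternative to the default, the sketch below draws angle deltas uniformly and keeps them inside [-π/2, π/2]. The name angle_deltas_unif is hypothetical, and the argument list (num_clusters, angle_disp) is an assumption about the signature of angle_deltas.

```r
# Hypothetical alternative: uniform angle deltas within +/- angle_disp,
# capped so that no delta leaves the interval [-pi/2, pi/2].
# The argument names are ASSUMED to match angle_deltas.
angle_deltas_unif <- function(num_clusters, angle_disp) {
  half_width <- min(abs(angle_disp), pi / 2)
  stats::runif(num_clusters, min = -half_width, max = half_width)
}

# Six angle deltas (radians) for an angle dispersion of 0.5
set.seed(11)
deltas <- angle_deltas_unif(6, angle_disp = 0.5)
```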
An integer used to initialize the PRNG, allowing for reproducible
results. If specified, seed is simply passed to set.seed.
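Because seed is passed to set.seed, reproducibility works as it does elsewhere in R. The sketch below first shows the effect with base R alone, then repeats the idea with clugen using arbitrary parameter values; the helper name gen is hypothetical.

```r
# Base R: the same seed yields the same random draws
set.seed(2024)
a <- stats::rnorm(3)
set.seed(2024)
b <- stats::rnorm(3)
same_draws <- identical(a, b)

# Same idea with clugen(); skipped if clugenr is not installed
if (requireNamespace("clugenr", quietly = TRUE)) {
  gen <- function(s) {
    clugenr::clugen(2, 3, 150, c(1, 1), 0.3, c(8, 8), 4, 0.5, 0.4, seed = s)
  }
  x1 <- gen(2024)
  x2 <- gen(2024)
  print(identical(x1$points, x2$points))  # compares the two generated point sets
}
```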
If a custom function was given in the clusizes_fn parameter, the actual
number of generated points (the num_points dimension of the returned
elements) may differ from the value specified in the num_points parameter.
The terms "average" and "dispersion" refer to measures of central tendency and statistical dispersion, respectively. Their exact meaning depends on the optional arguments.
library(clugenr)
# 2D example
x <- clugen(2, 5, 1000, c(1, 3), 0.5, c(10, 10), 8, 1.5, 2)
graphics::plot(x$points, col = x$clusters, xlab = "x", ylab = "y", asp = 1)
# 3D example
x <- clugen(3, 5, 1000, c(2, 3, 4), 0.5, c(15, 13, 14), 7, 1, 2)