clusterMatch: clusterMatch

Description

Creates appropriately sized clusters for matching, using either alphabetical or word embedding clustering. With word embedding clustering, the function first builds a word embedding from the provided vectors and runs PCA on the resulting matrix. It then keeps the first k dimensions (where k is provided by the user) and runs k-means on that reduced matrix to obtain the clusters.
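The word embedding path described above can be sketched in base R. This is only an illustration under stated assumptions, not the function's actual implementation: the input vectors are made up, the embedding is assumed to be a simple letter-count matrix (the helper letter_counts is hypothetical), and k and the number of centers are chosen arbitrarily.

```r
# Hypothetical inputs; not part of any package's sample data.
vecA <- c("anna", "anne", "bob", "bobby", "carl", "carla")
vecB <- c("ana", "bob", "rob", "carl", "karl", "annie")

# Assumed embedding: one row per string, one column per letter,
# entries are letter counts. The real function may build its
# embedding differently.
letter_counts <- function(x) {
  t(vapply(
    strsplit(tolower(x), ""),
    function(ch) as.numeric(table(factor(ch, levels = letters))),
    numeric(26)
  ))
}

# Embed both vectors jointly so they share one cluster space.
strings <- c(vecA, vecB)
emb <- letter_counts(strings)

# Drop letters that never occur so PCA sees no constant columns.
emb <- emb[, colSums(emb) > 0, drop = FALSE]

# PCA on the embedding matrix, then keep the first k dimensions.
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
k <- 2
scores <- pca$x[, seq_len(k), drop = FALSE]

# k-means on the reduced matrix; the seed keeps the sketch reproducible.
set.seed(1)
km <- kmeans(scores, centers = 3, iter.max = 100, nstart = 5)

# Split the joint assignments back into the two datasets.
clusterA <- km$cluster[seq_along(vecA)]
clusterB <- km$cluster[length(vecA) + seq_along(vecB)]
```

Clustering the two vectors jointly means that a given cluster label refers to the same region of the reduced space in both datasets, which is what makes the labels usable for pairing up subsets later.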

Usage

clusterMatch(vecA, vecB, nclusters, max.n, word.embed, min.var,
weighted.kmeans, iter.max)

Arguments

vecA

The character vector from dataset A

vecB

The character vector from dataset B

nclusters

The number of clusters to create from the provided data. Specify either nclusters or max.n; the other must be NULL.

max.n

The maximum number of observations from either dataset A or dataset B allowed in the largest cluster. Specify either nclusters or max.n; the other must be NULL.

word.embed

Whether to use word embedding clustering. Default is FALSE.

min.var

The minimum proportion of explained variance (between 0 and 1) that a PCA dimension must provide to be included in k-means clustering when using word embedding. Default is 0.20.

weighted.kmeans

Whether to weight the k-means algorithm features by the explained variance of the included principal component when using word embedding clustering. Default is FALSE.

iter.max

Maximum number of iterations for the k-means algorithm.
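To make min.var, weighted.kmeans and iter.max more concrete, the following base R sketch shows one plausible reading of how they interact. It assumes that min.var is compared against each component's share of total variance and that weighting means multiplying each retained component by that share; the function itself may differ in detail. The matrix emb is simulated and purely hypothetical.

```r
# Hypothetical numeric matrix standing in for a word embedding.
set.seed(42)
emb <- cbind(
  rnorm(50, sd = 5),
  rnorm(50, sd = 4),
  rnorm(50, sd = 0.5),
  rnorm(50, sd = 0.2)
)

pca <- prcomp(emb, center = TRUE)

# Proportion of total variance explained by each principal component.
prop.var <- pca$sdev^2 / sum(pca$sdev^2)

# Keep only the components that meet the min.var threshold.
min.var <- 0.20
keep <- which(prop.var >= min.var)
scores <- pca$x[, keep, drop = FALSE]

# Optional weighting: scale each retained component by its share of
# explained variance, so dominant components count more in k-means.
weighted.kmeans <- TRUE
if (weighted.kmeans) {
  scores <- sweep(scores, 2, prop.var[keep], `*`)
}

# iter.max caps the number of k-means iterations.
km <- kmeans(scores, centers = 3, iter.max = 100, nstart = 5)
```

Raising min.var keeps fewer dimensions, which speeds up k-means but discards more of the variation between strings.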

Value

clusterMatch returns a list containing the following elements:

clusterA

The cluster assignments for dataset A

clusterB

The cluster assignments for dataset B

n.clusters

The number of clusters created

kmeans

The k-means object output.

pca

The PCA object output.

dims.pca

The number of dimensions from PCA used for the k-means clustering.

Examples

# NOT RUN {
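library(fastLink)  # assumed: the package that provides clusterMatch and samplematch
data(samplematch)  # loads the example data frames dfA and dfB
cl <- clusterMatch(dfA$firstname, dfB$firstname, nclusters = 3)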
# }
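Once cluster assignments are available, they can be used to split both datasets into per-cluster subsets for matching. The sketch below does not call clusterMatch; it uses small hypothetical data frames and a hand-written list that mimics the clusterA, clusterB and n.clusters elements, assuming the assignments are vectors aligned with the rows of each dataset.

```r
# Hypothetical data and assignments, standing in for the two datasets
# and for the clusterA / clusterB / n.clusters elements of the result.
dfA <- data.frame(firstname = c("anna", "bob", "carl", "anne"))
dfB <- data.frame(firstname = c("ana", "rob", "karl"))
cl <- list(clusterA = c(1, 2, 3, 1), clusterB = c(1, 2, 3), n.clusters = 3)

# Build one pair of subsets per cluster.
blocks <- lapply(seq_len(cl$n.clusters), function(j) {
  list(
    A = dfA[cl$clusterA == j, , drop = FALSE],
    B = dfB[cl$clusterB == j, , drop = FALSE]
  )
})

# Number of rows from dataset A and dataset B in each cluster.
sizes <- t(vapply(
  blocks,
  function(b) c(A = nrow(b$A), B = nrow(b$B)),
  numeric(2)
))
```

Inspecting sizes is a quick way to check whether any cluster is still too large to match comfortably, in which case a larger nclusters or a smaller max.n may be worth trying.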