The kanji distance is based on matching hierarchical component structures in a nesting-free way across all levels. The cost for matching individual components is a cost for registering the components (i.e. alligning there position, scale and aspect ratio) plus the (relative unbalanced) Wasserstein distance between the registered components.
kanjidist(
k1,
k2,
compo_seg_depth1 = 3,
compo_seg_depth2 = 3,
p = 1,
C = 0.2,
approx = c("grid", "pc", "pcweighted"),
type = c("rtt", "unbalanced", "balanced"),
size = 48,
lwd = 2.5,
density = 30,
verbose = FALSE,
minor_warnings = TRUE
)
The kanji distance, a non-negative number.
two objects of type kanjivec
.
two integers \(\geq 1\). Specifies for each kanji the deepest level included for component matching. If 1, only the kanji itself is used.
the order of the Wasserstein distance used for matching components. All distances and
the penalty (if any) are taken
to the p
-th power (which is compensated by taking the p
-th root after summation).
the penalty for extra mass if type
is "rtt"
or "unbalanced"
, i.e.
we add C^p
per unit of extra mass (before applying the p
-th root).
what kind of approximation is used for matching components. If this is "grid"
,
a bitmap (raster image) is used, otherwise lines are approximated by more freely spaced points.
For "pc"
(point cloud) each point has the same weight and points are placed in a (more or
less) equidistant way. For "pcweighted"
points are further apart along straight lines and
around the center of the Bezier curves that describe the strokes. The weights of the points are
then (more or less) proportional to the amount of ink (stroke length) they represent.
the type of Wasserstein distance used for matching components based on the grid or
point cloud approximation chosen. "unbalanced"
means the weights (pixel values
if approx = "grid
) are interpreted as mass. The total masses in two components be very different.
Extra mass can be disposed of at cost C^p
per unit. "rtt"
is computationally the same,
but the final distance is divided by the maximum of the total ink (sum of weights) in each component
to the 1/p. "balanced"
means the weights are normalized so that both images have the same
total mass 1. Everything has to be transported, i.e.\ disposal of mass is not allowed.
side length of the bitmaps used for matching components (if approx = "grid
).
linewidth for drawing the components in these bitmaps (if approx = "grid
).
approximate number of discretization points per unit line length (if approx != "grid
)
logical. Whether to print detailed information on the cost for all pairs of components and the final matching.
logical. Should minor_warnings be given. If FALSE
, the warnings about substantial
distances between bitmaps/pointclouds standing for the same component and the use of a workaround due to
missing strokes in component decompositions are suppressed. While these warnings indicate to same extent
that things are not going exactly as planned, they are usually not of interest if a larger number of
kanji distances is computed and obscure the visibility of more important warnings (if any).
The interface and details of this function will change in the future. Currently only a minimal
set of parameters can be passed. The other parameters are fixed exactly as in the
"prototype distance" (4.1) of the reference below for better or worse.
There is a certain
tendency that exact matches of components are rather strongly favored (if the KanjiVG elements
agree this can overrule the unbalanced Wasserstein distance) and the penalties for
translation/scaling/distortion of components are somewhat mild.
The computation time is rather high (depending on the settings and kanji up to several
seconds per kanji pair). This can be alleviated somewhat by keeping the compo_seg_depth
parameters at 3 or lower and setting size = 32
(which goes well with lwd=1.8
).
Future versions will use a much faster line base optimal transport algorithm and further
speed-ups.
For the precise definition and details see the reference below. Parameter C
corresponds to \(b/2^{1/p}\) in the paper.
Dominic Schuhmacher (2023).
Distance maps between Japanese kanji characters based on hierarchical optimal transport.
ArXiv, tools:::Rd_expr_doi("10.48550/arXiv.2304.02493")
kanjidistmat
, kmatdist
if (requireNamespace("ROI.plugin.glpk")) {
kanjidist(fivebetas[[4]], fivebetas[[5]])
kanjidist(fivebetas[[4]], fivebetas[[5]], verbose=TRUE)
# faster and similar:
kanjidist(fivebetas[[4]], fivebetas[[5]], compo_seg_depth1=2, compo_seg_depth2=2,
size=32, lwd=1.8, verbose=TRUE)
# slower and similar:
kanjidist(fivebetas[[4]], fivebetas[[5]], size=64, lwd=3.2, verbose=TRUE)
}
Run the code above in your browser using DataLab