kanjidist: Compute distance between two kanjivec objects based on hierarchical optimal transport

Description

The kanji distance is based on matching hierarchical component structures in a nesting-free way across all levels. The cost for matching individual components is a cost for registering the components (i.e. alligning there position, scale and aspect ratio) plus the (relative unbalanced) Wasserstein distance between the registered components.

Usage

kanjidist(
  k1,
  k2,
  compo_seg_depth1 = 3,
  compo_seg_depth2 = 3,
  p = 1,
  C = 0.2,
  approx = c("grid", "pc", "pcweighted"),
  type = c("rtt", "unbalanced", "balanced"),
  size = 48,
  lwd = 2.5,
  density = 30,
  verbose = FALSE,
  minor_warnings = TRUE
)

Value

The kanji distance, a non-negative number.

Arguments

k1, k2: two objects of type kanjivec.
compo_seg_depth1, compo_seg_depth2: two integers \(\geq 1\). Specifies for each kanji the deepest level included for component matching. If 1, only the kanji itself is used.
p: the order of the Wasserstein distance used for matching components. All distances and the penalty (if any) are taken to the p-th power (which is compensated by taking the p-th root after summation).
C: the penalty for extra mass if type is "rtt" or "unbalanced", i.e. we add C^p per unit of extra mass (before applying the p-th root).
approx: what kind of approximation is used for matching components. If this is "grid", a bitmap (raster image) is used, otherwise lines are approximated by more freely spaced points. For "pc" (point cloud) each point has the same weight and points are placed in a (more or less) equidistant way. For "pcweighted" points are further apart along straight lines and around the center of the Bezier curves that describe the strokes. The weights of the points are then (more or less) proportional to the amount of ink (stroke length) they represent.
type: the type of Wasserstein distance used for matching components based on the grid or point cloud approximation chosen. "unbalanced" means the weights (pixel values if approx = "grid) are interpreted as mass. The total masses in two components be very different. Extra mass can be disposed of at cost C^p per unit. "rtt" is computationally the same, but the final distance is divided by the maximum of the total ink (sum of weights) in each component to the 1/p. "balanced" means the weights are normalized so that both images have the same total mass 1. Everything has to be transported, i.e.\ disposal of mass is not allowed.
size: side length of the bitmaps used for matching components (if approx = "grid).
lwd: linewidth for drawing the components in these bitmaps (if approx = "grid).
density: approximate number of discretization points per unit line length (if approx != "grid)
verbose: logical. Whether to print detailed information on the cost for all pairs of components and the final matching.
minor_warnings: logical. Should minor_warnings be given. If FALSE, the warnings about substantial distances between bitmaps/pointclouds standing for the same component and the use of a workaround due to missing strokes in component decompositions are suppressed. While these warnings indicate to same extent that things are not going exactly as planned, they are usually not of interest if a larger number of kanji distances is computed and obscure the visibility of more important warnings (if any).

Warning

The interface and details of this function will change in the future. Currently only a minimal set of parameters can be passed. The other parameters are fixed exactly as in the "prototype distance" (4.1) of the reference below for better or worse.
There is a certain tendency that exact matches of components are rather strongly favored (if the KanjiVG elements agree this can overrule the unbalanced Wasserstein distance) and the penalties for translation/scaling/distortion of components are somewhat mild.
The computation time is rather high (depending on the settings and kanji up to several seconds per kanji pair). This can be alleviated somewhat by keeping the compo_seg_depth parameters at 3 or lower and setting size = 32 (which goes well with lwd=1.8).
Future versions will use a much faster line base optimal transport algorithm and further speed-ups.

Details

For the precise definition and details see the reference below. Parameter C corresponds to \(b/2^{1/p}\) in the paper.

References

Dominic Schuhmacher (2023).
Distance maps between Japanese kanji characters based on hierarchical optimal transport.
ArXiv, tools:::Rd_expr_doi("10.48550/arXiv.2304.02493")

Examples

Run this code

if (requireNamespace("ROI.plugin.glpk")) {
  kanjidist(fivebetas[[4]], fivebetas[[5]])
  kanjidist(fivebetas[[4]], fivebetas[[5]], verbose=TRUE)
  # faster and similar:
  kanjidist(fivebetas[[4]], fivebetas[[5]], compo_seg_depth1=2, compo_seg_depth2=2, 
            size=32, lwd=1.8, verbose=TRUE) 
  # slower and similar:
  kanjidist(fivebetas[[4]], fivebetas[[5]], size=64, lwd=3.2, verbose=TRUE)
}

Run the code above in your browser using DataLab