kK-NN: Adaptive k-Nearest Neighbor Classifier
Project Overview
The LCCkNN R package implements an adaptive k-nearest neighbor (k-NN) classifier based on local curvature estimation. Unlike the traditional k-NN algorithm, which uses a fixed number of neighbors ($k$) for all data points, the kK-NN algorithm dynamically adjusts the neighborhood size for each sample.
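For contrast with the adaptive scheme, the fixed-$k$ baseline can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python (the package itself is in R), not the package's API:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k):
    """Classic fixed-k k-NN: majority vote among the k nearest training points."""
    # sort training indices by Euclidean distance to the query point x
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], x))
    # the same k is used for every query, regardless of local geometry
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

The single global `k` is exactly what the adaptive method replaces with a per-point neighborhood size.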
The core idea is that points with low curvature can be assigned larger neighborhoods, since the tangent space approximates the underlying data shape well there. Conversely, points with high curvature should have smaller neighborhoods, because the tangent space is only a loose approximation. This adaptive strategy helps avoid both underfitting and overfitting and improves classification performance, especially when the number of training samples is limited. The algorithm's key components include:
- Curvature Estimation: The local Gaussian curvature is estimated by approximating the local shape operator of the data manifold in terms of the local covariance matrix and the local Hessian matrix. This approximation improves robustness to noise and outliers.
- Adaptive Neighborhood Sizing: The curvatures are quantized into ten different scores. Based on these scores, the adaptive neighborhood adjustment is performed by pruning the edges of the k-NN graph, reducing the neighborhood size for high-curvature points and retaining a larger neighborhood for low-curvature points.
- Improved Performance: Results on many real-world datasets indicate that the kK-NN algorithm consistently yields higher balanced accuracy than both the established k-NN method and a competing adaptive k-NN algorithm.
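The pipeline above can be sketched end to end. The snippet below is a simplified illustration in Python, not the package's R implementation: the curvature proxy here uses only the spectrum of the local covariance matrix (the actual method also involves a local Hessian term), and the decile quantization is rank-based, which is an assumption:

```python
import numpy as np

def adaptive_neighborhoods(X, k_max):
    """Sketch of curvature-guided pruning of a k-NN graph (simplified)."""
    n = len(X)
    # build the k_max-NN graph from pairwise distances (skip self at column 0)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nbrs = np.argsort(D, axis=1)[:, 1:k_max + 1]
    # curvature proxy: share of local covariance energy outside the
    # dominant direction -- near zero on flat patches, larger on curved ones
    curv = np.empty(n)
    for i in range(n):
        local = X[nbrs[i]] - X[nbrs[i]].mean(axis=0)
        eig = np.linalg.eigvalsh(local.T @ local)   # ascending eigenvalues
        curv[i] = eig[:-1].sum() / max(eig.sum(), 1e-12)
    # quantize curvatures into scores 0..9 (deciles of the curvature ranks)
    ranks = curv.argsort().argsort()                # 0 = flattest point
    scores = (ranks * 10 // n).astype(int)
    # prune edges: high-curvature points keep fewer neighbors
    return [nbrs[i][:max(1, k_max - scores[i])] for i in range(n)]
```

Each point thus ends up with its own neighborhood size between 1 and `k_max`, and classification proceeds by majority vote over that pruned neighborhood.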
Key Feature: Dual Quantization Support
This package offers two distinct quantization methods for adaptive neighborhood sizing:
Paper Method: The default approach, which quantizes curvatures into ten different scores (from 0 to 9), as described in the original paper.
Log2n Method: A slightly modified approach, also developed by the same author, that quantizes curvatures into $k$ scores, where $k = \log_2 n$ and $n$ is the number of training samples.
This dual-method support allows users to test and compare both quantization strategies within a single package.
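The difference between the two schemes comes down to the number of quantization levels. The sketch below illustrates both in Python using rank-based binning, which is an assumption about how the quantization is done (the package may bin by curvature value instead):

```python
import math

def quantize(curvatures, levels):
    """Rank-based quantization of curvatures into the given number of levels."""
    n = len(curvatures)
    order = sorted(range(n), key=curvatures.__getitem__)
    rank = {idx: r for r, idx in enumerate(order)}
    return [rank[i] * levels // n for i in range(n)]

def paper_scores(curvatures):
    """Paper method: always ten scores, 0 to 9."""
    return quantize(curvatures, 10)

def log2n_scores(curvatures):
    """Log2n method: floor(log2 n) scores, adapting to the sample size."""
    n = len(curvatures)
    return quantize(curvatures, max(1, int(math.log2(n))))
```

With the log2n method the number of scores grows with the training set, so larger datasets get a finer-grained neighborhood adjustment, while the paper method keeps a fixed ten-level resolution.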