tokenize_text_by_language

Language-aware tokenizer used across embedders and keyword search

internal

A lightweight vector database for text retrieval in R with embedded
machine learning models and no external API (Application Programming
Interface) keys. Supports dense and hybrid search, optional HNSW
(Hierarchical Navigable Small World) approximate nearest-neighbor indexing,
faceted filters with ACL (Access Control List) metadata, command-line
tools, and a local dashboard built with 'shiny'. The HNSW method is
described by Malkov and Yashunin (2018) <doi:10.1109/TPAMI.2018.2889473>.

Kwadwo Daddy Nyame Owusu Boakye

VectrixDB

Lightweight Vector Database with Embedded Machine Learning Models

tokenize_text_by_language function


Arguments


<dl>

<dt>text</dt>
<dd>Input text</dd>


<dt>language</dt>
<dd>"en" or "ml"</dd>


<dt>remove_stopwords</dt>
<dd>Remove English stopwords when language is "en"</dd>

</dl>
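The arguments above can be illustrated with a minimal base R sketch. This is not the VectrixDB implementation: `sketch_tokenize`, its default values, the splitting rule, and the short stopword list are all assumptions made for illustration. Only the argument names and the "en"/"ml" language codes come from this page.

```r
# Hypothetical stand-in for illustration only; the package's internal
# tokenizer may lowercase, split, and filter differently.
sketch_tokenize <- function(text, language = "en", remove_stopwords = FALSE) {
  # Accept only the two language codes documented above.
  language <- match.arg(language, c("en", "ml"))

  # Lowercase, then split on runs of characters that are not Unicode
  # letters, combining marks, or digits (assumed rule).
  tokens <- strsplit(tolower(text), "[^\\p{L}\\p{M}\\p{N}]+", perl = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]

  # Stopwords are removed only for English, as the argument description states.
  if (language == "en" && remove_stopwords) {
    stopwords_en <- c("a", "an", "and", "the", "of", "in", "is", "to")  # assumed list
    tokens <- tokens[!tokens %in% stopwords_en]
  }
  tokens
}

sketch_tokenize("The index of the database", "en", remove_stopwords = TRUE)
# "index" "database"
```

In this sketch, `remove_stopwords` has no effect when `language` is "ml", which mirrors the argument description above.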
