split_text splits texts into segments of at most a maximum number of bytes.

Usage

split_text(text, max_size_bytes = 29000, tokenize = "sentences")
Value

Returns a tibble with the following columns:

text_id       position of the text in the character vector.
segment_id    ID of a text segment.
segment_text  text segment that is smaller than max_size_bytes.
Arguments

text            character vector to be split.
max_size_bytes  maximum size of a single text segment in bytes.
tokenize        level of tokenization. Either "sentences" or "words".
Details

The function uses tokenizers::tokenize_sentences to split texts.
Examples

if (FALSE) {
  # Split long text
  text <- paste0(rep("This is a very long text.", 10000), collapse = " ")
  split_text(text)
}
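To illustrate the idea behind the function, the sketch below shows one way sentence-level splitting under a byte limit can work: tokenize into sentences, then greedily pack sentences into segments no larger than max_size_bytes. This is an illustrative base-R sketch, not the package's implementation; the function name split_text_sketch and the naive regex-based sentence split are assumptions for the example (the real function uses tokenizers::tokenize_sentences).

```r
# Illustrative sketch only: greedy byte-limited packing of sentences.
# Not the package's implementation.
split_text_sketch <- function(text, max_size_bytes = 29000) {
  # Naive sentence split on whitespace following ".", "!" or "?"
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  segments <- character(0)
  current <- ""
  for (s in sentences) {
    candidate <- if (nzchar(current)) paste(current, s) else s
    if (nchar(candidate, type = "bytes") > max_size_bytes && nzchar(current)) {
      # Adding this sentence would exceed the limit: close the segment
      segments <- c(segments, current)
      current <- s
    } else {
      current <- candidate
    }
  }
  if (nzchar(current)) segments <- c(segments, current)
  segments
}

# Each sentence is 26 bytes, so with a 100-byte limit at most
# three sentences fit per segment.
txt <- paste0(rep("This is a very long text.", 10L), collapse = " ")
parts <- split_text_sketch(txt, max_size_bytes = 100)
```

Every element of parts stays under the byte limit, and pasting the segments back together with single spaces reproduces the input.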