unigram_sequence_segmentation: Segmenting sequences with unigrams

Description

unigram_sequence_segmentation segments input sequences into candidate segmented text using the unigram sequence segmentation approach.

Usage

unigram_sequence_segmentation(
  sequences,
  unigram_dictionary = NUSS::base_dictionary,
  retrieve = "most-scored",
  simplify = TRUE,
  omit_zero = TRUE,
  score_formula = "points / words.number ^ 2"
)

Value

The output is always a data.frame. If retrieve='all' is used, the result includes all possible segmentations of the given sequence.

If retrieve='first-shortest' is used, the first of the shortest segmentations is returned (with respect to the order of the words' appearance in the dictionary, 1 row).

If retrieve='most-pointed' is used, the segmentation with the most total points is returned (1 row).

If retrieve='most-scored' is used, the segmentation with the highest score is returned, with the score calculated as \(score = points / words.number ^ 2\) (or as specified by the user).

The output is not in the input order. If the input order is needed, use lapply to segment the sequences one at a time.
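To illustrate how the default score formula differs from simply taking the most points, here is a minimal sketch in base R. The candidate segmentations and their points are made up for illustration, and evaluating the formula string with eval(parse()) is an assumption about how such a string could be applied, not a description of the package's internals.

```r
# Hypothetical candidate segmentations of 'thisisscience';
# the points are invented for illustration only.
candidates <- data.frame(
  segmented    = c("this is science", "this is sci en ce", "th is is science"),
  points       = c(12, 13, 10),
  words.number = c(3, 5, 4),
  stringsAsFactors = FALSE
)

# Evaluate the formula string using the data.frame columns as variables.
# '^' binds tighter than '/', so this is points / (words.number ^ 2).
score_formula <- "points / words.number ^ 2"
candidates$score <- eval(parse(text = score_formula), envir = candidates)

# Analogue of 'most-pointed': the row with the most total points.
most_pointed <- candidates[which.max(candidates$points), ]

# Analogue of 'most-scored': the row with the highest score.
best <- candidates[which.max(candidates$score), ]
print(best)
```

In this made-up data the five-word candidate has the most points, but dividing by the squared word count favours the three-word candidate.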

Arguments

sequences: character vector, sequence to be segmented (e.g., hashtag). Case-sensitive.
unigram_dictionary: data.frame, containing ids, words to search, words to use for segmentation, and their points. See the unigram_dictionary section.
retrieve: character vector of length 1, the type of the result data.frame to be returned: 'all', 'first-shortest', 'most-pointed' or 'most-scored'. See value section.
simplify: logical, whether adjacent numbers should be merged into one and underscores removed. See simplification section.
omit_zero: logical, whether words with 0 points should be omitted from the word count. See simplification section.
score_formula: character vector of length 1, with formula to calculate score.

unigram_dictionary

The dictionary has to be a data.frame with four named columns: 1) to_search, 2) to_replace, 3) id, 4) points.
'to_search' should be a column of type character, containing the unigram to look for. Letter case may be used.
'to_replace' should be a column of type character, containing the word that should be used for creating the segmentation vector if 'to_search' matches the text.
'id' should be a column of type numeric, containing the id of the unigram.
'points' should be a column of type numeric, containing the number of points for the word: the higher, the better. Unigrams with 0 points can be excluded from the word count with the omit_zero argument.
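As a rough sketch of the structure described above, a dictionary could be assembled by hand in base R. The words, ids, and points below are hypothetical and chosen only to show the four required columns and their types.

```r
# Hand-built dictionary with the four required columns.
# All values are hypothetical and serve only as an illustration.
custom_dictionary <- data.frame(
  to_search  = c("this", "is", "science", "the"),
  to_replace = c("this", "is", "science", "the"),
  id         = c(1, 2, 3, 4),
  points     = c(3, 2, 5, 0),   # 'the' gets 0 points in this made-up example
  stringsAsFactors = FALSE
)

# Check that the structure matches the documented requirements.
required <- c("to_search", "to_replace", "id", "points")
stopifnot(
  is.data.frame(custom_dictionary),
  all(required %in% names(custom_dictionary)),
  is.character(custom_dictionary$to_search),
  is.character(custom_dictionary$to_replace),
  is.numeric(custom_dictionary$id),
  is.numeric(custom_dictionary$points)
)
```

A data.frame of this shape would then be passed as the unigram_dictionary argument.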

Simplification

Two arguments control simplification:

simplify - removes spaces between numbers and removes underscores,
omit_zero - removes ids of 0-pointed unigrams, and omits them in the word count.
By default, the segmented sequence is simplified, and numbers and underscores are excluded from the word count used to compute the score, since they are necessary parts of the sequence and are treated as neutral.
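The effect of simplify on a segmented string can be pictured with a small base R sketch. The helper simplify_segmentation below is hypothetical and is not the package's internal code; it also assumes that an unsimplified segmentation separates every token, including single digits and underscores, with a space.

```r
# Hypothetical helper, for illustration only (not part of NUSS).
simplify_segmentation <- function(x) {
  # Merge adjacent numbers: drop a space that sits between two digits.
  x <- gsub("(?<=[0-9]) (?=[0-9])", "", x, perl = TRUE)
  # Remove underscores, then collapse any doubled spaces left behind.
  x <- gsub("_", "", x, fixed = TRUE)
  x <- gsub(" +", " ", x)
  trimws(x)
}

merged_numbers <- simplify_segmentation("this is science 2 0 2 4")
no_underscores <- simplify_segmentation("this _ is _ science")
print(merged_numbers)
print(no_underscores)
```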

Details

This function is not intended for segmenting long strings: 70 characters should be considered too long, and such a string may take hours to complete. A 15-character string takes about 0.02s, and a 30-character string about 0.03s.

Examples


# With custom dictionary
texts <- c("this is science",
           "science is #fascinatingthing",
           "this is a scientific approach",
           "science is everywhere",
           "the beauty of science")
udict <- unigram_dictionary(texts)
unigram_sequence_segmentation('thisisscience', udict)

# With built-in dictionary (English, only lowercase)
unigram_sequence_segmentation('thisisscience')
unigram_sequence_segmentation('thisisscience2024')
unigram_sequence_segmentation('thisisscience2024', simplify=FALSE, omit_zero=FALSE)
