token statistics
token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = ' ', n_gram_delimiter = "_")
--------------
path_2vector()
--------------
freq_distribution()
--------------
print_frequency(subset = NULL)
--------------
count_character()
--------------
print_count_character(number = NULL)
--------------
collocation_words()
--------------
print_collocations(word = NULL)
--------------
string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)
--------------
look_up_table(n_grams = NULL)
--------------
print_words_lookup_tbl(n_gram = NULL)
new()
token_stats$new(
  x_vec = NULL,
  path_2folder = NULL,
  path_2file = NULL,
  file_delimiter = "\n",
  n_gram_delimiter = "_"
)
x_vec: either NULL or a character string vector
path_2folder: either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter)
path_2file: either NULL or a valid path to a file
file_delimiter: either NULL or a character string specifying the file delimiter
n_gram_delimiter: either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function
path_2vector()
token_stats$path_2vector()

freq_distribution()
token_stats$freq_distribution()

print_frequency()
token_stats$print_frequency(subset = NULL)
subset: either NULL or a vector specifying the subset of data to keep (number of rows of the print_frequency function)

count_character()
token_stats$count_character()

print_count_character()
token_stats$print_count_character(number = NULL)
number: a numeric value for the print_count_character function. All words with number of characters equal to the number parameter will be returned.

collocation_words()
token_stats$collocation_words()

print_collocations()
token_stats$print_collocations(word = NULL)
word: a character string for the print_collocations and print_prob_next functions
string_dissimilarity_matrix()
token_stats$string_dissimilarity_matrix(
  dice_n_gram = 2,
  method = "dice",
  split_separator = " ",
  dice_thresh = 1,
  upper = TRUE,
  diagonal = TRUE,
  threads = 1
)
dice_n_gram: a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function
method: a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine.
split_separator: a character string specifying the string split separator if the method is cosine in the string_dissimilarity_matrix function. The cosine method expects sentences, so for the sentence "this_is_a_word_sentence" the split_separator should be "_"
dice_thresh: a float number used to threshold the data if the method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer the threshold is to 0.0, the more values of the dissimilarity matrix will take the value of 1.0.
upper: either TRUE or FALSE. If TRUE then both the lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NA's
diagonal: either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NA's
threads: a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function

look_up_table()
token_stats$look_up_table(n_grams = NULL)
n_grams: a numeric value specifying the n-grams in the look_up_table function

print_words_lookup_tbl()
token_stats$print_words_lookup_tbl(n_gram = NULL)
n_gram: a character string specifying the n-gram to use in the print_words_lookup_tbl function

clone()
The objects of this class are cloneable with this method.
token_stats$clone(deep = FALSE)
deep: whether to make a deep clone.
The path_2vector function returns the words of a folder or file as a vector (using the file_delimiter to read the data). A typical use case is reading a vocabulary from a text file.
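A minimal sketch of this use case, assuming a hypothetical file 'vocab.txt' that stores one word per line (so file_delimiter = '\n'):

library(textTinyR)
# 'vocab.txt' is a made-up example file with one word per line
fl <- token_stats$new(path_2file = 'vocab.txt', file_delimiter = '\n')
vocab <- fl$path_2vector()    # character vector holding the words of the file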
The freq_distribution function returns a named, unsorted frequency-distribution vector for EITHER a folder, a file OR a character string vector. A specific subset of the result can be retrieved using the print_frequency function.
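For instance, a small sketch on an in-memory vector (the subset value is illustrative):

tk_fr <- token_stats$new(x_vec = c('the', 'of', 'the', 'and', 'the'))
tk_fr$freq_distribution()             # compute the named frequency vector
tk_fr$print_frequency(subset = 1:2)   # keep only the first two rows of the result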
The count_character function returns the number of characters for each word of the corpus for EITHER a folder, a file OR a character string vector. The words having a specific number of characters can be retrieved using the print_count_character function.
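For example (a sketch; the input words and the number value are illustrative):

tk_ch <- token_stats$new(x_vec = c('word', 'token', 'a'))
tk_ch$count_character()                  # number of characters for each word of the corpus
tk_ch$print_count_character(number = 4)  # words consisting of exactly 4 characters, here 'word'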
The collocation_words function returns a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a character string vector. A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172). The input to the function should be text n-grams separated by a delimiter (for instance 3-grams or 4-grams). A specific frequency table can be retrieved using the print_collocations function.
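As a sketch, assuming the input vector holds 3-grams joined with the default '_' delimiter (the trigrams below are made up for illustration):

ngr <- c('the_new_york', 'new_york_times', 'new_york_city')
tk_col <- token_stats$new(x_vec = ngr, n_gram_delimiter = '_')
tk_col$collocation_words()                # co-occurrence frequency table for the n-grams
tk_col$print_collocations(word = 'york')  # frequencies of the words co-occurring with 'york'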
The string_dissimilarity_matrix function returns a string dissimilarity matrix using either the dice, levenshtein or cosine distance. The input can be a character string vector only. If the method is dice, the dice coefficient (similarity) is calculated between two strings for a specific number of character n-grams (dice_n_gram).
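A sketch of the dice method on a small character vector (the dice_thresh value is illustrative):

tk_dis <- token_stats$new(x_vec = c('word', 'words', 'token'))
# 2-character n-grams; the closer dice_thresh is to 0.0, the more entries become 1.0
dsm <- tk_dis$string_dissimilarity_matrix(dice_n_gram = 2, method = 'dice', dice_thresh = 0.5)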
The look_up_table function returns a look-up list whose names are the n-grams and whose vectors are the words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can be a character string vector only.
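A sketch with a three-word vector (the n-gram 'app' is assumed to appear in the resulting table):

tk_lu <- token_stats$new(x_vec = c('apple', 'apply', 'ample'))
lut <- tk_lu$look_up_table(n_grams = 3)       # list names are the 3-character n-grams
tk_lu$print_words_lookup_tbl(n_gram = 'app')  # words associated with the 'app' n-gram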
library(textTinyR)
expl = c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')
tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)
#-------------------------
# frequency distribution:
#-------------------------
tk$freq_distribution()
# tk$print_frequency()
#------------------
# count characters:
#------------------
cnt <- tk$count_character()
# tk$print_count_character(number = 4)
#----------------------
# collocation of words:
#----------------------
col <- tk$collocation_words()
# tk$print_collocations(word = 'five')
#-----------------------------
# string dissimilarity matrix:
#-----------------------------
dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')
#---------------------
# build a look-up-table:
#---------------------
lut <- tk$look_up_table(n_grams = 3)
# tk$print_words_lookup_tbl(n_gram = 'e_w')