text_file_parser

input_path_file

a character string specifying the path to the input file

output_path_file

a character string specifying the path to the output file

start_query

a character string. The start_query is the first word of a subset of the data and should appear frequently at the beginning of lines in the text file.

end_query

a character string. The end_query is the last word of a subset of the data and should appear frequently at the end of lines in the text file.

min_lines

a numeric value specifying the minimum number of lines. For instance, if min_lines = 2, only subsets of text with at least 2 lines are kept.

trimmed_line

either TRUE or FALSE. If FALSE, each line of the text file is trimmed on both sides before the start_query and end_query are applied.

verbose

either TRUE or FALSE. If TRUE, information is printed to the console.
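
The arguments above can be combined into a single call. The following is a minimal sketch, not an official example: it assumes the function is exported by the 'textTinyR' package, and the sample lines and the query words "begin" and "finish" are hypothetical. The call is guarded so that the script still runs when the package is not installed.

```r
# Assumption: text_file_parser is provided by the 'textTinyR' package.
in_file  <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".txt")

# Hypothetical sample data: some lines start with "begin" and
# some lines end with "finish".
writeLines(c(
  "begin first record",
  "middle of first record finish",
  "unrelated line",
  "begin second record finish"
), in_file)

if (requireNamespace("textTinyR", quietly = TRUE)) {
  # Only argument names documented above are used.
  textTinyR::text_file_parser(
    input_path_file  = in_file,
    output_path_file = out_file,
    start_query      = "begin",
    end_query        = "finish",
    min_lines        = 1,
    trimmed_line     = FALSE,
    verbose          = FALSE
  )

  # Inspect whatever was written, without assuming its exact layout.
  if (file.exists(out_file)) print(readLines(out_file))
}
```

Raising min_lines to 2 in this sketch would be expected to drop any subset that spans a single line, following the description of min_lines.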

Processes big text data files efficiently in batches. For this purpose, it offers functions for splitting, parsing, tokenizing and creating a vocabulary. It also includes functions for building either a document-term matrix or a term-document matrix and for extracting information from them (term associations, most frequent terms). Lastly, it includes functions for calculating token statistics (collocations, look-up tables, string dissimilarities) and functions for working with sparse matrices. The source code is written in 'C++11' and exported to R through the 'Rcpp', 'RcppArmadillo' and 'BH' packages.

text_file_parser: text file parser
