- directory
The directory in which to search for PDF files.
- keyword
The keyword(s) to be used to search in the text. Multiple
keywords can be specified with a character vector.
- split_pdf
TRUE/FALSE indicating whether to split the pdf using white
space. This would be most useful with multicolumn pdf files.
The split_pdf function attempts to recreate the column layout of the text
into a single column starting with the left column and proceeding to the
right.
- surround_lines
numeric/FALSE indicating whether the output should
extract the surrounding lines of text in addition to the matching line.
Default is FALSE. If not FALSE, supply a number indicating how many
additional surrounding lines to extract.
- ignore_case
TRUE/FALSE/vector of TRUE/FALSE, indicating whether the
case of the keyword should be ignored.
Default is FALSE, meaning that the keyword is matched case-sensitively.
If a vector, it must be the same length as the keyword vector.
- remove_hyphen
TRUE/FALSE indicating whether words hyphenated across a line
break should be rejoined onto a single line. Default is TRUE.
- token_results
TRUE/FALSE indicating whether the results text returned
should be split into tokens. See the tokenizers package and
convert_tokens for more details. Defaults to TRUE.
- convert_sentence
TRUE/FALSE indicating if individual lines of the PDF file
should be collapsed into a single large paragraph before keyword
searching. Default is TRUE.
- split_pattern
Regular expression pattern used to split multicolumn
PDF files using stringi::stri_split_regex.
Default pattern is "\p{WHITE_SPACE}{3,}", which can be interpreted as:
split based on three or more consecutive white space characters.
- full_names
TRUE/FALSE indicating if the full file path should be used.
Default is TRUE; see list.files for more details.
- file_pattern
An optional regular expression to select specific file
names. Only files that match the regular expression will be searched.
Defaults to all PDFs, i.e. ".pdf". See list.files for more details.
- recursive
TRUE/FALSE indicating if subdirectories should be searched
as well.
Default is FALSE; see list.files for more details.
- max_search
An optional numeric vector indicating the maximum number
of PDFs to search. Only the first n files found are searched.
- ...
token_function to pass to the convert_tokens function.
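
The arguments above can be illustrated with a short sketch. It assumes they belong to keyword_directory from the pdfsearch package and that stringi is installed. The folder "path/to/pdfs" and the keywords are hypothetical placeholders, so the search call is guarded and only runs when both the package and the folder are present. The first part shows, on a single made-up line, what the default split_pattern does to two-column text.

```r
# Sketch only: assumes the stringi package is installed. The pdfsearch call
# below is skipped unless pdfsearch is installed and the folder exists.
library(stringi)

# Effect of the default split_pattern on one line of two-column text:
# runs of three or more white space characters mark the column boundary.
line <- "Left column text      Right column text"
parts <- stri_split_regex(line, "\\p{WHITE_SPACE}{3,}")[[1]]
print(parts)

pdf_dir <- "path/to/pdfs"  # hypothetical location; replace with a real folder

if (requireNamespace("pdfsearch", quietly = TRUE) && dir.exists(pdf_dir)) {
  results <- pdfsearch::keyword_directory(
    directory      = pdf_dir,
    keyword        = c("regression", "ANOVA"),  # hypothetical keywords
    ignore_case    = c(TRUE, FALSE),            # one value per keyword
    surround_lines = 1,                         # one extra line of context
    split_pdf      = TRUE,                      # treat files as multicolumn
    recursive      = TRUE,                      # include subdirectories
    max_search     = 5                          # only the first 5 files
  )
  print(head(results))
}
```

Passing ignore_case as a vector makes the first keyword case-insensitive while the second is matched literally, which is the per-keyword behaviour described under ignore_case.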