get_ngrams

A character vector from which to extract n-grams.

Numeric: the minimum number of terms in an n-gram.

Numeric: the minimum number of times an n-gram must occur to be returned.

min_freq

Numeric: what quantile of ngrams should be retained. Defaults to 0.8; i.e. the 80th percentile of ngram frequencies.

ngram_quantile

A character vector of stopwords to ignore.

stop_words

Logical: should punctuation be removed before selecting ngrams?

rm_punctuation

A character vector of punctuation marks to be retained if rm_punctuation is TRUE.

preserve_chars

A string indicating the language to use for removing stopwords.

language

This function extracts n-grams from text.

A suite of tools are provided here to support authors in making their research more discoverable.
check_keywords() - this function checks the keywords to assess whether they are already represented in the
title and abstract. check_fields() - this function compares terminology used across the title, abstract
and keywords to assess where terminological diversity (i.e. the use of synonyms) could increase the likelihood
of the record being identified in a search. The function looks for terms in the title and abstract that also
exist in other fields and highlights these as needing attention. suggest_keywords() - this function takes a
full text document and produces a list of unigrams, bigrams and trigrams (1-, 2- or 2-word phrases)
present in the full text after removing stop words (words with a low utility in natural language processing)
that do not occur in the title or abstract that may be suitable candidates for keywords. suggest_title() -
this function takes a full text document and produces a list of the most frequently used unigrams, bigrams
and trigrams after removing stop words that do not occur in the abstract or keywords that may be suitable
candidates for title words. check_title() - this function carries out a number of sub tasks: 1) it compares
the length (number of words) of the title with the mean length of titles in major bibliographic databases to
assess whether the title is likely to be too short; 2) it assesses the proportion of stop words in the title
to highlight titles with low utility in search engines that strip out stop words; 3) it compares the title
with a given sample of record titles from an .ris import and calculates a similarity score based on phrase
overlap. This highlights the level of uniqueness of the title. This version of the package also contains
functions currently in a non-CRAN package called 'litsearchr' <https://github.com/elizagrames/litsearchr>.

get_ngrams: Extract n-grams from text

Description

Usage

Arguments

Value

Examples