userfriendlyscience (version 0.5-2)

detectRareWords: Looking up word frequencies

Description

This function checks, for each word in a text, how frequently it occurs in a given language. This is useful for eliminating rare words to make a text more accessible to an audience with limited vocabulary. htmlParse and xpathSApply from the XML package are used to process HTML files, if necessary. textToWords is a helper function that simply breaks down a character vector to a vector of words.

Usage

detectRareWords(textFile = NULL, wordFrequencyFile = "Dutch",
                output = c("file", "show", "return"),
                outputFile = NULL, wordCol = "Word",
                freqCol = "FREQlemma",
                textToWordsFunction = "textToWords",
                encoding = "ASCII", xPathSelector = "/text()",
                silent = FALSE)

textToWords(characterVector)

Arguments

textFile
If NULL, a dialog will be shown that enables users to select a file. If not NULL, this has to be either a filename or a character vector. An HTML file can be provided; this will be parsed using htmlParse from the XML package.
wordFrequencyFile
The file with word frequencies to use. If 'Dutch' or 'Polish', files from the Center for Reading Research (http://crr.ugent.be/) are downloaded.
output
How to provide the output, as a character vector. If "file", the filename to write to should be provided in outputFile. If "show", the output is shown; and if "return", the output is returned invisibly.
outputFile
The name of the file to store the output in.
wordCol
The name of the column in the wordFrequencyFile that contains the words.
freqCol
The name of the column in the wordFrequencyFile that contains the frequency with which each word occurs.
textToWordsFunction
The function to use to split a character vector, where each element contains one or more words, into a vector where each element is a word.
encoding
The encoding used to read and write files.
xPathSelector
If the file provided is an HTML file, xpathSApply is used to extract the content. xPathSelector specifies which content to extract (the default value extracts all text content).
silent
Whether to suppress detailed feedback about the process.
characterVector
A character vector, the elements of which are to be broken down into words.
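
To illustrate the kind of function that characterVector and textToWordsFunction refer to, here is a minimal sketch in base R of a word splitter. The name splitIntoWords and the splitting rule (break on anything that is not a letter or an apostrophe) are assumptions made for illustration; this is not the package's textToWords implementation.

```r
# Hypothetical splitter, shown for illustration only.
splitIntoWords <- function(characterVector) {
  # Split every element on runs of characters that are
  # neither letters nor apostrophes, then flatten the list.
  words <- unlist(strsplit(characterVector, "[^[:alpha:]']+"))
  # Drop empty strings, which strsplit produces when an
  # element starts with punctuation or whitespace.
  words[nzchar(words)]
}

words <- splitIntoWords(c("Dit is een tekst,",
                          "om de functie te demonstreren."))
print(words)
```

Each element of the input may hold several words; the result is a single character vector with one word per element.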

Value

detectRareWords returns a data frame (invisibly) if output contains "return". Otherwise, NULL is returned (invisibly), but the output is printed and/or written to a file depending on the value of output. textToWords returns a vector of words.
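
The general idea of looking up a frequency for each word can be sketched in base R as follows. This is an assumption-laden illustration, not the code of detectRareWords: the function lookUpFrequencies, the toy table, and its frequency values are all invented here, and only the column names "Word" and "FREQlemma" are taken from the defaults of wordCol and freqCol.

```r
# Toy frequency table; the values are invented for illustration.
wordFrequencies <- data.frame(
  Word = c("dit", "is", "een", "tekst", "de"),
  FREQlemma = c(5000, 9000, 12000, 300, 20000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up each distinct word in the table.
lookUpFrequencies <- function(words, frequencies,
                              wordCol = "Word",
                              freqCol = "FREQlemma") {
  words <- unique(tolower(words))
  # match() yields NA for words that are absent from the table.
  rows <- match(words, tolower(frequencies[[wordCol]]))
  result <- data.frame(word = words,
                       frequency = frequencies[[freqCol]][rows],
                       stringsAsFactors = FALSE)
  # Rarest words first; words missing from the table come first of all.
  result[order(result$frequency, na.last = FALSE), ]
}

res <- lookUpFrequencies(
  c("Dit", "is", "een", "tekst", "detectRareWords"),
  wordFrequencies
)
print(res)
```

Words that do not occur in the frequency table get NA as their frequency and are sorted to the top, since they are the most likely candidates for replacement.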

Examples

## Not run: 
detectRareWords(paste('Dit is een tekst om de',
                      'werking van de detectRareWords',
                      'functie te demonstreren.'),
                output='show');
## End(Not run)