An automated cleaning function for spell-checking, de-pluralizing, removing duplicates, and binarizing text data
textcleaner(data, miss = 99, partBY = c("row", "col"),
dictionary = NULL, tolerance = 1)
data: Matrix or data frame. A dataset of text data. Participant IDs should be set as row or column names, depending on whether participants are by row or by column (see argument partBY). If no IDs are provided, then participants' order in the corresponding rows (or columns) is used. A message will notify the user how IDs were assigned.
miss: Numeric or character. Value for missing data. Defaults to 99.
partBY: Character. Are participants by row or column? Set to "row" for by row. Set to "col" for by column.
dictionary: Character vector. Dictionary to be used for more efficient text cleaning. Can be a vector of a corpus or any text for comparison. Defaults to NULL, which will use general.dictionary. Use dictionaries() or find.dictionaries() for more options (see SemNetDictionaries for more details).
tolerance: Numeric. The distance tolerance set for automatic spell-correction. This function uses the function stringdist to compute the Damerau-Levenshtein (DL) distance, which is used to determine potential best guesses. Unique words (i.e., n = 1) that are within the distance tolerance are automatically output as best-guess responses, which are then passed through word.check.wrapper. If more than one word is within the distance tolerance, then these words are provided as potential options. The recommended and default distance tolerance is tolerance = 1, which only spell-corrects a word if there is only one word with a DL distance of 1.
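The tolerance rule described for the tolerance argument can be sketched as follows. This is only an illustration, assuming the stringdist package is installed; the helper candidates() and the toy dictionary are hypothetical and are not part of textcleaner, whose internal logic may differ.

```r
library(stringdist)

# Hypothetical helper: return the dictionary words whose
# Damerau-Levenshtein (DL) distance from the response is within tolerance
candidates <- function(response, dictionary, tolerance = 1) {
  d <- stringdist(response, dictionary, method = "dl")
  dictionary[d <= tolerance]
}

# Toy dictionary (an assumption for illustration only)
dict <- c("the", "ten", "dog")

# Exactly one word within tolerance: a single best guess
one_match <- candidates("dgo", dict)

# More than one word within tolerance: these would be offered as options
many_matches <- candidates("teh", dict)
```

With tolerance = 1, a response is only a candidate for automatic correction when a single dictionary word lies within a DL distance of 1; otherwise the user would choose among the options.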
This function returns a list containing the following objects:
A matrix of responses where each row represents a participant and each column represents a unique response. A response that a participant has provided is a '1' and a response that a participant has not provided is a '0'.
A response matrix that has been spell-checked and de-pluralized with duplicates removed. This can be used as a final dataset for analyses (e.g., fluency of responses)
A list containing two objects: full and unique. full contains all responses regardless of spell-check changes and unique contains only responses that were changed during the spell-check.
A list containing two objects: rows and ids. rows identifies removed participants by their row (or column) location in the original data file and ids identifies removed participants by their ID (see argument data).
A list where each participant is a list index containing each response that has been changed. Participants are identified by their ID (see argument data). This can be used to replicate the cleaning process and to keep track of changes more generally. Participants with NA did not have any changes from their original data, and participants with missing data are removed (see removed$ids).
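The relationship between a cleaned response matrix and the binary matrix described above can be sketched in base R. This is a minimal illustration with made-up data, not the code textcleaner itself uses; the object names (responses, unique_resp, binary) and the participant IDs are assumptions chosen for the example.

```r
# Toy cleaned responses: one row per participant, IDs as row names
responses <- rbind(
  p1 = c("cat", "dog", NA),
  p2 = c("dog", "fish", "bird")
)

# All unique, non-missing responses across participants
vals <- as.vector(responses)
unique_resp <- sort(unique(vals[!is.na(vals)]))

# One row per participant, one column per unique response:
# 1 = the participant gave the response, 0 = they did not
binary <- t(apply(responses, 1, function(r) as.integer(unique_resp %in% r)))
colnames(binary) <- unique_resp

# Number of unique responses per participant (a simple fluency count)
fluency <- rowSums(binary)
```

Row sums of such a matrix give the number of unique responses per participant, which is one way the output could feed a fluency analysis.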
When working through the menu options in textcleaner, there may be mistakes. For instance, selecting REMOVE for a response when all you really wanted to do was RENAME it. There are a couple of options:
1. RECOMMENDED: You can make a note in your R script of the change you wanted to make (you can keep moving through the cleaning process). After the cleaning process is through, you can check the spellcheck$unique output of textcleaner to see what changes you made. To correct any changes you made in the cleaning process, you can use the corr.chn function.
2. NOT RECOMMENDED: You can use esc to exit out of a menu selection process. This is NOT recommended because you will lose all changes that you've made up to that point.
Hornik, K., & Murdoch, D. (2010). Watch Your Spelling!. The R Journal, 3, 22-28. doi:10.32614/RJ-2011-014
# Load trial data
data <- trial

# Not run: textcleaner steps through interactive menus
# Clean the data with participants by column
rmat <- textcleaner(data, partBY = "col")