keyword_clean: Automatic keyword cleaning and transfer to tidy format
Description
Carry out several keyword cleaning processes automatically and return a tidy table with
document ID and keywords.
Usage
keyword_clean(
  df,
  id = "id",
  keyword = "keyword",
  sep = ";",
  rmParentheses = TRUE,
  rmNumber = TRUE,
  lemmatize = FALSE
)
Arguments
df
A data.frame containing at least two columns: one with the document ID and one with keyword strings joined by separators.
id
Quoted character string specifying the column name of the document ID. Default uses "id".
keyword
Quoted character string specifying the column name of the keywords. Default uses "keyword".
sep
Separator(s) of keywords. Default uses ";".
rmParentheses
Whether to remove content in parentheses (including the parentheses themselves). Default
uses TRUE.
rmNumber
Whether to remove keywords that are pure number sequences. Default uses TRUE.
lemmatize
Whether to lemmatize the keywords. Lemmatization is supported by the `lemmatize_strings` function
in the `textstem` package. Default uses FALSE.
Value
A tbl with two columns, namely document ID and cleaned keywords.
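To make the "tidy" shape concrete, the base R sketch below builds a table with one row per document ID and keyword pair. It is only an illustration of the output layout, not the package's code; the data frame `docs` and its contents are made up for the example.

```r
# Hypothetical input: one row per document, keywords joined by ";"
docs <- data.frame(
  id = c(1, 2),
  keyword = c("a;b", "c"),
  stringsAsFactors = FALSE
)

# Split each keyword string, then repeat each id once per keyword
parts <- strsplit(docs$keyword, ";", fixed = TRUE)
tidy <- data.frame(
  id = rep(docs$id, lengths(parts)),
  keyword = unlist(parts),
  stringsAsFactors = FALSE
)

print(tidy)
```

Each row of `tidy` holds exactly one keyword, which is the layout that downstream grouping and counting functions typically expect.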
Details
The full cleaning process includes:
1. Split the text at the separators;
2. Remove the contents in parentheses (including the parentheses);
3. Remove whitespace from the start and end of each string and reduce repeated whitespace inside a string;
4. Remove all empty strings and pure number sequences;
5. Convert all letters to lower case;
6. Lemmatization.
Some of these steps can be switched off or on through the arguments.
The default setting does not use lemmatization; it is suggested to use keyword_merge to
merge the keywords afterward.
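The steps listed above can be approximated in base R as follows. This is a minimal sketch under stated assumptions, not the akc implementation: `clean_keywords_sketch` and its argument names are hypothetical, the separator is treated as a fixed string, the steps are applied in the listed order, and lemmatization (step 6) is left out.

```r
# Hypothetical helper mirroring steps 1-5 of the documented process.
# Not the akc implementation; lemmatization is omitted.
clean_keywords_sketch <- function(x, sep = ";",
                                  rm_parentheses = TRUE,
                                  rm_number = TRUE) {
  # 1. Split the text at the separator (fixed string, not a regex)
  kw <- unlist(strsplit(x, sep, fixed = TRUE))
  # 2. Remove parenthesised content, including the parentheses
  if (rm_parentheses) kw <- gsub("\\([^)]*\\)", "", kw)
  # 3. Trim both ends and collapse repeated inner whitespace
  kw <- gsub("\\s+", " ", trimws(kw))
  # 4. Drop empty strings and, optionally, pure digit sequences
  kw <- kw[nzchar(kw)]
  if (rm_number) kw <- kw[!grepl("^[0-9]+$", kw)]
  # 5. Convert to lower case
  tolower(kw)
}

cleaned <- clean_keywords_sketch(
  "Machine Learning (ML);  deep   learning ;2019;;Text Mining"
)
print(cleaned)
# "machine learning" "deep learning" "text mining"
```

Setting `rm_parentheses = FALSE` or `rm_number = FALSE` in the sketch skips the corresponding step, which is analogous to the way `rmParentheses` and `rmNumber` toggle those steps in keyword_clean.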
Examples
# NOT RUN {
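library(akc)

# Inspect the example bibliographic data shipped with the package
bibli_data_table

# Clean the keyword column and return one row per document ID and keyword
bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword")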
# }