corporaexplorer (version 0.6.2)

prepare_data: Prepare data for corpus exploration

Description

Prepare data for corpus exploration

Usage

prepare_data(dataset, ...)

# S3 method for data.frame
prepare_data(
  dataset,
  date_based_corpus = TRUE,
  grouping_variable = NULL,
  columns_doc_info = c("Date", "Title", "URL"),
  corpus_name = NULL,
  use_matrix = TRUE,
  normalise = TRUE,
  matrix_without_punctuation = TRUE,
  ...
)

Arguments

dataset

Object to be converted to a corporaexplorer object: a data frame with a column "Text" (class character), and optionally other columns. If date_based_corpus is TRUE (the default), dataset must contain a column "Date" (of class Date).
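A minimal data frame meeting these requirements might look like the sketch below (the "Title" column is optional and purely illustrative):

minimal_df <- tibble::tibble(
  Date = as.Date(c("2020-01-01", "2020-01-02")),   # required when date_based_corpus = TRUE
  Text = c("First document.", "Second document."), # required, class character
  Title = c("Doc 1", "Doc 2")                      # optional metadata column
)
corpus <- prepare_data(minimal_df)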

...

Ignored.

date_based_corpus

Logical. Set to FALSE if the corpus is not to be organised according to document dates.

grouping_variable

Character string. If date_based_corpus is TRUE, this argument is ignored. If date_based_corpus is FALSE, this argument can be used to group the documents, e.g. if dataset is organised by chapters belonging to different books.
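A sketch of a non-date-based corpus grouped by book (the "Book" column name and contents are illustrative):

chapters_df <- data.frame(
  Text = c("Book A, chapter 1 ...", "Book A, chapter 2 ...", "Book B, chapter 1 ..."),
  Book = c("Book A", "Book A", "Book B"),
  stringsAsFactors = FALSE
)
book_corpus <- prepare_data(
  chapters_df,
  date_based_corpus = FALSE,
  grouping_variable = "Book"
)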

columns_doc_info

Character vector. The columns from dataset to display in the "document information" tab in the corpus exploration app. By default "Date", "Title" and "URL" will be displayed, if included. If columns_doc_info includes a column which is not present in dataset, it is ignored.
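For example, using the test_df constructed in the Examples section below, one could request a hypothetical "Author" column as well; since test_df has no such column, that entry would simply be ignored:

corpus <- prepare_data(
  test_df,
  columns_doc_info = c("Date", "Title", "Author")
)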

corpus_name

Character string with name of corpus.

use_matrix

Logical. Should the function create a document term matrix for fast searching? If TRUE, data preparation will take longer and require more memory. If FALSE, the returned corporaexplorer object will be more lightweight, but searching will be slower.
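A lightweight object without a document term matrix can be created as in this sketch (again using test_df from the Examples below):

corpus_light <- prepare_data(test_df, use_matrix = FALSE)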

normalise

Logical. Should non-breaking spaces (U+00A0) and soft hyphens (U+00AD) be normalised?
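What "normalised" involves internally is not spelled out here; as an assumption, the equivalent manual pre-processing would be something along these lines (hypothetical clean-up, not corporaexplorer's internal code):

raw <- "A line\u00a0with a non-breaking space and a soft\u00adhyphen."
clean <- gsub("\u00ad", "", raw)     # drop soft hyphens
clean <- gsub("\u00a0", " ", clean)  # turn non-breaking spaces into ordinary spaces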

matrix_without_punctuation

Should punctuation and digits be stripped from the text before constructing the document term matrix? If TRUE, the default:

  • The corporaexplorer object will be lighter and most searches in the corpus exploration app will be faster.

  • Searches including punctuation and digits will be carried out in the full text documents.

  • The only "risk" with this strategy is that the corpus exploration app in some cases can produce false positives, e.g. a search for the term "donkey" will also find the string "don%key". This should not be a problem for the vast majority of use cases, but if one so desires, there are three solutions: set this parameter to FALSE, create a corporaexplorer object without a matrix by setting the use_matrix parameter to FALSE, or run run_corpus_explorer with its use_matrix parameter set to FALSE (see the sketch after this list).

If FALSE, the corporaexplorer object will be larger, and most simple searches will be slower.
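In code, the three workarounds look roughly like this (the first two reuse test_df from the Examples below; the third relies on run_corpus_explorer's use_matrix argument, described in that function's documentation):

# 1. Keep punctuation and digits in the document term matrix:
corpus_full <- prepare_data(test_df, matrix_without_punctuation = FALSE)

# 2. Build the object without a document term matrix at all:
corpus_no_matrix <- prepare_data(test_df, use_matrix = FALSE)

# 3. Keep the matrix, but have the app search the full text instead:
run_corpus_explorer(corpus, use_matrix = FALSE)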

Value

A corporaexplorer object to be passed as argument to run_corpus_explorer and run_document_extractor.

Details

Each row in dataset is treated as the basic unit in the corpus, typically a chapter in a book, or a single document in a document collection.

The following column names are reserved and cannot be used in dataset: "ID", "Text_original_case", "Tile_length", "Year", "Seq", "Weekday_n", "Day_without_docs", "Invisible_fake_date".

Examples

# NOT RUN {
# Constructing test data frame:
dates <- as.Date(paste(2011:2020, 1:10, 21:30, sep = "-"))
texts <- paste0(
  "This is a document about ", month.name[1:10], ". ",
  "This is not a document about ", rev(month.name[1:10]), "."
)
titles <- paste("Text", 1:10)
test_df <- tibble::tibble(Date = dates, Text = texts, Title = titles)

# Converting to corporaexplorer object:
corpus <- prepare_data(test_df, corpus_name = "Test corpus")

if (interactive()) {
# Running exploration app:
run_corpus_explorer(corpus)

# Running app to extract documents:
run_document_extractor(corpus)
}
# }
