Xplortext (version 1.00)

TextData: Building textual and contextual tables (TextData)

Description

Creates a textual and contextual working-base (TextData format) from a source-base (data frame format).

Usage

TextData(base, var.text=NULL, var.agg=NULL, context.quali=NULL, context.quanti=NULL,
 selDoc="ALL", lower=TRUE, remov.number=TRUE, lminword=1, Fmin=1, Dmin=1, Fmax=Inf,
 stop.word.tm=FALSE, idiom="en", stop.word.user=NULL, segment=FALSE,
 sep.strong="\u005B()\u00BF?./:\u00A1!=+;{}-\u005D", seg.nfreq=10, seg.nfreq2=10,
 seg.nfreq3=10)

Arguments

base
source data frame with at least one textual column
var.text
vector with index(es) or name(s) of the selected textual column(s) (by default NULL)
var.agg
index or name of the aggregation categorical variable (by default NULL)
context.quali
vector with index(es) or name(s) of the selected categorical variable(s) (by default NULL)
context.quanti
vector with index(es) or name(s) of the selected quantitative variable(s) (by default NULL)
selDoc
vector with index(es) or name(s) of the selected source-documents (rows of the source-base) (by default "ALL")
lower
if TRUE, the corpus is converted into lowercase (by default TRUE)
remov.number
if TRUE, numbers are removed (by default TRUE)
lminword
minimum length of a word to be selected (by default 1)
Fmin
minimum frequency of a word to be selected (by default 1)
Dmin
a word has to be used in at least Dmin source-documents to be selected (by default 1)
Fmax
maximum frequency of a word to be selected (by default Inf)
stop.word.tm
if TRUE, a stoplist is automatically provided according to the idiom (by default FALSE)
idiom
declared idiom for the textual column(s) (by default English "en", see IETF language in package NLP)
stop.word.user
stoplist provided by the user
segment
if TRUE, the repeated segments are identified (by default FALSE)
sep.strong
string with the characters marking out the repeated segments (by default "[()¿?./:¡!=+;{}-]")
seg.nfreq
minimum frequency of a more-than-three-words-long repeated segment (by default 10)
seg.nfreq2
minimum frequency of a two-words-long repeated segment (by default 10)
seg.nfreq3
minimum frequency of a three-words-long repeated segment (by default 10)
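
The word-selection thresholds (lminword, Fmin, Dmin, Fmax) can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy corpus, the whitespace tokenizer and the helper select_words are assumptions made for illustration only, and the thresholds are treated as inclusive bounds.

```r
# Toy corpus: each element is one source-document (illustrative data only).
docs <- c("the cat sat on the mat",
          "the dog sat",
          "a cat and a dog",
          "the bird")

# Naive whitespace tokenization; TextData's own tokenizer may differ.
tokens <- strsplit(tolower(docs), "\\s+")

# Frequency of each word over the whole corpus.
freq <- table(unlist(tokens))

# Number of source-documents in which each word occurs at least once.
doc.freq <- table(unlist(lapply(tokens, unique)))

# Hypothetical helper mimicking the lminword, Fmin, Dmin and Fmax thresholds.
select_words <- function(freq, doc.freq, lminword = 1, Fmin = 1,
                         Dmin = 1, Fmax = Inf) {
  words <- names(freq)
  keep <- nchar(words) >= lminword &
    as.vector(freq) >= Fmin &
    as.vector(doc.freq[words]) >= Dmin &
    as.vector(freq) <= Fmax
  words[keep]
}

# Keep words of at least 3 letters, used at least twice, in at least 2 documents.
selected <- select_words(freq, doc.freq, lminword = 3, Fmin = 2, Dmin = 2)
print(selected)
```

Raising Fmin and Dmin together, as in the Examples section (Fmin=10, Dmin=10), discards words that are rare overall as well as words that are frequent in only a handful of documents.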

Value

A list including:
summGen
general summary
summDoc
document summary
indexW
index of words
DocTerm
working lexical table (non-aggregate or aggregate table depending on var.agg value); working-documents by words table in slam package compressed format
context
contextual variables if context.quali or context.quanti are non-NULL; the structure differs greatly depending on whether the DocTerm table is non-aggregate or aggregate (see Details)
info
information about the selection of words
var.agg
a one-column data frame with the values of the aggregation variable; NULL if non-aggregate analysis
SourceTerm
in the case of DocTerm being an aggregate analysis, the source-documents by words table is kept in this data structure, in slam package compressed format
indexS
working-documents by repeated-segments table, in slam package compressed format
remov.docs
vector with the names of the removed empty source-documents

Details

Each row of the source-base is considered a source-document. The TextData function builds the working-documents-by-words table that is submitted to the analysis.

Information related to context.quanti and context.quali arguments:

  1. A numeric contextual variable can be included in both vectors. When it appears in context.quali, TextData converts it into a factor. This is useful in some cases. For example, when treating open-ended questions, we may want to compute the correlation between the contextual variable "Age" and the axes and, at the same time, draw the trajectory of the different values of "Age" (year by year) on the CA maps.
  2. If context.quali is equal to "ALL", any columns with textual data that are not selected in var.text are treated as categorical variables.
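
The double role of a numeric variable can be sketched in base R as follows. The data frame resp is illustrative (it is not the open.question data set), and the conversion with factor() is an assumption about what "converts the numeric variable into factor" amounts to, not a copy of the package code.

```r
# Illustrative respondent data (hypothetical values).
resp <- data.frame(Age = c(25, 31, 25, 47, 31))

# Quantitative use: the variable stays numeric, so correlations can be computed.
age.quanti <- resp$Age

# Categorical use: each distinct value becomes one category,
# which gives one point per age on a map.
age.quali <- factor(resp$Age)

print(is.numeric(age.quanti))
print(levels(age.quali))
print(table(age.quali))
```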

Non-aggregate table versus aggregate table.

If var.agg=NULL:

  1. The working-documents are the non-empty source-documents.
  2. DocTerm: non-aggregate lexical table with as many rows as non-empty source-documents.
  3. context$quali: data frame crossing the non-empty source-documents (rows) and the categorical contextual-variables (columns).
  4. context$quanti: data frame crossing the non-empty source-documents (rows) and the quantitative contextual-variables (columns).

Both contextual tables can be juxtaposed row-wise to the DocTerm table.
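
The non-aggregate structure described above can be sketched with base R on a toy source-base. All object names and data here are illustrative assumptions; the table is built as a dense matrix for readability, whereas TextData returns DocTerm in slam compressed format.

```r
# Toy source-base: one textual column and two contextual variables.
src <- data.frame(
  text = c("good food", "", "good price good food", "bad service"),
  Gender = c("F", "M", "M", "F"),
  Age = c(30, 41, 52, 28),
  stringsAsFactors = FALSE
)
rownames(src) <- paste0("doc", 1:4)

# Keep only the non-empty source-documents.
nonempty <- nchar(trimws(src$text)) > 0
work <- src[nonempty, ]

# Naive whitespace tokenization and vocabulary.
tokens <- strsplit(work$text, "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Dense documents-by-words count matrix.
DocTerm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
dimnames(DocTerm) <- list(rownames(work), vocab)

# Contextual tables share their rows with DocTerm.
context.quali <- work[, "Gender", drop = FALSE]
context.quanti <- work[, "Age", drop = FALSE]

print(DocTerm)
print(cbind(DocTerm, context.quanti))
```

Because the empty document is dropped before any table is built, the three tables have the same rows in the same order, which is what makes the juxtaposition possible.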

If var.agg is non-NULL:

  1. The working-documents are aggregate-documents, obtained by aggregating the source-documents according to the categories of the aggregation variable; the aggregate-documents inherit the names of the corresponding categories.
  2. DocTerm is an aggregate table with as many rows as the aggregation variable has categories.
  3. context$quali$qualitable: juxtaposes as many supplementary aggregate tables as there are categorical contextual variables. Each table has as many rows as the contextual categorical variable has categories.
  4. context$quali$qualivar: names of the categories of the supplementary categorical variables.
  5. context$quanti: data frame crossing the working aggregate-documents (rows) and the quantitative contextual-variables (columns). The value for an active aggregate-document is the mean value over the source-documents belonging to this aggregate-document.
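
The aggregate case can be sketched in base R as below. The counts, the grouping variables and the object names are hypothetical, and rowsum() and tapply() are used here only to illustrate the idea of summing counts and averaging quantitative values within categories; the package may proceed differently.

```r
# Illustrative source-documents by words counts (4 documents, 3 words).
SourceTerm <- matrix(c(1, 0, 2,
                       0, 1, 1,
                       3, 0, 0,
                       1, 1, 0),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("doc", 1:4),
                                     c("food", "price", "service")))

# Hypothetical aggregation variable and contextual variables.
group <- factor(c("young", "old", "young", "old"))
Gender <- factor(c("F", "M", "M", "F"))
Age <- c(24, 61, 30, 55)

# Aggregate lexical table: counts summed within each category,
# with rows named after the categories.
DocTerm <- rowsum(SourceTerm, group)

# Supplementary aggregate table for one categorical contextual variable:
# one row per category of that variable.
qualitable <- rowsum(SourceTerm, Gender)

# Quantitative context: mean over the source-documents of each aggregate-document.
context.quanti <- tapply(Age, group, mean)

print(DocTerm)
print(qualitable)
print(context.quanti)
```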

References

Lebart, L., Salem, A., & Berry, L. (1998). Exploring Textual Data. Dordrecht: Kluwer Academic Publishers.

See Also

print.TextData, summary.TextData, plot.TextData

Examples

# Non aggregate analysis
data(open.question)
res.TD<-TextData(open.question, var.text=c(9,10), remov.number=TRUE, Fmin=10, Dmin=10,  
 stop.word.tm=TRUE, context.quali=c("Gender","Age_Group","Education"), context.quanti=c("Age"))

# Aggregate analysis and repeated segments
data(open.question)
res.TD<-TextData(open.question, var.text=c(9,10), var.agg="Gen_Age", remov.number=TRUE, 
 Fmin=10, Dmin=10, stop.word.tm=TRUE, context.quali=c("Gender","Age_Group","Education"), 
 context.quanti=c("Age"), segment=TRUE)
