TextData: Building textual and contextual tables (TextData)

Description

Creates a textual and contextual working-base (TextData format) from a source-base (data frame format).

Usage

TextData(base, var.text=NULL, var.agg=NULL, context.quali=NULL, context.quanti=NULL,
 selDoc="ALL", lower=TRUE, remov.number=TRUE, lminword=1, Fmin=1, Dmin=1, Fmax=Inf,
 stop.word.tm=FALSE, idiom="en", stop.word.user=NULL, segment=FALSE,
 sep.strong="\u005B()\u00BF?./:\u00A1!=+;{}-\u005D", seg.nfreq=10, seg.nfreq2=10,
 seg.nfreq3=10)

Arguments

base

source data frame with at least one textual column

var.text

vector with index(es) or name(s) of the selected textual column(s) (by default NULL)

var.agg

index or name of the aggregation categorical variable (by default NULL)

context.quali

vector with index(es) or name(s) of the selected categorical variable(s) (by default NULL)

context.quanti

vector with index(es) or name(s) of the selected quantitative variable(s) (by default NULL)

selDoc

vector with index(es) or name(s) of the selected source-documents (rows of the source-base) (by default "ALL")

lower

if TRUE, the corpus is converted into lowercase (by default TRUE)

remov.number

if TRUE, numbers are removed (by default TRUE)

lminword

minimum length of a word to be selected (by default 1)

Fmin

minimum frequency of a word to be selected (by default 1)

Dmin

a word has to be used in at least Dmin source-documents to be selected (by default 1)

Fmax

maximum frequency of a word to be selected (by default Inf)

stop.word.tm

if TRUE, a stoplist is automatically provided in accordance with the idiom (by default FALSE)

idiom

declared idiom for the textual column(s) (by default English "en", see IETF language in package NLP)

stop.word.user

stoplist provided by the user (by default NULL)

segment

if TRUE, the repeated segments are identified (by default FALSE)

sep.strong

string with the characters marking out the repeated segments (by default "[()¿?./:¡!=+;{}-]")

seg.nfreq

minimum frequency of a more-than-three-words-long repeated segment (by default 10)

seg.nfreq2

minimum frequency of a two-words-long repeated segment (by default 10)

seg.nfreq3

minimum frequency of a three-words-long repeated segment (by default 10)

Value

A list including:

summGen

general summary

summDoc

document summary

indexW

index of words

DocTerm

working lexical table (non-aggregate or aggregate table depending on var.agg value); working-documents by words table in slam package compressed format

context

contextual variables if context.quali or context.quanti are non-NULL; the structure differs greatly depending on the nature of the DocTerm table (non-aggregate or aggregate), see Details

info

information about the selection of words

var.agg

a one-column data frame with the values of the aggregation variable; NULL if non-aggregate analysis

SourceTerm

in the case of DocTerm being an aggregate analysis, the source-documents by words table is kept in this data structure, in slam package compressed format

indexS

working-documents by repeated-segments table, in slam package compressed format

remov.docs

vector with the names of the removed empty source-documents

Details

Each row of the source-base is considered as a source-document. The TextData function builds the working-documents-by-words table that is submitted to the analysis.
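As a rough illustration of what a working-documents-by-words table contains, the following base R sketch builds one from a toy source-base. It does not call Xplortext: the data frame toy, the tokenisation rule and the helper build_doc_term are hypothetical and only mimic the effect of the lower, Fmin and Dmin arguments described above.

```r
# Toy source-base: one row per source-document (hypothetical data).
toy <- data.frame(
  text = c("Good health and good food", "Health first", "",
           "Food, family and health"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, NOT the Xplortext implementation: it only mimics the
# effect of lower, Fmin and Dmin on a dense documents-by-words count table.
build_doc_term <- function(texts, lower = TRUE, Fmin = 1, Dmin = 1) {
  if (lower) texts <- tolower(texts)
  # Split on anything that is not a letter and drop empty tokens.
  tokens <- lapply(strsplit(texts, "[^[:alpha:]]+"), function(w) w[nzchar(w)])
  vocab <- sort(unique(unlist(tokens)))
  # One row per document, one column per word.
  counts <- t(vapply(tokens,
                     function(w) as.numeric(table(factor(w, levels = vocab))),
                     numeric(length(vocab))))
  dimnames(counts) <- list(paste0("doc", seq_along(texts)), vocab)
  # Keep words with total frequency >= Fmin used in at least Dmin documents.
  keep <- colSums(counts) >= Fmin & colSums(counts > 0) >= Dmin
  counts <- counts[, keep, drop = FALSE]
  # Documents left without any selected word are dropped (sketch assumption).
  counts[rowSums(counts) > 0, , drop = FALSE]
}

doc_term <- build_doc_term(toy$text, Fmin = 2, Dmin = 2)
print(doc_term)
```

In this toy case the third source-document is empty, so the resulting table has one row fewer than the source-base. TextData returns its tables in slam compressed format rather than as a dense matrix.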

Information related to context.quanti and context.quali arguments:

A numeric contextual variable can be included in both vectors. In that case, the function TextData converts the numeric variable into a factor for its use in the context.quali vector. This is useful in some cases. For example, when treating open-ended questions, we may want to compute the correlation between the contextual variable "Age" and the axes and, at the same time, to draw the trajectory of the different values of "Age" (year by year) on the CA maps.
If one or several columns with textual data are not selected in the vector var.text and the argument context.quali is equal to "ALL", these columns are considered as categorical variables.
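The two roles of a numeric contextual variable can be sketched in base R as follows. The data frame toy is hypothetical, and the use of as.factor() is an assumption about how such a conversion could be done, not a description of the TextData internals.

```r
# Toy source-base with a numeric contextual variable (hypothetical data).
toy <- data.frame(
  text = c("good health", "family first", "health and work"),
  Age  = c(25, 40, 25),
  stringsAsFactors = FALSE
)

# Role 1: quantitative variable, kept numeric (e.g. to correlate with the axes).
age_quanti <- toy$Age

# Role 2: categorical variable, with one category per distinct value.
# as.factor() is an assumption of this sketch.
age_quali <- as.factor(toy$Age)

print(mean(age_quanti))
print(levels(age_quali))
```

With TextData itself, this corresponds to naming the same column, for instance "Age", in both context.quali and context.quanti.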

Non-aggregate table versus aggregate table.

If var.agg=NULL:

The working-documents are the non-empty source-documents.
DocTerm: non-aggregate lexical table with as many rows as non-empty source-documents.
context$quali: data frame crossing the non-empty source-documents (rows) and the categorical contextual-variables (columns).
context$quanti: data frame crossing the non-empty source-documents (rows) and the quantitative contextual-variables (columns).
Both contextual tables share their rows with the DocTerm table, so they can be juxtaposed to it.

If var.agg is NON-NULL:

The working-documents are aggregate-documents, obtained by aggregating the source-documents according to the categories of the aggregation variable; the aggregate-documents inherit the names of the corresponding categories.
DocTerm is an aggregate table with as many rows as the aggregation variable has categories.
context$quali$qualitable: juxtaposes as many supplementary aggregate tables as there are categorical contextual variables. Each table has as many rows as its categorical contextual variable has categories.
context$quali$qualivar: names of the categories of the supplementary categorical variables.
context$quanti: data frame crossing the working aggregate-documents (rows) and the quantitative contextual-variables (columns). The value for an active aggregate-document is the mean value of the source-documents belonging to this aggregate-document.
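The aggregate case can be sketched in base R with a small dense count matrix. All objects below (doc_term, group, edu, age) are hypothetical toy data, and rowsum() and tapply() are used here only to illustrate the idea of summing counts and averaging quantitative values per category; this is not the Xplortext implementation.

```r
# Toy non-aggregate documents-by-words counts (hypothetical data).
doc_term <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    3, 1, 0,
    0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("family", "health", "work"))
)
group <- factor(c("Men", "Women", "Men", "Women"))  # aggregation variable
edu   <- factor(c("Low", "High", "High", "Low"))    # categorical context
age   <- c(20, 30, 40, 50)                          # quantitative context

# Aggregate lexical table: word counts summed over the source-documents
# of each category; rows are named after the categories.
agg_term <- rowsum(doc_term, group)

# Supplementary aggregate table for one categorical contextual variable:
# one row per category of that variable (sketch assumption).
quali_table <- rowsum(doc_term, edu)

# Quantitative context: mean value of the source-documents in each category.
agg_age <- tapply(age, group, mean)

print(agg_term)
print(quali_table)
print(agg_age)
```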

References

Lebart, L., Salem, A., & Berry, L. (1998). Exploring textual data. Dordrecht: Kluwer Academic Publishers.

Examples

# Load the package that provides TextData and the open.question data set
library(Xplortext)
data(open.question)

# Non-aggregate analysis
res.TD <- TextData(open.question, var.text=c(9,10), remov.number=TRUE, Fmin=10, Dmin=10,
 stop.word.tm=TRUE, context.quali=c("Gender","Age_Group","Education"), context.quanti=c("Age"))

# Aggregate analysis and repeated segments
res.TD <- TextData(open.question, var.text=c(9,10), var.agg="Gen_Age", remov.number=TRUE,
 Fmin=10, Dmin=10, stop.word.tm=TRUE, context.quali=c("Gender","Age_Group","Education"),
 context.quanti=c("Age"), segment=TRUE)
