LexChar: Characteristic words and documents (LexChar)

Description

Measure of the association between vocabulary or words and quantitative or qualitative contextual variables.

Usage

LexChar(object, proba=0.05, maxCharDoc=10, maxPrnDoc=100, 
              marg.doc="before",  context=NULL, correct=TRUE, nbsample=500,
              seed=12345,...)

Value

Returns a list including:

CharWord: characteristic words of all the documents
stats: association statistics of the lexical table
CharDoc: characteristic source-documents of all the aggregate-documents including qualitative contextual variables
Vocab: characteristic quantitative and qualitative variables of the words. CharWord and stats are provided.

Arguments

object: TextData, DocumentTermMatrix, data frame or matrix object
proba: threshold on the p-value used when selecting the characteristic words (by default 0.05)
maxCharDoc: maximum number of characteristic source-documents to extract (by default 10). See details
maxPrnDoc: maximum length to be printed for a characteristic document (by default 100 characters)
marg.doc: document weighting. If "after" or "before", the frequencies after or before the TextData selection are used as document weights (by default "before"); if "before.RW", all words under the thresholds set in the TextData function are grouped into a new word named RemovedWords
context: name of quantitative or qualitative variables
correct: if TRUE, a p-value correction is applied to the tests for quantitative contextual variables (by default TRUE)
nbsample: number of samples drawn to evaluate the p-values for quantitative contextual variables (by default 500)
seed: seed used to obtain reproducible results in the permutation tests (by default 12345)
...: further arguments passed to or from other methods

Author

Monica Bécue-Bertaut, Ramón Alvarez-Esteban ramon.alvarez@unileon.es, Josep-Antón Sánchez-Espigares, Belchin Kostov

Details

The lexical table provided by TextData can consider either source-documents or aggregate-documents, depending on the value of the argument "var.agg" in TextData. Qualitative contextual variables make it possible to aggregate documents by combining the categories of the qualitative variables and of the aggregation variable, if any.

Extracting the characteristic words (CharWord) for too large a number of documents is time-consuming and of little interest.

In any case, only the first maxPrnDoc characters of each characteristic document are printed (by default 100).

In the case of the association between words and qualitative variables, the usual characteristic words are provided.
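As a rough illustration of what "usual characteristic words" can mean, the sketch below applies the classical hypergeometric test described in Lebart et al. (1998) to a toy lexical table. It is an assumption that this reflects the spirit of the computation; the helper char_word_pvalue and the toy counts are hypothetical and are not part of the package.

```r
# Toy lexical table: rows = documents (or categories), columns = words.
lex <- matrix(c(12, 3,  5,
                 2, 9,  4,
                 1, 2, 12),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("docA", "docB", "docC"),
                              c("health", "money", "family")))

# Hypothetical helper (not a package function): two-sided hypergeometric
# p-value for the frequency of one word in one document.
char_word_pvalue <- function(lex, doc, word) {
  n_total <- sum(lex)          # total occurrences in the corpus
  n_doc   <- sum(lex[doc, ])   # occurrences in the document
  n_word  <- sum(lex[, word])  # occurrences of the word in the corpus
  n_obs   <- lex[doc, word]    # occurrences of the word in the document
  # P(X >= n_obs) and P(X <= n_obs) if occurrences were allocated at random
  p_over  <- phyper(n_obs - 1, n_word, n_total - n_word, n_doc, lower.tail = FALSE)
  p_under <- phyper(n_obs, n_word, n_total - n_word, n_doc, lower.tail = TRUE)
  min(1, 2 * min(p_over, p_under))
}

p <- char_word_pvalue(lex, "docA", "health")
p < 0.05  # compare with a threshold such as proba = 0.05
```

A word would be retained as characteristic of a document when its p-value falls below the chosen threshold, which plays the same role as the proba argument.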

quali$CharWord provides the characteristic words of the qualitative variables (including the aggregation variable) and their categories.
quali$stats provides association statistics for the vocabulary and the qualitative variables (including the aggregation variable).
quali$CharDoc provides the characteristic source-documents for the categories.
quanti$CharWord provides the characteristic quantitative variables for each word. If there is an aggregation variable and/or a qualitative contextual variable, it is computed from the aggregated lexical table.
quanti$stats provides statistics for the vocabulary and the quantitative variables. If there is an aggregation variable and/or a qualitative contextual variable, it is computed from the aggregated lexical table.
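For quantitative variables, the nbsample and seed arguments indicate that p-values are evaluated by resampling. The sketch below shows a generic permutation test in base R on toy data. The statistic used here (the mean of the variable weighted by the word frequencies) and the data are assumptions chosen for illustration, not necessarily what the function computes internally.

```r
set.seed(12345)   # plays the same role as the seed argument
nbsample <- 500   # plays the same role as the nbsample argument

# Toy data: occurrences of one word per document and a quantitative variable.
word_freq <- c(3, 0, 2, 0, 4, 1, 0, 0)
age       <- c(62, 25, 58, 31, 70, 45, 28, 33)

# Observed statistic: mean of the variable weighted by the word frequencies.
obs <- weighted.mean(age, word_freq)

# Null distribution: shuffle the variable across documents.
perm <- replicate(nbsample, weighted.mean(sample(age), word_freq))

# Two-sided permutation p-value, with the usual +1 correction.
center <- mean(age)
pval <- (sum(abs(perm - center) >= abs(obs - center)) + 1) / (nbsample + 1)
pval
```

Fixing the seed makes the permuted samples, and therefore the p-value, reproducible from one run to the next.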

If the lexical table (object) is not a TextData object, the context argument can refer to columns of the same data frame. The aggregate lexical table is constructed from the combinations of the categories of the qualitative variables (including the aggregation variable).
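The construction of an aggregate lexical table from combinations of categories can be sketched in base R as follows. The toy counts and the variables gender and edu are hypothetical, and the code only illustrates the idea of summing word counts within each combination; it is not the package's implementation.

```r
# Toy document-by-word counts (one row per source-document).
counts <- matrix(c(1, 0, 2,
                   0, 3, 1,
                   2, 1, 0,
                   1, 1, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("health", "money", "family")))

# Hypothetical qualitative variables describing the same documents.
meta <- data.frame(gender = c("F", "M", "F", "M"),
                   edu    = c("low", "low", "low", "high"))

# One group per observed combination of categories.
groups <- interaction(meta$gender, meta$edu, drop = TRUE)

# Sum the word counts within each combination.
agg <- rowsum(counts, group = groups)
agg
```

Each row of agg is then an aggregate-document, here one per observed gender and education combination.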

References

Lebart, L., Salem, A., & Berry, L. (1998). Exploring textual data. Dordrecht: Kluwer. doi:10.1007/978-94-017-1525-6

Examples

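library(Xplortext)
data(open.question)
# Build the lexical table, aggregating the documents by Gen_Edu
res.TD <- TextData(open.question, var.text=c(9,10), var.agg="Gen_Edu", Fmin=10, Dmin=10,
                   remov.number=TRUE, stop.word.tm=TRUE)
# Extract the characteristic words and documents with the default settings
res.LexChar <- LexChar(res.TD)
summary(res.LexChar)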