lsa (version 0.73.2)

lsa: Create a vector space with Latent Semantic Analysis (LSA)

Description

Calculates a latent semantic space from a given document-term matrix.

Usage

lsa( x, dims=dimcalc_share() )

Arguments

x

a document-term matrix (recommended to be of class textmatrix), containing documents in columns, terms in rows, and occurrence frequencies in the cells.

dims

either the number of dimensions or a configuring function.

Value

LSAspace

a list with components (\(T_k, S_k, D_k\)), representing the latent semantic space.

Details

LSA combines the classical vector space model --- well known in text mining --- with a Singular Value Decomposition (SVD), a two-mode factor analysis. In this way, bag-of-words representations of texts can be mapped into a modified vector space that is assumed to reflect semantic structure.

With lsa() a new latent semantic space can be constructed over a given document-term matrix. To ease comparisons of terms and documents with common correlation measures, the space can be converted into a textmatrix of the same format as x by calling as.textmatrix().
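
A minimal sketch of this workflow (assuming the lsa package is loaded and myMatrix is a document-term matrix as produced by textmatrix()):

# sketch: build the space, convert it back, and compare two documents
myLSAspace = lsa(myMatrix, dims=dimcalc_share())
myReducedMatrix = as.textmatrix(myLSAspace)        # same format as the input matrix
cosine(myReducedMatrix[,1], myReducedMatrix[,2])   # document similarity in the reduced space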

Additional documents or queries can be added to this latent semantic space later on by `folding them in' (with the function fold_in()), which keeps them from influencing the original factor distribution (i.e., the latent semantic structure calculated from a primary text corpus).
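
A hedged sketch of such a fold-in (newDocsDir is a hypothetical directory with additional text files; reusing the original vocabulary keeps the term order of the space):

# sketch: fold new documents into the existing space without recomputing the SVD
newMatrix = textmatrix(newDocsDir, vocabulary=rownames(myMatrix))  # newDocsDir: hypothetical directory
foldedDocs = fold_in(newMatrix, myLSAspace)   # textmatrix in the term order of the space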

Background information (see also Deerwester et al., 1990):

A document-term matrix \(M\) is constructed with textmatrix() from a given text base of \(n\) documents containing \(m\) terms. This matrix \(M\) of size \(m \times n\) is then decomposed via a singular value decomposition into the term vector matrix \(T\) (constituting the left singular vectors), the document vector matrix \(D\) (constituting the right singular vectors), both orthonormal, and the diagonal matrix \(S\) (constituting the singular values).

\(M = TSD^T\)
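
For illustration, a minimal sketch of this decomposition with base R's svd() (a toy matrix stands in for a real document-term matrix):

# illustrative sketch: decompose a toy term-by-document matrix and check M = T S t(D)
M = matrix(c(1,0,1, 0,1,0, 1,1,0), nrow=3)
s = svd(M)
Tmat = s$u           # term vectors (left singular vectors)
Dmat = s$v           # document vectors (right singular vectors)
Smat = diag(s$d)     # singular values on the diagonal
all.equal(M, Tmat %*% Smat %*% t(Dmat))   # TRUE up to numerical precision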

These matrices are then reduced to the given number of dimensions \(k=dims\), resulting in the truncated matrices \(T_{k}\), \(S_{k}\) and \(D_{k}\) --- the latent semantic space.

\(M_k = \sum\limits_{i=1}^k t_i \cdot s_i \cdot d_i^T\)

If these matrices \(T_k, S_k, D_k\) were multiplied, they would give a new matrix \(M_k\) (of the same format as \(M\), i.e., rows are the same terms, columns are the same documents), which is the least-squares best fit approximation of \(M\) with \(k\) singular values.
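
Accordingly, a short sketch of rebuilding this rank-\(k\) approximation from a space returned by lsa() (assuming the returned list components are named tk, sk and dk, that myMatrix is a document-term matrix as in the examples below, and that dims=2 is a sensible choice for the data at hand):

# sketch: rebuild the rank-k approximation M_k from the truncated matrices
myLSAspace = lsa(myMatrix, dims=2)
Mk = myLSAspace$tk %*% diag(myLSAspace$sk) %*% t(myLSAspace$dk)
# Mk has the same rows (terms) and columns (documents) as the input matrix;
# as.textmatrix() returns essentially this reconstruction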

In the case of folding-in, i.e., multiplying new documents into a given latent semantic space, the matrices \(T_k\) and \(S_k\) remain unchanged and an additional set of document vectors \(\hat{D}_k\) is created (without replacing the old one). All three are multiplied together to return a (new and appendable) document-term matrix \(\hat{M}\) in the term order of \(M\).
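
Following Deerwester et al. (1990), this amounts to projecting the new documents onto the existing term space (a sketch of the usual formula; \(M_{new}\) here denotes the new documents, in the term order of \(M\)):

\(\hat{D}_k = M_{new}^T T_k S_k^{-1}, \qquad \hat{M} = T_k S_k \hat{D}_k^T\)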

References

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990) Indexing by Latent Semantic Analysis. In: Journal of the American Society for Information Science 41(6), pp. 391--407.

Landauer, T., Foltz, P., and Laham, D. (1998) Introduction to Latent Semantic Analysis. In: Discourse Processes 25, pp. 259--284.

See Also

as.textmatrix, fold_in, textmatrix, gw_idf, dimcalc_share

Examples

# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("ham", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "pet", "pet"), file=paste(td, "D3", sep="/") )

# LSA
data(stopwords_en)   # stopword list shipped with the lsa package
myMatrix = textmatrix(td, stopwords=stopwords_en)
myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)   # local log-tf times global idf weighting
myLSAspace = lsa(myMatrix, dims=dimcalc_share())
as.textmatrix(myLSAspace)

# clean up
unlink(td, recursive=TRUE)
