LSA combines the classical vector space model --- well known in
text mining --- with a Singular Value Decomposition (SVD), a two-mode
factor analysis. In this way, bag-of-words representations of texts
are mapped into a modified vector space that is assumed to reflect
semantic structure.
With lsa() a new latent semantic space can
be constructed over a given document-term matrix. To ease
comparisons of terms and documents with common
correlation measures, the space can be converted back into
a textmatrix of the same format as the input
document-term matrix by calling as.textmatrix().
Additional documents or queries can be `folded in' later on
(with the function fold_in()): they are mapped into the existing
latent semantic space without influencing the original
factor distribution (i.e., the latent semantic structure calculated
from the primary text corpus).
Background information (see also Deerwester et al., 1990):
A document-term matrix \(M\) is constructed
with textmatrix()
from a given text base of \(n\) documents
containing \(m\) terms.
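The shape of such a document-term matrix can be sketched as follows. This is only an illustrative Python construction of an \(m \times n\) term-by-document count matrix (rows are terms, columns are documents); it is not the implementation of textmatrix(), and the variable names are made up for the example.

```python
# Build a small m x n term-by-document count matrix
# (rows = terms, columns = documents); illustrative only.
from collections import Counter

docs = ["human machine interface",
        "machine learning interface",
        "graph trees"]
counts = [Counter(d.split()) for d in docs]          # per-document term counts
terms = sorted({t for c in counts for t in c})       # the m distinct terms
M = [[c[t] for c in counts] for t in terms]          # m rows, n = 3 columns

# e.g. the row for "interface" marks documents 1 and 2
print(terms)
print(M)
```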
This matrix \(M\) of the size \(m \times n\) is then decomposed via a
singular value decomposition into the term vector matrix \(T\) (containing
the left singular vectors) and the document vector matrix \(D\) (containing
the right singular vectors), both with orthonormal columns, and the diagonal
matrix \(S\) (containing the singular values):
\(M = TSD^T\)
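A minimal numerical sketch of this decomposition, using NumPy rather than the package's R code (the matrix here is an arbitrary toy example):

```python
# Verify M = T S D^T numerically; illustrative only.
import numpy as np

M = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])               # m = 4 terms, n = 3 documents

T, s, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(s)                             # diagonal matrix of singular values

# T and D have orthonormal columns, and T S D^T reproduces M
assert np.allclose(T.T @ T, np.eye(3))
assert np.allclose(T @ S @ Dt, M)
```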
These matrices are then reduced to the given number of dimensions \(k=dims\),
resulting in the truncated matrices \(T_{k}\), \(S_{k}\) and \(D_{k}\)
--- the latent semantic space.
\(M_k = \sum\limits_{i=1}^k t_i \cdot s_i \cdot d_i^T\)
If these matrices \(T_k, S_k, D_k\) were multiplied, they would give a new
matrix \(M_k\) (of the same format as \(M\), i.e., rows are the
same terms, columns are the same documents), which is the least-squares best
approximation of \(M\) of rank \(k\).
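Continuing the NumPy sketch above (illustrative only, not the package's internals), the truncation and the sum formulation give the same rank-\(k\) matrix:

```python
# Truncate the SVD to k dimensions and check the rank-k reconstruction
# M_k = sum_{i=1}^k t_i s_i d_i^T; illustrative only.
import numpy as np

M = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

T, s, Dt = np.linalg.svd(M, full_matrices=False)
k = 2
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T   # truncated matrices

Mk = Tk @ Sk @ Dk.T                                   # rank-k approximation
Mk_sum = sum(s[i] * np.outer(T[:, i], Dt[i, :]) for i in range(k))
assert np.allclose(Mk, Mk_sum)                        # same matrix either way
```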
In the case of folding-in, i.e., mapping new documents into a given
latent semantic space, the matrices \(T_k\) and \(S_k\) remain unchanged
and an additional document vector matrix \(\hat{D}\) is created (without
replacing the old one). All three are multiplied together to return a
(new and appendable) document-term matrix \(\hat{M}\) in the term order
of \(M\).
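The folding-in step can be sketched numerically as well. This NumPy example assumes the standard fold-in mapping from Deerwester et al. (1990), \(\hat{d} = q^T T_k S_k^{-1}\) for a new document vector \(q\) in the term order of \(M\); it is not the implementation of fold_in(), and the matrix and query are made up for illustration.

```python
# Fold a new document q into a fixed latent semantic space:
# T_k and S_k stay unchanged, only a new document vector is computed.
import numpy as np

M = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
T, s, Dt = np.linalg.svd(M, full_matrices=False)
k = 2
Tk, Sk = T[:, :k], np.diag(s[:k])

q = np.array([0., 1., 1., 0.])            # new document, same term order as M
d_hat = q @ Tk @ np.linalg.inv(Sk)        # folded-in document vector

# multiplying T_k, S_k and the new vector back gives the new
# column of the appendable document-term matrix M_hat
m_hat = Tk @ Sk @ d_hat
```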