lsa (version 0.73.2)

fold_in: Ex-post folding-in of textmatrices into an existing latent semantic space

Description

Additional documents can be mapped into a pre-existing latent semantic space without influencing the factor distribution of the space. Use this when additional documents must not alter the latent semantic factor structure that has already been calculated.

Usage

fold_in( docvecs, LSAspace )

Arguments

LSAspace

a latent semantic space generated by lsa().

docvecs

a textmatrix.

Value

textmatrix

a textmatrix representation of the additional documents in the latent semantic space.

Details

To keep additional documents from influencing the factor distribution calculated previously from a particular text basis, they can be folded-in after the singular value decomposition performed in lsa().

Background Information: For folding-in, a pseudo-document vector \(\hat{m}\) of the new documents is calculated as shown in equations (1) and (2) (cf. Berry et al., 1995):

(1) \(\hat{d} = v^T T_k S_k^{-1}\)

(2) \(\hat{m} = T_k S_k \hat{d}\)

The document vector \(v^T\) in equation (1) is identical to an additional column of an input textmatrix \(M\), holding the term frequencies of the document to be folded in. \(T_k\) and \(S_k\) are the truncated matrices from the SVD applied through lsa() on a given text collection to construct the latent semantic space. The resulting vector \(\hat{m}\) from equation (2) is identical to an additional column in the textmatrix representation of the latent semantic space (as produced by as.textmatrix()). Be careful when using weighting schemes: the new data that you fold in should normally be weighted with the global weights of the training textmatrix, not with global weights recomputed from the new data.
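As a rough illustration of equations (1) and (2), the following sketch reproduces the two steps with base R only. It is not the package's implementation: it uses svd() instead of lsa(), and the names M, k, v, Tk, Sk, d_hat and m_hat are assumptions chosen for this example. The row vector \(\hat{d}\) from equation (1) is transposed to a column before it is used in equation (2).

```r
# Toy term-by-document matrix (rows = terms, columns = documents).
# matrix() fills column by column, so each group of six values is one document.
M <- matrix(c(1, 1, 1, 0, 0, 0,    # D1: dog, cat, mouse
              0, 0, 1, 1, 1, 0,    # D2: mouse, hamster, sushi
              1, 0, 0, 0, 0, 2),   # D3: dog, monster, monster
            nrow = 6,
            dimnames = list(c("dog", "cat", "mouse", "hamster", "sushi", "monster"),
                            c("D1", "D2", "D3")))

k  <- 2                               # number of retained dimensions (assumed)
s  <- svd(M)
Tk <- s$u[, 1:k, drop = FALSE]        # truncated term matrix T_k
Sk <- diag(s$d[1:k], nrow = k)        # truncated singular values S_k

# Term frequencies of the new document over the SAME vocabulary as M.
v <- c(dog = 0, cat = 1, mouse = 2, hamster = 0, sushi = 0, monster = 0)

# Equation (1): coordinates of the new document in the latent space (1 x k).
d_hat <- t(v) %*% Tk %*% solve(Sk)

# Equation (2): map back to term space (terms x 1), i.e. the new column
# of the textmatrix representation of the space.
m_hat <- Tk %*% Sk %*% t(d_hat)
rownames(m_hat) <- rownames(M)
m_hat
```

Because \(S_k S_k^{-1}\) cancels, \(\hat{m}\) is simply the projection of \(v\) onto the subspace spanned by the columns of \(T_k\); the training documents and the factors themselves are left untouched.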

See Also

textmatrix, lsa, as.textmatrix

Examples

# NOT RUN {
# create a first textmatrix with some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )
matrix1 = textmatrix(td, minWordLength=1)
unlink(td, recursive=TRUE)

# create a second textmatrix with some more files
td = tempfile()
dir.create(td)
write( c("cat", "mouse", "mouse"), file=paste(td, "A1", sep="/") )
write( c("nothing", "mouse", "monster"), file=paste(td, "A2", sep="/") )
write( c("cat", "monster", "monster"), file=paste(td, "A3", sep="/") )
matrix2 = textmatrix(td, vocabulary=rownames(matrix1), minWordLength=1)
unlink(td, recursive=TRUE)

# create an LSA space from matrix1
space1 = lsa(matrix1, dims=dimcalc_share())
as.textmatrix(space1)

# fold matrix2 into the space generated by matrix1
fold_in( matrix2, space1)

# }
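The note on weighting schemes in the Details section can be sketched as follows, assuming the lsa package is installed. The helper make_tm() and the object names are hypothetical and exist only for this example. The sketch assumes that gw_idf() returns one global weight per term (row) and that the rows of the second textmatrix line up with the first because it was built with vocabulary=rownames(...); treat it as an illustration, not as the only valid weighting workflow.

```r
library(lsa)

# Hypothetical helper for this sketch: write each character vector in `docs`
# to its own file in a fresh temporary directory and build a textmatrix.
make_tm <- function(docs, ...) {
  td <- tempfile()
  dir.create(td)
  on.exit(unlink(td, recursive = TRUE))
  for (name in names(docs)) write(docs[[name]], file = file.path(td, name))
  textmatrix(td, minWordLength = 1, ...)
}

train <- make_tm(list(D1 = c("dog", "cat", "mouse"),
                      D2 = c("hamster", "mouse", "sushi"),
                      D3 = c("dog", "monster", "monster")))
new   <- make_tm(list(A1 = c("cat", "mouse", "mouse"),
                      A2 = c("nothing", "mouse", "monster"),
                      A3 = c("cat", "monster", "monster")),
                 vocabulary = rownames(train))

# Global weights are computed ONCE, from the training data only.
gw <- gw_idf(train)

# Local weights come from each matrix itself; the global weights are shared.
train_w <- lw_logtf(train) * gw
new_w   <- lw_logtf(new) * gw

space  <- lsa(train_w, dims = dimcalc_share())
folded <- fold_in(new_w, space)
folded
```

Recomputing gw_idf() on the new documents instead would put the two matrices on different scales, so the folded-in columns would no longer be comparable with those of the training space.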
