dimcalc: Dimensionality Calculation Routines (LSA)

Description

Methods for choosing a 'good' number of singular values for the dimensionality reduction in LSA.

Usage

dimcalc_share(share=0.5)
dimcalc_ndocs(ndocs)
dimcalc_kaiser()
dimcalc_raw()
dimcalc_fraction(frac=(1/50))

Arguments

share

Optional: the fraction of the sum of all singular values that the sum of the selected singular values should reach (default: 0.5). Only needed by dimcalc_share().

frac

Optional: the fraction of the number of singular values to be used (default: 1/50). Only needed by dimcalc_fraction().

ndocs

The number of documents. Only needed by dimcalc_ndocs(), which has no default for it.

Value

Returns a function that takes the singular values as its parameter and returns the recommended number of dimensions. The expected parameter of this function is

s

A sequence of singular values (as produced by the SVD). Only needed when calling the dimensionality calculation routines directly.

Details

In an LSA process, the diagonal matrix of the singular value decomposition is usually reduced to a specific number of dimensions (also called 'factors' or 'singular values').

The functions dimcalc_share(), dimcalc_ndocs(), dimcalc_kaiser(), dimcalc_fraction() and the redundant function dimcalc_raw() offer methods to calculate a useful number of singular values, based on the distribution and values of the given sequence of singular values.

All of them are tightly coupled to the core LSA functions: each generates a function to be executed by the calling (higher-level) function lsa(). The generated function takes only one parameter, s, which is expected to be the sequence of singular values. Inside lsa(), the returned function is executed and the singular values it requires are passed in as that parameter.

The dimensionality calculation methods can still be called directly by adding a second, separate parameter set, e.g.

dimcalc_share(share=0.2)(mysingularvalues)
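The two-step call works because each routine returns a function rather than a number. The following minimal sketch illustrates that pattern with a hypothetical generator, my_dimcalc_fixed(), which is not part of the lsa package and is introduced here only for illustration.

```r
# Hypothetical generator (NOT part of lsa): it returns a function of s
# that recommends at most k dimensions.
my_dimcalc_fixed <- function(k = 2) {
  function(s) {
    # never recommend more dimensions than there are singular values
    min(k, length(s))
  }
}

s <- c(3.2, 1.5, 0.7)             # made-up singular values, descending

pick <- my_dimcalc_fixed(k = 2)   # step 1: configure, get a function back
res_two_step <- pick(s)           # step 2: apply it to the singular values

res_direct <- my_dimcalc_fixed(k = 5)(s)   # both steps in one expression
```

Here the first parameter set configures the rule and the second supplies the singular values, which is the same shape as dimcalc_share(share=0.2)(mysingularvalues).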

The method dimcalc_share() finds the first position in the descending sequence of singular values s where the cumulative sum (divided by the sum of all values) meets or exceeds the specified share.
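As a rough illustration of this rule, the sketch below re-states it in base R. The name sketch_share() is hypothetical, and the code follows the description above rather than the package source, so the behaviour at exact ties or for a share above 1 may differ from dimcalc_share().

```r
# Hypothetical re-statement of the documented rule (NOT the lsa source).
sketch_share <- function(share = 0.5) {
  function(s) {
    cum_share <- cumsum(s) / sum(s)   # running share of the total
    which(cum_share >= share)[1]      # first position reaching the share
  }
}

s <- c(4, 3, 2, 1)                    # made-up values; running shares 0.4, 0.7, 0.9, 1.0
k_half <- sketch_share(0.5)(s)        # 0.7 is the first share >= 0.5
k_most <- sketch_share(0.8)(s)        # 0.9 is the first share >= 0.8
```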

The method dimcalc_ndocs() calculates the first position in the descending sequence of singular values where the cumulative sum meets or exceeds the number of documents.
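The same idea can be sketched for the document-count rule. The name sketch_ndocs() is hypothetical, and the fallback to length(s) when the cumulative sum never reaches ndocs is an assumption made for this sketch, not documented behaviour.

```r
# Hypothetical re-statement of the documented rule (NOT the lsa source).
sketch_ndocs <- function(ndocs) {
  function(s) {
    hit <- which(cumsum(s) >= ndocs)  # positions where the running sum reaches ndocs
    # ASSUMPTION: use all singular values if the threshold is never reached
    if (length(hit) > 0) hit[1] else length(s)
  }
}

s <- c(4, 3, 2, 1)                    # made-up values; running sums 4, 7, 9, 10
k_reached <- sketch_ndocs(6)(s)       # 7 is the first running sum >= 6
k_fallback <- sketch_ndocs(50)(s)     # never reached, falls back to length(s)
```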

The method dimcalc_kaiser() calculates the number of singular values according to the Kaiser criterion, i.e. from the descending sequence, all values with s[n] > 1 are taken. The number of dimensions is returned accordingly.
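Because s is sorted in descending order, counting the values above 1 gives the same result as finding the last position that satisfies the criterion. A minimal sketch, with the hypothetical name sketch_kaiser() and a strict comparison as described above:

```r
# Hypothetical re-statement of the documented rule (NOT the lsa source).
sketch_kaiser <- function() {
  function(s) {
    sum(s > 1)   # number of singular values strictly greater than 1
  }
}

s <- c(2.5, 1.4, 1.0, 0.3)     # made-up singular values, descending
k_kaiser <- sketch_kaiser()(s) # only 2.5 and 1.4 exceed 1
```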

The method dimcalc_fraction() returns the specified fraction of the number of singular values. By default, 1/50th of the available values is used, and the resulting number of singular values is returned.
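The description does not say how a non-integer result is handled, so the sketch below makes two assumptions that are not taken from the package: the product is rounded to the nearest integer, and at least one dimension is always returned. The name sketch_fraction() is hypothetical.

```r
# Hypothetical re-statement of the documented rule (NOT the lsa source).
sketch_fraction <- function(frac = 1/50) {
  function(s) {
    # ASSUMPTIONS: round to the nearest integer, never return fewer than 1
    max(1, round(length(s) * frac))
  }
}

s <- seq(100, 1)                          # 100 made-up singular values, descending
k_default <- sketch_fraction()(s)         # 1/50 of 100 values
k_quarter <- sketch_fraction(0.25)(s[1:8])  # a quarter of 8 values
```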

The method dimcalc_raw() returns the maximum number of singular values (= the length of s). It is included only for completeness.

References

Wild, F., Stahl, C., Stermsek, G., Neumann, G., Penya, Y. (2005) Parameters Driving Effectiveness of Automated Essay Scoring with LSA. In: Proceedings of the 9th CAA, pp.485-494, Loughborough

Examples


library(lsa)

## create some data: three documents (columns) over twelve terms (rows)
vec1 <- c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
vec2 <- c( 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0 )
vec3 <- c( 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0 )
td <- cbind(vec1, vec2, vec3)   # named td to avoid masking base::matrix
s <- svd(td)$d                  # singular values in descending order

# standard share of 0.5
dimcalc_share()(s)

# specific share of 0.9
dimcalc_share(share=0.9)(s)

# meeting the number of documents (here: 3)
n <- ncol(td)
dimcalc_ndocs(n)(s)
