The vocabulary of this DSM covers several basic evaluation tasks, including RG65, WordSim353 and ESSLLI08_Nouns, as well as the target nouns bank and vessel from SemCorWSD. In addition, 40 nearest neighbours each of the words white_J, apple_N, kindness_N and walk_V are included.
Co-occurrence frequency data were extracted from a collection of Web corpora with a total size of ca. 9 billion words, using an L4/R4 surface window (four tokens to the left and four to the right of the target) and 30,000 lexical words as feature terms. They were scored with sparse simple log-likelihood with an additional log transformation, normalized to Euclidean unit length, and projected into 1000 latent dimensions using randomized SVD (see rsvd). For size reasons, the vectors have been compressed into 50 latent dimensions and renormalized.
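The final compression and renormalization step can be illustrated with a minimal sketch. It assumes that compression means keeping only the leading latent dimensions of each vector, which the description above does not state explicitly. The function names and the toy data are hypothetical and not part of the package.

```python
import math

def normalize_rows(rows):
    """Scale each row vector to Euclidean unit length (all-zero rows are left unchanged)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 0 else list(row))
    return out

def truncate_and_renormalize(rows, k):
    """Keep the first k dimensions of each row (assumed compression), then renormalize."""
    return normalize_rows([row[:k] for row in rows])

# Hypothetical toy data: 3 target words in 4 latent dimensions
latent = [
    [3.0, 4.0, 0.0, 12.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]

unit = normalize_rows(latent)                    # every non-zero row now has length 1
compressed = truncate_and_renormalize(unit, 2)   # keep 2 dimensions, restore unit length

for row in compressed:
    print([round(x, 3) for x in row])
```

Dropping dimensions removes part of each vector, so its length falls below 1. Renormalizing restores unit length, so that cosine similarity between two vectors is again simply their dot product.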