splitWordlist: Construct sparse matrices from comparative wordlists (aka `Swadesh list')

Description

A comparative wordlist (aka `Swadesh list') is a collection of wordforms from different languages, which are translations of a selected set of meanings. This function dismantles this data structure into a set of sparse matrices.

Usage

splitWordlist(data,
	doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
	splitstrings = TRUE, sep =  "", bigram.binder = "", grapheme.binder = "_", 
	simplify = FALSE)

Arguments

Value

There are four different possible outputs, depending on the option chosen.

By default, when splitstrings = TRUE, simplify = FALSE, the following list of 15 objects is returned. It starts with 8 different character vectors, which are actually the row/column names of the sparse pattern matrices that follow them. The naming of the objects is an attempt to make everything easy to remember.

doculects: Character vector with names of doculects in the data

concepts: Character vector with names of concepts in the data

words: Character vector with all words, i.e. unique counterparts per language. The same string in the same language is only included once, but an identical string occurring in different doculects is included separately for each doculect.

segments: Character vector with all unigram-tokens in order of appearance, including boundary symbols and gap symbols (see splitStrings for more information about the gap symbols)

unigrams: Character vector with all unique unigrams in the data

bigrams: Character vector with all unique bigrams in the data

graphemes: Character vector with all unique graphemes (i.e. combinations of unigrams+doculects) occurring in the data

digraphs: Character vector with all unique digraphs (i.e. combinations of bigrams+doculects) occurring in the data

DW: Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W)

CW: Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W)

SW: Sparse pattern matrix of class ngCMatrix linking all token-segments (S) to words (W)

US: Sparse pattern matrix of class ngCMatrix linking unigrams (U) to segments (S)

BS: Sparse pattern matrix of class ngCMatrix linking bigrams (B) to segments (S)

GS: Sparse pattern matrix of class ngCMatrix linking language-specific graphemes (G) to segments (S)

TS: Sparse pattern matrix of class ngCMatrix linking digraphs (T, as no other letter was available) to segments (S)

When splitstrings = FALSE, simplify = FALSE, only the following objects from the above list are returned:

doculects: Character vector with names of doculects in the data

concepts: Character vector with names of concepts in the data

words: Character vector with all words, i.e. unique counterparts per language. The same string in the same language is only included once, but an identical string occurring in different doculects is included separately for each doculect.

DW: Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W)

CW: Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W)

When splitstrings = TRUE, simplify = TRUE, only the bigram-separation is returned, and all row and column names are included in the matrices. However, for reasons of space, the words vector is only included once:

DW: Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W). Doculects are in the rownames, colnames are left empty.

CW: Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W). Concepts are in the rownames, colnames are left empty.

BW: Sparse pattern matrix of class ngCMatrix linking bigrams (B) to words (W). Bigrams (note: not digraphs!) are in the rownames. This matrix includes all words as colnames.

Finally, when splitstrings = FALSE, simplify = TRUE, only the following subset of the above is returned:

DW: Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W). Doculects are in the rownames, colnames are left empty.

CW: Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W). Concepts are in the rownames, colnames are left empty.
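
The separate matrices are meant to be chained by matrix multiplication. The following is a minimal sketch of that idea under simplifying assumptions: the two toy words are invented, boundary and gap symbols are left out, and the matrices are built by hand with the Matrix package rather than by splitWordlist itself. It links unigrams to words by going through the segments.

```r
library(Matrix)

words    <- c("aba", "ab")          # toy words, invented for illustration
chars    <- strsplit(words, "")     # split each word into unigram tokens
segments <- unlist(chars)           # all tokens in order of appearance
unigrams <- unique(segments)        # unique unigram types

# SW: segments (rows) x words (columns), pattern matrix
SW <- sparseMatrix(i = seq_along(segments),
                   j = rep(seq_along(words), lengths(chars)),
                   dims = c(length(segments), length(words)))

# US: unigrams (rows) x segments (columns), pattern matrix
US <- sparseMatrix(i = match(segments, unigrams),
                   j = seq_along(segments),
                   dims = c(length(unigrams), length(segments)))

# multiplying by 1 turns the pattern matrices into numeric ones, so that
# the product counts how often each unigram occurs in each word
UW <- (US * 1) %*% (SW * 1)
dimnames(UW) <- list(unigrams, words)
as.matrix(UW)
```

The same pattern (multiply by 1, then take the product along the shared dimension) is what the examples further down use to combine CW, SW and GS.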

Details

The meanings that are selected for a wordlist are called CONCEPTS here, and the translations into the various languages COUNTERPARTS (following Poornima & Good 2010). The languages are called DOCULECTS (`documented lects') to generalize over their status as dialects, languages, or even small families (following Cysouw & Good 2013).
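
To make this terminology concrete, the sketch below builds the DW and CW matrices by hand for a tiny invented wordlist. It is an illustration only, assuming the default column names DOCULECT, CONCEPT and COUNTERPART; the helper `link` and the toy data are hypothetical and do not show how splitWordlist works internally.

```r
library(Matrix)

# toy wordlist in long format, invented forms
wordlist <- data.frame(
  DOCULECT    = c("lang_a", "lang_a", "lang_b", "lang_b", "lang_b"),
  CONCEPT     = c("hand", "arm", "hand", "arm", "water"),
  COUNTERPART = c("mano", "mano", "ma", "bra", "aka"),
  stringsAsFactors = FALSE
)

# a word is a unique counterpart within one doculect
word_id   <- paste(wordlist$DOCULECT, wordlist$COUNTERPART, sep = "\t")
words     <- unique(word_id)
doculects <- unique(wordlist$DOCULECT)
concepts  <- unique(wordlist$CONCEPT)

# hypothetical helper: pattern matrix linking row labels to column labels
link <- function(rows, cols, row_names, col_names) {
  ij <- unique(cbind(match(rows, row_names), match(cols, col_names)))
  sparseMatrix(i = ij[, 1], j = ij[, 2],
               dims = c(length(row_names), length(col_names)),
               dimnames = list(row_names, col_names))
}

DW <- link(wordlist$DOCULECT, word_id, doculects, words)  # doculects x words
CW <- link(wordlist$CONCEPT,  word_id, concepts,  words)  # concepts x words

# concepts that share a word ("mano" for both hand and arm in lang_a)
sim <- tcrossprod(CW * 1)
as.matrix(sim)
```

In this toy data the off-diagonal cell for hand and arm is 1, because exactly one word expresses both concepts.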

References

Cysouw, Michael & Jeff Good. 2013. Languoid, Doculect, Glossonym: Formalizing the notion “language”. Language Documentation and Conservation 7. 331-359.

Poornima, Shakthi & Jeff Good. 2010. Modeling and Encoding Traditional Wordlists for Machine Applications. Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground.

Examples


# ----- load data -----

# splitWordlist and the huber data are part of the qlcMatrix package
library(qlcMatrix)

# an example wordlist, see help(huber) for details
data(huber)

# ----- show output -----

# a selection, to see the result of splitWordlist
# only show the simplified output here, 
# the full output is rather long even for just these six words
sel <- c(1:3, 1255:1258)
splitWordlist(huber[sel,], simplify = TRUE)

# ----- split complete data -----

# splitting the complete wordlist is a lot of work!
# it won't get much quicker than this
# most of the time goes into the string-splitting of the almost 26,000 words
# Default version, including splitStrings:
system.time( H <- splitWordlist(huber) )

# Simplified version without splitStrings is much quicker:
system.time( H <- splitWordlist(huber, splitstrings = FALSE, simplify = TRUE) )

# ----- investigate colexification -----

# The simple version can be used to check how often two concepts 
# are expressed identically across all languages ('colexification')
H <- splitWordlist(huber, splitstrings = FALSE, simplify = TRUE)
sim <- tcrossprod(H$CW*1)

# select only the frequent colexifications for a quick visualisation
diag(sim) <- 0
sim <- drop0(sim, tol = 5)
sim <- sim[rowSums(sim) > 0, colSums(sim) > 0]
plot( hclust(as.dist(-sim), method = "average"), cex = .5)

# ----- investigate regular sound correspondences -----

# One central problem with data from many languages is the variation in orthography.
# It is preferable to solve that problem separately,
# e.g. check the column "TOKENS" in the huber data.
# This is a grapheme-separated version of the data, which
# can be used to investigate co-occurrence of graphemes (approx. phonemes)
H <- splitWordlist(huber, counterparts = "TOKENS", sep = " ")

# co-occurrence of all pairs of the 2150 different graphemes through all languages
system.time( G <- assocSparse( (H$CW*1) %*% t(H$SW*1) %*% t(H$GS*1), method = poi))
rownames(G) <- colnames(G) <- H$graphemes
G <- drop0(G, tol = 1)

# select only one language pair for a quick visualisation
# check the nice sound changes between bora and muinane!
# the boolean product %&% keeps GD a pattern matrix, so that its columns
# can be used as logical selectors below
GD <- H$GS %&% H$SW %&% t(H$DW)
colnames(GD) <- H$doculects
correspondences <- G[GD[,"bora"],GD[,"muinane"]]
heatmap(as.matrix(correspondences))
