soundcorrs: Constructor function for the `soundcorrs` class.

Description

Take a data frame and turn it into a soundcorrs object containing data for one language. To obtain a soundcorrs object containing data for multiple languages, see merge.soundcorrs. In the normal workflow, the user should have no need to call this constructor other than through read.soundcorrs.

Usage

soundcorrs(data, name, col.aligned, transcription, separator = "\\|")

Arguments

data

[data.frame] Data for one language.

name

[character] Name of the language.

col.aligned

[character] Name of the column with the aligned words.

transcription

[transcription] The transcription for the given language.

separator

[character] String used to separate segments in col.aligned. Defaults to "\\|", which as a regular expression matches a literal vertical bar.

Value

[soundcorrs] An object containing the provided data and metadata for one language.

Fields

cols

[character list] Names of important columns.

data

[data.frame] The original data.

names

[character] Name of the language.

segms

[character list] Words exploded into segments, with linguistic zeros preserved ($z) or removed ($nz).

segpos

[integer list] A lookup list to check which character belongs to which segment. Counted with linguistic zeros preserved ($z) and removed ($nz).

separators

[character] The strings used as segment separators in cols$aligned.

trans

[transcription] The transcription.

words

[character list] Words obtained by removing separators from the cols$aligned columns, with linguistic zeros ($z) or without them ($nz).
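
The derived fields above can be illustrated with a minimal base R sketch. It is not the package's own code; it only assumes a hypothetical aligned word "a|bc|-|d", the default separator, and "-" as the linguistic zero, and it shows roughly what the $z and $nz variants of segms, words, and segpos would hold for that one word.

```r
# Hypothetical aligned word; "-" is assumed to be the linguistic zero.
aligned <- "a|bc|-|d"
zero <- "-"

# Segments with zeros kept (compare segms$z) and dropped (compare segms$nz).
segs_z <- strsplit(aligned, "\\|")[[1]]
segs_nz <- segs_z[segs_z != zero]

# Words with the separators removed (compare words$z and words$nz).
word_z <- paste(segs_z, collapse = "")
word_nz <- paste(segs_nz, collapse = "")

# Character-to-segment lookup (compare segpos): for every character of the
# word, the index of the segment that character belongs to.
segpos_z <- rep(seq_along(segs_z), nchar(segs_z))
segpos_nz <- rep(seq_along(segs_nz), nchar(segs_nz))
```

Here the third character of word_z ("c") maps to segment 2, because "bc" is a single two-character segment.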

Details

soundcorrs is the fundamental class of the entire soundcorrs package, and it is required for most tasks that the package promises to make easier and faster than manual labour. A soundcorrs object is a list containing the original data frame, some metadata (names of languages, names of columns, transcriptions), as well as transformations of the original data for faster processing in findExamples and other functions (words exploded into individual segments, with segment separators removed, etc.). The basic unit in soundcorrs is a pair/triple/... of words, each of which is assigned to a specific language.

This constructor function is not really intended for the end user. Whenever possible, read.soundcorrs should be used instead. Regardless of the function used, two pieces of information are required for each word: the language it comes from, and its segmented and aligned form. Segmentation means that the word is cut into parts which can represent phonemes, morphemes, or anything else (the default separator is a vertical bar, "|"). A word with no separators in it is considered one big segment, and in fact, for soundchange objects this is enough. Alignment means that each word in a pair/triple/... has the same number of segments, and that those segments are in the corresponding places. Often, one of the words in a pair/triple/... will naturally have fewer segments than the others; in such cases, a filler character, a 'linguistic zero', needs to be used ("-" is a good choice). For example, to align the Spanish and Swedish names for 'Stockholm', a total of three such 'empty' segments is required: e|s|t|o|k|-|o|l|m|o : -|s|t|o|k|k|o|l|m|-. The linguistic zero must be defined in the transcription.
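
The alignment requirement can be checked by hand with base R alone. The sketch below uses the 'Stockholm' pair from the paragraph above and assumes "-" as the linguistic zero. The variable names are illustrative only, and nothing here calls the soundcorrs package.

```r
# The Spanish and Swedish forms from the text, segmented with "|".
spa <- "e|s|t|o|k|-|o|l|m|o"
swe <- "-|s|t|o|k|k|o|l|m|-"

# The separator is escaped because strsplit() treats it as a regular expression.
segs <- strsplit(c(spa, swe), "\\|")

# Aligned words must have the same number of segments.
n_segs <- lengths(segs)
aligned_ok <- length(unique(n_segs)) == 1

# Count the linguistic zeros across both words.
n_zeros <- sum(unlist(segs) == "-")

# Put the corresponding segments side by side for inspection.
pairs <- data.frame(Spanish = segs[[1]], Swedish = segs[[2]])
```

Both words split into ten segments, and the three zeros are the 'empty' segments mentioned above: one in the Spanish form and two in the Swedish one.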

Typically, a soundcorrs object will be used to hold an entire list of pairs/triples/... of words from various languages. However, both this constructor function and read.soundcorrs can only read data from one language at a time. This is because each language requires relatively many pieces of metadata (name, column names, transcription), and if all of this information for multiple languages were to be passed as arguments to one function, the call would very quickly become illegible. Multiple soundcorrs objects can be merged into one using merge.soundcorrs.

Three sample datasets are available: data-abc, data-capitals, and data-ie; they can be loaded with the help of loadSampleDataset.

Examples

# NOT RUN {
# prepare sample transcription
trans <- loadSampleDataset ("trans-common")
# read sample data in the "wide format"
fNameData <- system.file ("extdata", "data-capitals.tsv", package="soundcorrs")
readData <- read.table (fNameData, header=TRUE)
# make out of them a soundcorrs object
ger <- soundcorrs (readData, "German", "ALIGNED.German", trans)
pol <- soundcorrs (readData, "Polish", "ALIGNED.Polish", trans)
spa <- soundcorrs (readData, "Spanish", "ALIGNED.Spanish", trans)
dataset <- merge (ger, pol, spa)
# }