findExamples: Find all pairs/triples/... with corresponding sequences of sounds.

Description

Sift the dataset for word pairs/triples/... such that the word in the first language contains the first sequence, the word in the second language contains the second sequence, and so on.

Usage

findExamples(
  data,
  ...,
  distance.start = -1,
  distance.end = -1,
  na.value = 0,
  zeros = FALSE,
  cols = "aligned",
  perl = FALSE
)

Arguments

data

[soundcorrs] The dataset in which to look.

...

[character] Sequences for which to look. May be regular expressions as defined in R, or in the transcription. If an empty string, anything will be considered a match.

distance.start

[integer] The allowed distance between segments where the sound sequences begin. A negative value means alignment of the beginning of sequences will not be checked. Defaults to -1.

distance.end

[integer] The allowed distance between segments where the sound sequences end. A negative value means alignment of the end of sequences will not be checked. Defaults to -1.

na.value

[numeric] Treat NAs as matches (0) or non-matches (-1)? Note that an empty-string query takes precedence over na.value: even when na.value is set to -1, NAs will show up in the results if the query is an empty string. Defaults to 0.

zeros

[logical] Take linguistic zeros into account? Defaults to FALSE.

cols

[character vector] Which columns of the dataset to return as the result. Can be a vector of names, "aligned" (the two columns with segmented, aligned words), or "all" (all columns). Defaults to "aligned".

perl

[logical] Use Perl-compatible regular expressions? Defaults to FALSE.

Value

[df.findExamples] A list with two fields: $data, a data frame with found examples; and $which, a logical vector showing which rows of data are considered matches.

Details

One of the more time-consuming tasks when working with sound correspondences is looking for specific examples that realize a given correspondence. findExamples can fully automate this process. It has several arguments that help fine-tune the search, of which perhaps the most important are distance.start and distance.end. Note that with their default values (-1 for both), findExamples returns every pair/triple/... of words in which the first word contains the first query, the second word contains the second query, and so on, regardless of whether these segments actually correspond to each other in the alignment. This is intentional: it stems from the assumption that, in this case, false positives are generally less harmful and, above all, easier to spot than false negatives.
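The difference between the loose default and a strict alignment check can be illustrated in base R. This is only a minimal sketch, not the package's implementation: the words are made up, and the "|" segment separator, the "-" linguistic zero, and the helper matchingSegments are assumptions introduced for illustration.

```r
# Hypothetical aligned, segmented words; "|" as the separator and "-" as the
# linguistic zero are assumptions made for this sketch.
word1 <- "a|b|-|a"
word2 <- "x|a|b|a"
word3 <- "x|a|b|c"

# Hypothetical helper: indices of the segments that match a single-segment query.
matchingSegments <- function(word, query) {
  segments <- strsplit(word, "|", fixed = TRUE)[[1]]
  which(grepl(query, segments))
}

seg1 <- matchingSegments(word1, "a")   # segments 1 and 4
seg2 <- matchingSegments(word2, "a")   # segments 2 and 4
seg3 <- matchingSegments(word3, "a")   # segment 2

# Loose check, analogous to a negative distance:
# both words merely contain the query somewhere.
containsBoth <- length(seg1) > 0 && length(seg3) > 0

# Strict check, analogous to a distance of 0:
# some match sits in the same segment in both words.
aligned12 <- length(intersect(seg1, seg2)) > 0
aligned13 <- length(intersect(seg1, seg3)) > 0

# Smallest offset between matching segments in word1 and word3.
offset13 <- min(abs(outer(seg1, seg3, "-")))
```

Here word1 and word3 pass the loose check but fail the strict one, which is the kind of false positive the default settings deliberately allow.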

findExamples accepts regular expressions in queries, both those available in plain R and those defined in the transcription, in both notations accepted by expandMeta. Users are strongly encouraged to become familiar with regular expressions, as this is where the true power of findExamples lies.
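As a rough base-R illustration of why Perl-style lookahead can matter for such queries, the sketch below expands a metacharacter by hand and compares the matches with and without lookahead. The vowel set standing in for "V" and the sample word are assumptions; in the package, the expansion is derived from the transcription rather than written out like this.

```r
# Hypothetical expansion of the metacharacter "V"; the real set of vowels
# would come from the transcription.
vowels <- "[aeiou]"

# With lookahead, the "r" must follow but is not part of the match.
query <- sub("V", vowels, "V(?=r)", fixed = TRUE)
word <- "warsaw"
m <- regexpr(query, word, perl = TRUE)
matched <- regmatches(word, m)

# Without lookahead, the match also consumes the "r".
query2 <- sub("V", vowels, "Vr", fixed = TRUE)
m2 <- regexpr(query2, word)
matched2 <- regmatches(word, m2)

# The match lengths differ, and with them the place where the match ends.
len1 <- attr(m, "match.length")
len2 <- attr(m2, "match.length")
```

Because the lookahead version matches the vowel alone, its start and end fall on the same sound, which is what a check on both the beginning and the end of the sequences would compare.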

Examples

# NOT RUN {
# In the examples below, non-ASCII characters had to be escaped for technical reasons.
# In actual usage, Unicode is supported under BSD, Linux, and macOS.

# prepare sample dataset
dataset <- loadSampleDataset ("data-capitals")
# find examples which have "a" in all three languages
findExamples (dataset, "a", "a", "a")
# find examples where German has schwa, and Polish and Spanish have a Vr sequence
findExamples (dataset, "\u0259", "Vr", "Vr")
# as above, but the schwa and the two vowels must be in the same segment
findExamples (dataset, "\u0259", "V(?=r)", "V(?=r)", distance.start=0, distance.end=0, perl=TRUE)
# find examples where German has a-umlaut, Polish has a or e, and Spanish has any sound at all
findExamples (dataset, "\u00E4", "[ae]", "")
# find examples where German has a linguistic zero while Polish and Spanish do not
findExamples (dataset, "-", "[^-]", "[^-]", zeros=TRUE)
# find examples where German has schwa, and Polish and Spanish have a
findExamples (dataset, "\u0259", "a", "a", distance.start=-1, distance.end=-1)
# as above, but the schwa and the two a's must be in the same segment
findExamples (dataset, "\u0259", "a", "a", distance.start=0, distance.end=0)
# }