SciencesPo (version 0.03.21)

soundexBR: Soundex Encoding For Portuguese BR

Description

soundexBR returns a census-like soundex code for a name string, following the Brazilian Portuguese sound system. The function was originally designed to work with the RecordLinkage package, but it is also useful as a standalone function. See Details below.

Usage

soundexBR(term)

Arguments

term
a list, a vector or a data frame with strings.

Value

  • A character vector or matrix with the same dimensions as term.

encoding

UTF-8

Details

This function assigns a soundex code to strings based on how they sound rather than how they are spelled. For instance, names that sound alike but are spelled differently, such as SOUZA and SOUSA, are assigned an identical code. The function may therefore help find surnames even when they were registered with minor misspellings. The code is 4 characters long: a letter followed by three digits (pattern 0-000). The digits encode the remaining letters of the name.
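The idea of one letter plus three digits can be illustrated with a minimal sketch. It uses the classic English (census) Soundex letter groups as a stand-in and is not the Portuguese-specific rule set implemented by soundexBR, which this page does not spell out. The name soundex_sketch is hypothetical, and the sketch assumes plain ASCII input (accented letters are simply dropped).

```r
# Simplified classic Soundex; an illustration only, NOT soundexBR's rules.
soundex_sketch <- function(x) {
  encode_one <- function(name) {
    chars <- strsplit(toupper(name), "")[[1]]
    chars <- chars[chars %in% LETTERS]        # keep ASCII letters only
    if (length(chars) == 0) return(NA_character_)

    # Map consonants to their digit group; vowels, H, W and Y get no code.
    codes <- chartr("BFPVCGJKQSXZDTLMNR", "111122222222334556", chars)
    codes[!(codes %in% as.character(1:6))] <- ""

    # H and W do not separate letters with the same code, so drop them
    # (except in first position, where the letter itself is kept).
    keep <- !(chars %in% c("H", "W"))
    keep[1] <- TRUE
    codes <- codes[keep]

    # Collapse runs of identical codes, then discard the run that belongs
    # to the first letter and the empty codes left by vowels.
    digits <- rle(codes)$values[-1]
    digits <- digits[digits != ""]

    # First letter + three digits, padded with zeros.
    padded <- substr(paste0(paste(digits, collapse = ""), "000"), 1, 3)
    paste0(chars[1], padded)
  }
  vapply(x, encode_one, character(1), USE.NAMES = FALSE)
}

soundex_sketch(c("SOUZA", "SOUSA"))  # both names receive the same code
```

Because Z and S fall in the same letter group, the two spellings collapse to one code, which is the behaviour described above. A language-specific variant would change the letter groups and add preprocessing for accents and digraphs.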

References

Borg, Andreas and Murat Sariyar. (2012) RecordLinkage: Record Linkage in R, R package version 0.4-1, http://CRAN.R-project.org/package=RecordLinkage.

Camargo Jr. and Coeli CM. (2000) Reclink: aplicativo para o relacionamento de bases de dados, implementando o método probabilistic record linkage. Cad. Saúde Pública, 16(2), Rio de Janeiro.

Marcelino, Daniel (2013) SciencesPo: A Tool Set for Analyzing Political Behaviour Data, http://dx.doi.org/10.2139/ssrn.2320547.

See Also

soundexES, soundexFR.

Examples

# Miscellaneous
names <- c('Ana Karolina Kuhnen',
'Ana Carolina Kuhnen', 'Ana Karolina',
'Dilma Vana Rousseff', 'Dilma Rousef')
  
soundexBR(names)

# Example with RecordLinkage
# Some data:
mydata1 <- data.frame(
  fname = c('Ricardo','Maria','Tereza','Pedro','José', 'Germano'),
  lname = c('Cunha','Andrade','Silva','Soares','Silva','Lima'),
  age = c(67,89,78,65,68,67),
  birth = c(1945,1923,1934,1947,1944,1945),
  date = c(20120907,20120703,20120301,20120805,20121004,20121209))

mydata2 <- data.frame(
  fname = c('Maria','Lúcia','Paulo','Marcos', 'Ricardo', 'Germânio'),
  lname = c('Andrada','Silva','Soares','Pereira','Cunha','Lima'),
  age = c(67,88,78,60,68,80),
  birth = c(1945,1924,1934,1952,1944,1932),
  date = c(20121208,20121103,20120302,20120105,20121004,20121209))

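# Requires the RecordLinkage package
library(RecordLinkage)

# Arguments are passed with `=`; using `<-` inside the call would also
# create stray variables in the calling environment.
pairs <- compare.linkage(mydata1, mydata2,
  blockfld = list(c(1, 2, 4), c(1, 2)),
  phonetic = c(1, 2), phonfun = soundexBR, strcmp = FALSE,
  strcmpfun = jarowinkler, exclude = FALSE, identity1 = NA,
  identity2 = NA, n_match = NA, n_non_match = NA)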
      
print(pairs)

editMatch(pairs)

# To access information in the object:  
weights <- epiWeights(pairs, e = 0.01, f = pairs$frequencies)
hist(weights$Wdata, plot = FALSE) # set plot = TRUE to draw the histogram
getPairs(pairs, max.weight = Inf, min.weight = -Inf)