
SciencesPo (version 0.11.21)

soundexBR: Soundex Encoding For Portuguese BR

Description

soundexBR returns a census-like Soundex code for a name string, following the sound system of Brazilian Portuguese. The function was originally designed to work with the RecordLinkage package, but it is also useful as a standalone function. See Details below.

Usage

soundexBR(term)

Arguments

term
a list, a vector or a data frame with strings.

Value

  • A character vector or matrix with the same dimensions as term.

encoding

UTF-8

Details

This function assigns a Soundex code to strings based on how they sound rather than how they are spelled. For instance, names that sound alike but are spelled differently, such as SOUZA and SOUSA, are assigned an identical code. The function can therefore help find names even when they were registered with minor misspellings. The code is four characters long: a letter followed by three digits (the pattern 0-000), where the digits encode the remaining letters.
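To make the idea of sound-based coding concrete, here is a minimal sketch of the classic (English) Soundex scheme in base R. It is not the soundexBR implementation and does not apply the Portuguese-specific rules; `soundex_sketch` is a hypothetical helper written only for illustration, and it simply drops accented or non-ASCII letters. The output is written without a separator, which is also an assumption.

```r
# Simplified classic (English) Soundex, for illustration only.
# soundex_sketch is a hypothetical helper, NOT part of SciencesPo,
# and it does not reproduce the Portuguese rules used by soundexBR.
soundex_sketch <- function(x) {
  # Consonant groups that share a digit; all other letters carry no code
  codes <- c(B = "1", F = "1", P = "1", V = "1",
             C = "2", G = "2", J = "2", K = "2",
             Q = "2", S = "2", X = "2", Z = "2",
             D = "3", T = "3", L = "4",
             M = "5", N = "5", R = "6")
  encode_one <- function(name) {
    # Upper-case and keep only the 26 unaccented letters
    chars <- strsplit(toupper(name), "")[[1]]
    chars <- chars[chars %in% LETTERS]
    if (length(chars) == 0) return(NA_character_)
    first <- chars[1]
    rest <- chars[-1]
    # H and W do not separate equal codes, so drop them after the first letter
    chars <- c(first, rest[!(rest %in% c("H", "W"))])
    # Map letters to digits; uncoded letters (vowels, Y, H, W) become "0"
    digits <- unname(codes[chars])
    digits[is.na(digits)] <- "0"
    # Collapse runs of equal digits, then drop the run of the first letter
    digits <- rle(digits)$values[-1]
    # Remove uncoded letters, pad with zeros, keep three digits
    digits <- digits[digits != "0"]
    code <- substr(paste0(paste(digits, collapse = ""), "000"), 1, 3)
    paste0(first, code)
  }
  vapply(x, encode_one, character(1), USE.NAMES = FALSE)
}

soundex_sketch(c("SOUZA", "SOUSA"))
# "S200" "S200"
```

Both spellings collapse to the same code because S and Z fall in the same consonant group and the vowels carry no digit, which is the behaviour the paragraph above describes.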

References

Borg, Andreas and Murat Sariyar. (2012) RecordLinkage: Record Linkage in R, R package version 0.4-1, http://CRAN.R-project.org/package=RecordLinkage.

Camargo Jr. and Coeli CM. (2000) Reclink: aplicativo para o relacionamento de bases de dados, implementando o método probabilistic record linkage. Cad. Saúde Pública, 16(2), Rio de Janeiro.

Marcelino, Daniel (2013) SciencesPo: A Tool Set for Analyzing Political Behaviour Data, http://dx.doi.org/10.2139/ssrn.2320547.

Paula, Fátima de Lima (2014) Readmissão Hospitalar de Idosos após Internação por Fratura Proximal do Fêmur no Município do Rio de Janeiro, Doctoral thesis, Fiocruz.

Examples

# A silly example:
names <- c('Ana Karolina Kuhnen',
'Ana Carolina Kuhnen', 'Ana Karolina',
'Dilma Vana Rousseff', 'Dilma Rousef')
  
soundexBR(names)

# Example with RecordLinkage:
# Some data:
data1 <- data.frame(list(
fname=c('Ricardo','Maria','Tereza','Pedro','José','Germano'),
lname=c('Cunha','Andrade','Silva','Soares','Silva','Lima'),
age=c(67,89,78,65,68,67),
birth=c(1945,1923,1934,1947,1944,1945),
date=c(20120907,20120703,20120301,20120805,20121004,20121209)
))

data2 <- data.frame(list(
fname=c('Maria','Lúcia','Paulo','Marcos','Ricardo','Germânio'),
lname=c('Andrada','Silva','Soares','Pereira','Cunha','Lima'),
age=c(67,88,78,60,68,80),
birth=c(1945,1924,1934,1952,1944,1932),
date=c(20121208,20121103,20120302,20120105,20121004,20121209)
))


# Requires the RecordLinkage package
library(RecordLinkage)

# Use `=` for named arguments; `<-` inside a call creates global variables
pairs <- compare.linkage(data1, data2,
blockfld = list(c(1,2,4), c(1,2)),
phonetic = c(1,2), phonfun = soundexBR, strcmp = FALSE,
strcmpfun = jarowinkler, exclude = FALSE, identity1 = NA,
identity2 = NA, n_match = NA, n_non_match = NA)
      
print(pairs)

editMatch(pairs) # Interactive: opens an editor to review the match status of pairs

# To access information in the object:  
weights <- epiWeights(pairs, e = 0.01, f = pairs$frequencies)
hist(weights$Wdata, plot = FALSE) # Set plot = TRUE to draw the histogram
getPairs(pairs, max.weight = Inf, min.weight = -Inf)
