RecordLinkage (version 0.3-2)

RLdata: Test data for Record Linkage

Description

These tables contain artificial personal data for evaluating record linkage procedures. Some records have been duplicated with randomly generated errors. RLdata500 contains 50 duplicates, and RLdata10000 contains 1000 duplicates.

Usage

RLdata500
RLdata10000
identity.RLdata500
identity.RLdata10000
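
A minimal sketch of loading the smaller data set and counting its duplicates. It assumes the RecordLinkage package is installed, that `data(RLdata500)` also loads `identity.RLdata500`, and that records sharing a value in the identity vector refer to the same entity. The variable names `n_records`, `n_entities` and `n_duplicates` are illustrative only.

```r
# Assumption: RecordLinkage is installed, e.g. via install.packages("RecordLinkage")
library(RecordLinkage)

# Load the 500-record table; the identity vector is assumed to come with it
data(RLdata500)

# Look at the first few records
head(RLdata500)

# Records with the same identity value are treated as the same entity,
# so the number of duplicates is the record count minus the entity count
n_records <- nrow(RLdata500)
n_entities <- length(unique(identity.RLdata500))
n_duplicates <- n_records - n_entities

# Compare this with the 50 duplicates stated in the Description
n_duplicates
```

The same steps should apply to RLdata10000 and identity.RLdata10000.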

Source

Generated with the data generation component of Febrl (Freely Extensible Biomedical Record Linkage), version 0.3. See http://datamining.anu.edu.au/projects/linkage.html for details.

The following data sources were used (all relate to Germany):

http://blog.beliebte-vornamen.de/2009/02/prozentuale-anteile-2008/: a list of the frequencies of the 20 most popular female names in 2008.

http://www.beliebte-vornamen.de/760-alle_jahre.htm: a list of the 100 most popular first names since 1890. The frequencies found in the source above were extrapolated to fit this list.

http://www.peter-doerling.de/Geneal/Nachnamen_100.htm: a list of the 100 most frequent family names with frequencies.

Age distribution as of December 31, 2008: statistics of Statistisches Bundesamt Deutschland, taken from the GENESIS database (https://www-genesis.destatis.de/genesis/online/logon).

Web links as of October 2009.