podcleaner

The Scottish Post Office directories are annual directories that list a town’s or county’s inhabitants alphabetically, with their forename, surname, occupation and address(es); they provide a solid basis for researching Scotland’s family, trade, and town history. A large number of these, covering most of Scotland and dating from 1773 to 1911, can be accessed in digitised form from the National Library of Scotland. podcleaner attempts to clean optical character recognition (OCR) errors in directory records after they have been parsed and saved to “csv” files using a third-party tool[1]. The package further attempts to match records from trades and general directories. See the tests folder for examples of running unexported functions.

Load

Load general and trades directory samples in memory from “csv” files:

  • Globals
library(podcleaner)

directories <- c("1861-1862")

progress <- TRUE; verbose <- FALSE

  • General directories
path_directories <- utils_make_path("data", "general-directories")

general_directory <- utils_load_directories_csv(
  type = "general", directories, path_directories, verbose
)

print.data.frame(general_directory)
#>   directory page    surname forename
#> 1 1861-1862   71       ABOT      Wm.
#> 2 1861-1862   71 ABRCROMBIE     Alex
#>                                                occupation
#> 1 Wine and spirit mercht — See Advertisement in Appendix.
#> 2                                                        
#>                                                    addresses
#> 1                           1S20 Londn rd; ho. 13<J Queun sq
#> 2 Bkr; I2 Dixon Street, & 29 Auderstn Qu.; res 2G5 Argul st.

  • Trades directories
path_directories <- utils_make_path("data", "trades-directories")

trades_directory <- utils_load_directories_csv(
  type = "trades", directories, path_directories, verbose
)

print.data.frame(trades_directory)
#>   directory page rank                                              occupation
#> 1 1861-1862   71  135 Wine and spirit mercht — See Advertisement in Appendix.
#> 2 1861-1862   71  326                                                     Bkr
#> 3 1861-1862   71  586                                               Victualer
#>          type    surname forename address.trade.body address.trade.number
#> 1 OWN ACCOUNT       ABOT      Wm.          Londn rd.                 1S20
#> 2 OWN ACCOUNT ABRCROMBIE     Alex           Dixen pl                   I2
#> 3 OWN ACCOUNT       BLAI  Jon Hug           High St.                  2S0

Clean

Clean records from both datasets:

  • General directories
general_directory <-
  general_clean_directory(general_directory, progress, verbose)

print.data.frame(general_directory)
#>   directory page    surname  forename               occupation
#> 1 1861-1862   71     Abbott   William Wine and spirit merchant
#> 2 1861-1862   71 Abercromby Alexander                    Baker
#> 3 1861-1862   71 Abercromby Alexander                    Baker
#>   address.trade.number address.trade.body address.house.number
#> 1               18, 20       London Road.                  136
#> 2                   12      Dixon Street.                  265
#> 3                   29    Anderston Quay.                  265
#>   address.house.body
#> 1      Queen Square.
#> 2     Argyle Street.
#> 3     Argyle Street.

  • Trades directories
trades_directory <-
  trades_clean_directory(trades_directory, progress, verbose)

print.data.frame(trades_directory)
#>   directory page rank    surname  forename               occupation        type
#> 1 1861-1862   71  135     Abbott   William Wine and spirit merchant OWN ACCOUNT
#> 2 1861-1862   71  326 Abercromby Alexander                    Baker OWN ACCOUNT
#> 3 1861-1862   71  586      Blair John Hugh               Victualler OWN ACCOUNT
#>   address.trade.number address.trade.body
#> 1               18, 20       London Road.
#> 2                   12       Dixon Place.
#> 3                  280       High Street.
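
To give a feel for the kind of substitution involved in the cleaning step, here is a minimal base R sketch that expands abbreviated place types at the end of an address. It is an illustration only: the `place_types` lookup and the `expand_place_types` helper are hypothetical names made up for this example, not the package's actual rules or functions, and the sketch leaves spelling errors such as "Londn" untouched.

```r
# Hypothetical lookup of abbreviations and their full place types
# (illustrative only, not the rule set used by podcleaner).
place_types <- c(rd = "Road", st = "Street", sq = "Square", pl = "Place")

# Hypothetical helper: expand a trailing place-type abbreviation.
expand_place_types <- function(address) {
  for (abbreviation in names(place_types)) {
    # Match the abbreviation as a whole word at the end of the address,
    # optionally followed by a period, ignoring case.
    pattern <- paste0("\\b", abbreviation, "\\b\\.?$")
    address <- sub(
      pattern, place_types[[abbreviation]], address,
      ignore.case = TRUE, perl = TRUE
    )
  }
  address
}

addresses <- c("Londn rd.", "Dixen pl", "High St.")
expand_place_types(addresses)
#> [1] "Londn Road"  "Dixen Place" "High Street"
```

Correcting the remaining misspellings needs reference lists of known names, which is presumably the role of package data such as globals_address_names.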

Match

Match general to trades directory records:

distance <- TRUE; matches <- TRUE

directory <- combine_match_general_to_trades(
  trades_directory, general_directory, progress, verbose, distance, matches,
  method = "osa", max_dist = 5L
)

print.data.frame(directory)
#>   directory page rank    surname  forename               occupation        type
#> 1 1861-1862   71  135     Abbott   William Wine and spirit merchant OWN ACCOUNT
#> 2 1861-1862   71  326 Abercromby Alexander                    Baker OWN ACCOUNT
#> 3 1861-1862   71  586      Blair John Hugh               Victualler OWN ACCOUNT
#>   address.trade.number address.trade.body address.house.number
#> 1               18, 20       London Road.                  136
#> 2                   12       Dixon Place.                  265
#> 3                  280       High Street.                     
#>                       address.house.body distance
#> 1                          Queen Square.        0
#> 2                         Argyle Street.        5
#> 3 Failed to match with general directory       NA
#>                                     match
#> 1    Abbott William - 18, 20, London Road
#> 2 Abercromby Alexander - 12, Dixon Street
#> 3                                    <NA>

Directory records are compared and matched using a string distance metric, computed with the method and corresponding parameters passed as arguments (here the "osa" method with a maximum distance of 5). Under the hood, the fuzzyjoin package, and its stringdist_left_join function in particular, handles the matching operations.
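
The matching idea can be sketched directly with fuzzyjoin. The example below is not the package's internal code: the `trade_key` and `general_key` columns and the way the match strings are assembled are assumptions modelled on the "match" column shown above. It joins each trades record to any general record whose match string lies within the given distance.

```r
library(fuzzyjoin)

# Illustrative match strings (name plus trade address), modelled on the
# "match" column in the output above; the column names are hypothetical.
trades <- data.frame(
  trade_key = c(
    "Abbott William - 18, 20, London Road",
    "Abercromby Alexander - 12, Dixon Place",
    "Blair John Hugh - 280, High Street"
  ),
  stringsAsFactors = FALSE
)

general <- data.frame(
  general_key = c(
    "Abbott William - 18, 20, London Road",
    "Abercromby Alexander - 12, Dixon Street",
    "Abercromby Alexander - 29, Anderston Quay"
  ),
  house = c("136, Queen Square", "265, Argyle Street", "265, Argyle Street"),
  stringsAsFactors = FALSE
)

# Keep every trades record; attach general records whose key is within
# max_dist edits under the optimal string alignment ("osa") metric.
matched <- stringdist_left_join(
  trades, general,
  by = c(trade_key = "general_key"),
  method = "osa", max_dist = 5, distance_col = "distance"
)

print(matched)
```

With these inputs the Abbott record should match exactly, the Abercromby record should match "Dixon Street" at the edge of the distance threshold, and the Blair record should come back with missing values, mirroring the package output above. A lower max_dist makes matching stricter at the cost of more failed matches.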

Save

utils_IO_write(directory, "dev", "post-office-directory")

  1. See, for example, the Python podparser library.

Install

install.packages('podcleaner')

Monthly Downloads

95

Version

0.1.2

License

GPL (>= 3)

Maintainer

Olivier Bautheac

Last Published

January 11th, 2022

Functions in podcleaner (0.1.2)

clean_address_post_clean

Post-cleaning operation for address entry(/ies)
clean_address_mac

Standardise "Mac" prefix in address entry(/ies)
clean_address_places

Clean places in address entry(/ies)
clean_address_worksites

Clean worksites in address entry(/ies)
combine_has_match_failed

Check for failed matches
clean_address_suffixes

Clean unwanted suffixes in address entry(/ies)
combine_get_address_house_type

Get house address column type
clean_mac

Standardise "Mac" prefix in people's names
clean_name_ends

Clean ends in entry(/ies) names
clean_address_saints

Clean "Saint" prefix in address entry(/ies)
clean_specials

Clean entry(/ies) special characters
clean_string_ends

Clean string ends
clean_address_pre_clean

Pre-cleaning operation for address entry(/ies)
clean_surname_spelling

Clean surname(s) spelling
globals_regex_and_match

Regular expression for mutate operations in directory datasets
clean_forename

Clean entry(/ies) forename
clean_forename_punctuation

Standardise punctuation in forename(s)
combine_label_if_match_failed

Label failed matches
combine_label_failed_matches

Label failed matches
clean_occupation

Clean entry(/ies) occupation
globals_occupations

Occupations in directory records
globals_regex_get_address_house_type

Regular expression for mutate operations in directory datasets
globals_places_raw

Place types in address entries
globals_titles

Titles in directory name records
clean_forename_separate_words

Separate double-barrelled forename(s)
combine_match_general_to_trades_progress

Match general to trades directory records
combine_match_general_to_trades_plain

Match general to trades directory records
general_clean_entries

Mutate operation(s) in Scottish post office general directory data.frame column(s)
general_clean_directory_progress

Mutate operation(s) in Scottish post office general directory data.frame column(s)
general_clean_directory

Mutate operation(s) in Scottish post office general directory data.frame column(s)
general_move_house_to_address

Mutate operation(s) in Scottish post office general directory data.frame column(s)
combine_random_string_if_pattern

Conditionally return a random string
general_clean_directory_plain

Mutate operation(s) in Scottish post office general directory data.frame column(s)
clean_title

Clean entry(/ies) name title
clean_forename_spelling

Clean forename(s) spelling
combine_random_string_if_no_address

Conditionally return a random string
combine_no_trade_address_to_random_string

Mutate operation(s) in directory data.frame address.trade column.
clean_parentheses

Clean entry(/ies) of information in brackets
general_repatriate_occupation_from_address

Mutate operation(s) in Scottish post office general directory data.frame column(s)
general_split_address_numbers_bodies

Mutate operation(s) in Scottish post office general directory data.frame column(s)
globals_regex_occupation_from_address

Regular expression for mutate operations in directory datasets
globals_regex_irrelevants

Regular expression for mutate operations in directory datasets
globals_ampersand

Ampersand in directory entries
globals_ampersand_vector

Ampersand in directory entries
general_fix_structure

Mutate operation(s) in Scottish post office general directory data.frame column(s)
globals_suffixes

Address suffixes
globals_trades_colnames

Trades directory column names
trades_clean_directory_progress

Mutate operation(s) in Scottish post office trades directory data.frame column(s)
globals_and_single_quote

Ampersand in directory entries
globals_and_double_quote

Ampersand in directory entries
general_split_trade_addresses

Mutate operation(s) in Scottish post office general directory data.frame column(s)
utils_clean_address_body

Clean address(es) body
globals_surnames

Surnames in directory records
trades_clean_entries

Mutate operation(s) in Scottish post office trades directory data.frame column(s)
utils_clean_address_ends

Clean address entry ends
globals_numbers

Numbers in address entries
globals_macs

"Mac" prefixes in name entries
utils_IO_load

Load object into memory
globals_forenames

Forenames in directory records
globals_regex_split_address_body

Regular expression for mutate operations in directory datasets
globals_regex_split_address_empty

Regular expression for mutate operations in directory datasets
globals_general_colnames

General directory column names
globals_places_regex

Place types in address entries
utils_clean_address

Clean directory address entries
globals_saints

Saints in address names
utils_IO_write

Write object to long-term memory
utils_clear_irrelevants

Mutate operation(s) in directory dataframe column(s)
globals_regex_address_house_body_number

Regular expression for mutate operations in directory datasets
globals_regex_titles

Regular expression for mutate operations in directory datasets
utils_IO_path

Make path for input/output operations
globals_union_colnames

Combined directories column names
globals_regex_house_to_address

Regular expression for mutate operations in directory datasets
globals_regex_house_split_trade

Regular expression for mutate operations in directory datasets
utils_execute

Execute function
globals_worksites

Worksites in address entries
utils_mutate_across

Mutate operation(s) in dataframe column(s)
globals_regex_split_address_numbers

Regular expression for mutate operations in directory datasets
utils_clean_addresses

Clean directory addresses
utils_clean_address_number

Clean address(es) number
globals_regex_split_trade_addresses

Regular expression for mutate operations in directory datasets
utils_load_directories_csv

Load directory "csv" file(s) into memory
utils_paste_if_found

Conditionally amend character string vector.
utils_label_missing_addresses

Label empty addresses as missing
utils_regmatches_if_found

Conditionally amend character string vector.
clean_surname_punctuation

Standardise punctuation in surname(s)
clean_surname

Clean entry(/ies) surname
combine_make_match_string

Mutate operation(s) in directory data.frame trade address column
utils_mute

Mute a function call execution
utils_clean_ends

Clean entry ends
utils_label_address_if_missing

Label addresses if missing
utils_clean_names

Clean entries name records
utils_is_address_missing

Check whether address entry is not missing
utils_squish_all_columns

Clear extra white spaces in dataframe
utils_split_and_name

Split string into tibble
combine_match_general_to_trades

Match general to trades directory records
globals_address_names

Place names in address entries
general_split_trade_house_addresses

Mutate operation(s) in Scottish post office general directory data.frame column(s)
utils_make_file

Make file name
utils_make_path

Make destination path
globals_regex_address_prefix

Regular expression for mutate operations in directory datasets
utils_clear_content

Clear string of matched content
globals_regex_and_filter

Regular expression for mutate operations in directory datasets
trades_clean_directory

Mutate operation(s) in Scottish post office trades directory data.frame column(s)
utils_gsub_if_found

Conditionally amend character string vector.
trades_clean_directory_plain

Mutate operation(s) in Scottish post office trades directory data.frame column(s)
utils_format_directory_raw

Format raw directory for further processing
utils_clean_occupations

Clean entries occupation record
utils_regmatches_if_not_empty

Conditionally amend character string vector.
utils_remove_address_prefix

Clear undesired address prefixes
clean_address_others

Miscellaneous cleaning operations in address entry(/ies)
clean_address_number

Clean address entry numbers
clean_address_attached_words

Clean attached words in address entry(/ies)
clean_address_names

Clean place name(s) in address entry(/ies)
clean_address_ends

Clean ends in address entry(/ies)
clean_address_possessives

Standardise possessives in address entry(/ies)
clean_address_body

Clean address entry(/ies) body