campfin

The campfin package was created to facilitate the work being done on The Accountability Project, a tool created by The Investigative Reporting Workshop in Washington, DC. The Accountability Project curates, cleans, and indexes public data to give journalists, researchers, and others a simple way to search across otherwise siloed records.

The data focuses on people, organizations and locations. This package was created specifically to help with state-level campaign finance data, although the tools included are useful in general database exploration and normalization.

Installation

You can install the released version of campfin from CRAN with:

install.packages("campfin")

The development version can be installed from GitHub with:

# install.packages("remotes")
remotes::install_github("irworkshop/campfin")

Normalize

The package was originally built to normalize geographic data using the normal_*() functions, which take the messy self-reported geographic data of a contributor, vendor, candidate, or committee and return normalized text that is more searchable. They are largely wrappers around the stringr package, and can call other sub-functions to streamline normalization.

  • normal_address() takes a street address and reduces inconsistencies.
  • normal_zip() takes ZIP Codes and aims to return a valid 5-digit code.
  • normal_state() takes US states and returns a 2-letter abbreviation.
  • normal_city() takes cities and reduces inconsistencies.
  • normal_phone() consistently formats US telephone numbers.
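To make the ZIP Code case concrete, here is a minimal base R sketch of the kind of cleanup involved. The helper sketch_zip() is hypothetical and written only for illustration; it is not the implementation of normal_zip() and handles only the simplest cases.

```r
# Hypothetical helper, NOT campfin's normal_zip(): a base R illustration only.
sketch_zip <- function(zip) {
  zip <- gsub("[^0-9]", "", as.character(zip))  # keep digits only
  zip[nchar(zip) %in% 0] <- NA_character_       # nothing left means missing
  short <- !is.na(zip) & nchar(zip) < 5         # codes that lost leading zeros
  zip[short] <- paste0(strrep("0", 5 - nchar(zip[short])), zip[short])
  substr(zip, 1, 5)                             # drop any ZIP+4 suffix
}

sketch_zip(c("00640", "640", "12405-1234", "n/a"))
#> [1] "00640" "00640" "12405" NA
```

The sketch assumes every input is either a 5-digit code, a ZIP+4 code, or a code whose leading zeros were dropped; it does not check the result against valid_zip.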

Please see the vignette on normalization for an example of how these functions are used to fix a wide variety of string inconsistencies and make campaign finance data more consistent. In general, these functions apply the following steps:

  • Capitalize with str_to_upper()
  • Replace hyphens and underscores with str_replace()
  • Remove remaining punctuation with str_remove()
  • Remove either numbers or letters (depending on data) with str_remove()
  • Remove excess white space with str_trim() and str_squish()
  • Abbreviate addresses with abbrev_full() (and str_replace_all())
  • Remove invalid values with na_out() (and str_which())
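The steps above can be approximated in base R, without stringr. The helper name sketch_normal() and its list of invalid values are assumptions made for illustration; this is a sketch of the general idea, not how the normal_*() functions are implemented.

```r
# Hypothetical helper, NOT part of campfin: base R stand-ins for the listed steps.
sketch_normal <- function(x, invalid = c("", "NA", "UNKNOWN")) {
  x <- toupper(x)                               # capitalize
  x <- gsub("[-_]", " ", x)                     # hyphens and underscores become spaces
  x <- gsub("[[:punct:]]", "", x)               # remove remaining punctuation
  x <- gsub("[[:space:]]+", " ", trimws(x))     # trim and squish white space
  x[x %in% invalid] <- NA_character_            # replace invalid values with NA
  x
}

sketch_normal(c("  123 main_st. ", "n/a", "unknown"))
#> [1] "123 MAIN ST" NA            NA
```

The order matters: hyphens and underscores are converted to spaces before the remaining punctuation is removed, so that "main_st" becomes two words rather than one.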

Data

library(campfin)
library(tidyverse)

The campfin package contains a number of built-in data frames and strings used to help wrangle campaign finance data.

data(package = "campfin")$results[, "Item"]
#>  [1] "dark2"        "extra_city"   "invalid_city" "rx_phone"     "rx_state"     "rx_url"      
#>  [7] "rx_zip"       "usps_city"    "usps_state"   "usps_street"  "valid_abb"    "valid_city"  
#> [13] "valid_name"   "valid_state"  "valid_zip"    "zipcodes"

The /data-raw directory contains the code used to create the objects.

zipcodes

The zipcodes (plural) table is a new version of the zipcode (singular) table from the archived zipcode R package.

This database was composed using ZIP code gazetteers from the US Census Bureau from 1999 and 2000, augmented with additional ZIP code information. The database is believed to contain over 98% of the ZIP Codes in current use in the United States. The remaining ZIP Codes absent from this database are entirely PO Box or Firm ZIP codes added in the last five years, which are no longer published by the Census Bureau, but in any event serve a very small minority of the population (probably on the order of 0.1% or less). Although every attempt has been made to filter them out, this data set may contain up to 0.5% false positives, that is, ZIP codes that do not exist or are no longer in use but are included due to erroneous data sources.

The included valid_city and valid_zip vectors are sorted, unique columns from the zipcodes data frame.
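As a rough illustration of what "sorted, unique columns" means, the sketch below derives such vectors from a toy data frame. The objects toy, toy_city, and toy_zip are hypothetical stand-ins (a few rows copied from the sample output, with one row repeated on purpose); the actual code that builds valid_city and valid_zip lives in /data-raw and may differ.

```r
# Toy stand-in for the zipcodes table; one row is duplicated deliberately.
toy <- data.frame(
  city  = c("WOODSIDE", "ACRA", "HOLT", "ACRA"),
  state = c("NY", "NY", "FL", "NY"),
  zip   = c("11377", "12405", "32564", "12405"),
  stringsAsFactors = FALSE
)

# Sorted, de-duplicated lookup vectors, analogous to valid_city and valid_zip.
toy_city <- sort(unique(toy$city))
toy_zip  <- sort(unique(toy$zip))

toy_zip
#> [1] "11377" "12405" "32564"

# A lookup vector makes validity checks a simple membership test.
c("12405", "99999") %in% toy_zip
#> [1]  TRUE FALSE
```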

sample_frac(zipcodes)
#> # A tibble: 44,336 x 3
#>    city           state zip  
#>    <chr>          <chr> <chr>
#>  1 KANSAS CITY    MO    64155
#>  2 HOLT           FL    32564
#>  3 WOODBRIDGE     VA    22194
#>  4 MERRITT ISLAND FL    32954
#>  5 WOODSIDE       NY    11377
#>  6 CLAY           KY    42404
#>  7 ACRA           NY    12405
#>  8 COLUMBUS       OH    43271
#>  9 STATE COLLEGE  PA    16801
#> 10 COAMO          PR    00640
#> # … with 44,326 more rows

usps_* and valid_*

The usps_* data frames were scraped from the official United States Postal Service (USPS) Postal Addressing Standards. These data frames are designed to work with the abbreviation functionality of normal_address() and normal_city() to replace common abbreviations with their full equivalent.

usps_city is a curated subset of usps_street, containing the entries whose full versions appear at least once in the valid_city vector from zipcodes. The valid_state and valid_name vectors contain the columns from usps_state and include territories not found in R’s built-in state.abb and state.name vectors.

sample_n(usps_street, 3)
#> # A tibble: 3 x 2
#>   full  abb  
#>   <chr> <chr>
#> 1 GTWAY GTWY 
#> 2 NORTH N    
#> 3 LIGHT LGT
sample_n(usps_state, 3)
#> # A tibble: 3 x 2
#>   full     abb  
#>   <chr>    <chr>
#> 1 OREGON   OR   
#> 2 MISSOURI MO   
#> 3 WYOMING  WY
setdiff(valid_state, state.abb)
#>  [1] "AS" "AA" "AE" "AP" "DC" "FM" "GU" "MH" "MP" "PW" "PR" "VI"

The campfin project is released with a Contributor Code of Conduct. By contributing, you agree to abide by its terms.

Version: 1.0.7

License: CC BY 4.0

Maintainer: Kiernan Nicholls

Last Published: April 12th, 2021

Monthly Downloads: 327

Functions in campfin (1.0.7)

  • campfin: campfin package
  • col_date_mdy: Parse USA date columns in readr functions
  • add_prop: Add proportions
  • all_files_new: Check if all files in a directory are new
  • abbrev_state: Abbreviate US state names
  • check_city: Check whether an input is a valid place with Google Maps API
  • abbrev_full: Abbreviate full strings
  • count.character: Count values in a character vector
  • col_stats: Apply a statistic function to all column vectors
  • explore_plot: Create Basic Barplots
  • extra_city: Additional US City Names
  • count_in: Count in
  • is_abbrev: Check if abbreviation
  • keypad_convert: Convert letters or numbers to their keypad counterpart
  • invert_named: Invert a named vector
  • is_even: Check if even
  • is_binary: Check if Binary
  • expand_state: Expand US state names
  • expand_abbrev: Expand Abbreviations
  • read_names: Read column names
  • flag_dupes: Flag Duplicate Rows With New Column
  • count_diff: Count set difference
  • flag_na: Flag Missing Values With New Column
  • prop_distinct: Proportion distinct
  • most_common: Find most common values
  • na_rep: Remove repeated character elements
  • dark2: Dark Color Palette
  • count_out: Count out
  • non_ascii: Show non-ASCII lines of file
  • rename_prefix: Convert data frame name suffixes to prefixes
  • rx_break: Form a word break regex pattern
  • fetch_city: Return Closest Match Result of Cities from Google Maps API
  • valid_zip: Almost all of the valid USA ZIP Codes
  • valid_state: US State Abbreviations
  • rx_phone: Phone number regex
  • %>%: Pipe operator
  • normal_address: Normalize street addresses
  • prop_in: Proportion in
  • progress_table: Create a progress table
  • this_file_new: Check if a single file is new
  • url2path: Make a File Path from a URL
  • rx_url: URL regex
  • guess_delim: Guess the delimiter of a text file
  • file_encoding: File Encoding
  • rx_state: State regex
  • usps_street: USPS Street Abbreviations
  • count_na: Count missing
  • na_in: Remove in
  • valid_abb: US State Abbreviations
  • flush_memory: Flush Garbage Memory
  • %out%: Inverted match
  • invalid_city: Invalid City Names
  • prop_na: Proportion missing
  • na_out: Remove out
  • normal_city: Normalize city names
  • normal_zip: Normalize ZIP codes
  • path.abbrev: Abbreviate a file path
  • prop_out: Proportion out
  • normal_phone: Normalize phone number
  • str_dist: Calculate string distance
  • normal_state: Normalize US State Abbreviations
  • rx_zip: ZIP code regex
  • scale_x_truncate: Truncate and wrap x-axis labels
  • str_normal: Normalize a character string
  • url_file_size: Check a URL file size
  • usps_state: USPS State Abbreviations
  • usps_city: USPS City Abbreviations
  • valid_city: US City Names
  • valid_name: US State Names
  • zipcodes: US City, state, and ZIP
  • use_diary: Create a new template data diary
  • what_out: Which out
  • what_in: Which in