ipumsr (version 0.9.0)

read_ipums_agg: Read data from an IPUMS aggregate data extract

Description

Read a .csv file from an extract downloaded from an IPUMS aggregate data collection (IPUMS NHGIS or IPUMS IHGIS).

To read spatial data from an NHGIS extract, use read_ipums_sf().

Usage

read_ipums_agg(
  data_file,
  file_select = NULL,
  vars = NULL,
  col_types = NULL,
  n_max = Inf,
  guess_max = min(n_max, 1000),
  var_attrs = c("val_labels", "var_label", "var_desc"),
  remove_extra_header = TRUE,
  file_encoding = NULL,
  verbose = TRUE
)

Value

A tibble containing the data found in data_file.

Arguments

data_file

Path to a .zip archive containing an IPUMS NHGIS or IPUMS IHGIS extract, or to a single .csv file from such an extract.

file_select

If data_file is a .zip archive that contains multiple files, an expression identifying the file to load. Accepts a character vector specifying the file name, a tidyselect selection, or an index position. This must uniquely identify a file.

vars

Names of variables to include in the output. Accepts a vector of names or a tidyselect selection. If NULL, includes all variables in the file.

col_types

One of NULL, a cols() specification, or a string. If NULL, all column types will be inferred from the values in the first guess_max rows of each column. Alternatively, you can use a compact string representation to specify column types, with one character per column:

  • c = character

  • i = integer

  • n = number

  • d = double

  • l = logical

  • f = factor

  • D = date

  • T = date time

  • t = time

  • ? = guess

  • _ or - = skip

See read_delim() for more details.
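The compact notation comes from readr, so it can be tried out with readr alone. The sketch below is illustrative only: the inline CSV and its column names (code, count, share) are hypothetical stand-ins for an extract file, and it assumes read_ipums_agg() hands col_types to readr unchanged.

```r
library(readr)

# Hypothetical inline CSV standing in for a file from an extract.
csv_text <- I("code,count,share\n001,10,0.25\n002,20,0.75\n")

# Compact string: one letter per column, in column order
# (c = character, i = integer, d = double).
compact <- read_csv(csv_text, col_types = "cid")

# The same request written as a cols() specification, keyed by column name.
explicit <- read_csv(
  csv_text,
  col_types = cols(
    code = col_character(),
    count = col_integer(),
    share = col_double()
  )
)

# Reading `code` as character keeps its leading zeros, which is why
# geographic codes are usually better read as character values.
compact$code
```

A named specification is generally easier to maintain than a compact string when a file has many columns, because it does not depend on column order.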

n_max

Maximum number of lines to read.

guess_max

For .csv files, maximum number of lines to use for guessing column types. Will never use more than the number of lines read.

var_attrs

Variable attributes to add from the codebook (.txt) file included in the extract. Defaults to all available attributes.

See set_ipums_var_attributes() for more details.

remove_extra_header

If TRUE, remove the additional descriptive header row included in some NHGIS .csv files.

This header row is usually not needed, because it largely duplicates the information stored in the "label" attribute of each data column (if var_attrs includes "var_label").
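As a rough sketch of how these two arguments interact, the snippet below reads the package's example NHGIS extract with only variable labels attached, then inspects the "label" attribute of one column. It assumes ipumsr 0.9.0 is installed with its bundled example files; the column D6Z002 is taken from the Examples section of this page.

```r
library(ipumsr)

nhgis_file <- ipums_example("nhgis0972_csv.zip")

# Attach only variable labels from the codebook, and drop the extra
# descriptive header row (the default behavior, stated explicitly here).
nhgis_data <- read_ipums_agg(
  nhgis_file,
  var_attrs = "var_label",
  remove_extra_header = TRUE,
  verbose = FALSE
)

# The label that would otherwise appear in the extra header row
# is stored on the column itself.
attr(nhgis_data$D6Z002, "label")
```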

file_encoding

Encoding for the file to be loaded. For NHGIS extracts, defaults to ISO-8859-1. For IHGIS extracts, defaults to UTF-8. If the default encoding produces unexpected characters, adjust the encoding here.

verbose

Logical controlling whether to display output when loading data. If TRUE, displays IPUMS conditions, a progress bar, and column types. Otherwise, all are suppressed.

This setting is overridden by the readr.show_progress and readr.show_col_types options if they are set.
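One way to silence readr's progress bar and column-type messages for a whole session, rather than per call, is to set those options directly. This is a minimal sketch assuming ipumsr 0.9.0 and its bundled example file; other output controlled by verbose (such as IPUMS conditions) is not affected by these readr options.

```r
library(ipumsr)

# Set the readr options for the session, keeping the old values
# so they can be restored afterwards.
old_opts <- options(
  readr.show_progress = FALSE,
  readr.show_col_types = FALSE
)

# `verbose` is left at its default; the readr options above decide
# whether the progress bar and column types are shown.
nhgis_data <- read_ipums_agg(ipums_example("nhgis0972_csv.zip"))

# Restore the previous option values.
options(old_opts)
```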

See Also

read_ipums_sf() to read spatial data from an IPUMS extract.

read_nhgis_codebook() or read_ihgis_codebook() to read metadata about an IPUMS aggregate data extract.

ipums_list_files() to list files in an IPUMS extract.

Examples

nhgis_file <- ipums_example("nhgis0972_csv.zip")
ihgis_file <- ipums_example("ihgis0014.zip")

# Provide the .zip archive directly to load the data inside:
read_ipums_agg(nhgis_file)

# For extracts that contain multiple files, use `file_select` to specify
# a single file to load. This accepts a tidyselect expression:
read_ipums_agg(ihgis_file, file_select = matches("AAA_g0"), verbose = FALSE)

# Or an index position:
read_ipums_agg(ihgis_file, file_select = 2, verbose = FALSE)

# Variable metadata is automatically attached to the data, if available:
ihgis_data <- read_ipums_agg(ihgis_file, file_select = 2, verbose = FALSE)
ipums_var_info(ihgis_data)

# Column types are inferred from the data. You can manually specify
# column types with `col_types`. This may be useful for geographic codes,
# which should typically be interpreted as character values:
read_ipums_agg(nhgis_file, col_types = list(MSA_CMSAA = "c"), verbose = FALSE)

# You can also read in a subset of the data file:
read_ipums_agg(
  nhgis_file,
  n_max = 15,
  vars = c(GISJOIN, YEAR, D6Z002),
  verbose = FALSE
)