
Reads a dataset downloaded from the IPUMS extract system in chunks: each chunk is read into memory, your code is applied to it, and then the next chunk is read. This lets you work with data that is too large to fit in your computer's RAM all at once.
read_ipums_micro_chunked(ddi, callback, chunk_size = 10000,
  vars = NULL, data_file = NULL, verbose = TRUE,
  var_attrs = c("val_labels", "var_label", "var_desc"))

read_ipums_micro_list_chunked(ddi, callback, chunk_size = 10000,
  vars = NULL, data_file = NULL, verbose = TRUE,
  var_attrs = c("val_labels", "var_label", "var_desc"))
ddi: Either a filepath to a DDI XML file downloaded from the IPUMS website, or an ipums_ddi object parsed by read_ipums_ddi.

callback: An ipums_callback object, or a function that will be converted to an IpumsSideEffectCallback object.

chunk_size: An integer indicating how many observations to read in per chunk (defaults to 10,000). Setting this higher uses more RAM, but will usually be faster.

vars: Names of variables to load. Accepts a character vector of names, or dplyr_select_style conventions. For hierarchical data, the rectype id variable will be added even if it is not specified.

data_file: A directory to look for the data file in. If left empty, it will look in the same directory as the DDI file.

verbose: Logical, indicating whether to print progress information to the console.

var_attrs: Variable attributes to add from the DDI; defaults to adding all three (val_labels, var_label, and var_desc). See set_ipums_var_attributes for more details.
Value: Depends on the callback object.
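As noted under the callback argument, a plain function can be passed instead of a callback object; it is wrapped in an IpumsSideEffectCallback, so its return values are discarded and it is useful only for side effects. A minimal sketch, using the cps_00006.xml example file bundled with ipumsr (the running total via `<<-` is an illustrative pattern, not part of the package API):

```r
library(ipumsr)

# Count total observations across chunks without keeping any chunk in memory
total_rows <- 0
read_ipums_micro_chunked(
  ipums_example("cps_00006.xml"),
  function(x, pos) {
    # pos is the row position of the first observation in this chunk;
    # the return value is discarded (side-effect callback)
    total_rows <<- total_rows + nrow(x)
  },
  chunk_size = 1000,
  verbose = FALSE
)
total_rows
```

To collect results instead of discarding them, use IpumsDataFrameCallback, as in the examples below.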
Other ipums_read: read_ipums_micro_yield, read_ipums_micro, read_ipums_sf, read_nhgis, read_terra_area, read_terra_micro, read_terra_raster
# NOT RUN {
# Select Minnesotan cases from the CPS example (note that you can also
# accomplish this, and avoid downloading a huge file, using the
# "Select Cases" functionality of the IPUMS extract system)
mn_only <- read_ipums_micro_chunked(
  ipums_example("cps_00006.xml"),
  IpumsDataFrameCallback$new(function(x, pos) {
    x[x$STATEFIP == 27, ]
  }),
  chunk_size = 1000 # Generally you want this larger; this example file is small
)
# Tabulate average INCTOT by state without storing the full dataset in memory
library(dplyr)
inc_by_state <- read_ipums_micro_chunked(
  ipums_example("cps_00006.xml"),
  IpumsDataFrameCallback$new(function(x, pos) {
    x %>%
      mutate(
        INCTOT = lbl_na_if(
          INCTOT, ~ .lbl %in% c("Missing.", "N.I.U. (Not in Universe)."))
      ) %>%
      filter(!is.na(INCTOT)) %>%
      group_by(STATEFIP = as_factor(STATEFIP)) %>%
      summarize(INCTOT_SUM = sum(INCTOT), n = n())
  }),
  chunk_size = 1000 # Generally you want this larger; this example file is small
) %>%
  group_by(STATEFIP) %>%
  summarize(avg_inc = sum(INCTOT_SUM) / sum(n))
# x will be a list when using `read_ipums_micro_list_chunked()`
read_ipums_micro_list_chunked(
  ipums_example("cps_00010.xml"),
  IpumsSideEffectCallback$new(function(x, pos) {
    print(paste0(
      nrow(x$PERSON), " persons and ", nrow(x$HOUSEHOLD),
      " households in this chunk."
    ))
  }),
  chunk_size = 1000 # Generally you want this larger; this example file is small
)
# Using the biglm package, you can even run a regression without storing
# the full dataset in memory
library(dplyr)
if (require(biglm)) {
  lm_results <- read_ipums_micro_chunked(
    ipums_example("cps_00015.xml"),
    IpumsBiglmCallback$new(
      INCTOT ~ AGE + HEALTH, # Simple regression (may not be very useful)
      function(x, pos) {
        x %>%
          mutate(
            INCTOT = lbl_na_if(
              INCTOT, ~ .lbl %in% c("Missing.", "N.I.U. (Not in Universe).")
            ),
            HEALTH = as_factor(HEALTH)
          )
      }
    ),
    chunk_size = 1000 # Generally you want this larger; this example file is small
  )
  summary(lm_results)
}
# }