
healthbR (version 0.1.1)

vigitel_data: Load VIGITEL microdata

Description

Downloads (if necessary) and loads VIGITEL survey microdata into R. Data is automatically converted to Parquet format for faster subsequent loading. The data includes survey weights for proper statistical analysis.

Usage

vigitel_data(
  year,
  vars = NULL,
  force_download = FALSE,
  parallel = TRUE,
  lazy = FALSE,
  cache_dir = NULL
)

Value

A tibble with the VIGITEL microdata. When multiple years are requested, a year column is added to identify the source year. If lazy = TRUE, returns an Arrow Dataset that can be queried with dplyr verbs before calling collect().

Arguments

year

Year(s) of the survey. Can be:

  • Single year: 2023

  • Range: 2021:2023

  • Vector: c(2021, 2023)

  • Character: c("2021", "2023")

  • All years: "all"

vars

Character vector. Variable names to select, or NULL for all variables. Default is NULL.

force_download

Logical. If TRUE, re-download and reconvert data. Default is FALSE.

parallel

Logical. If TRUE, download and process multiple years in parallel. Default is TRUE; parallel processing applies only when more than one year is requested and requires the furrr and future packages (see Details).

lazy

Logical. If TRUE, return an Arrow Dataset for lazy evaluation instead of loading all data into memory. Useful for filtering large datasets before collecting. Use collect() to retrieve results. Default is FALSE.

cache_dir

Character. Optional custom cache directory. If NULL (default), uses the standard user cache directory. Use tempdir() for temporary storage that won't persist.

Details

On first access, data is downloaded from the Ministry of Health and converted to Parquet format. Subsequent loads read directly from the Parquet file, which is significantly faster.

The arrow package is required for Parquet file support. If not installed, an informative error message will be shown with installation instructions.

For parallel downloads, the function uses the furrr and future packages if installed. Install them with install.packages(c("furrr", "future")) to enable parallel processing. The number of workers is automatically set based on available CPU cores. If these packages are not installed, processing falls back to sequential mode.
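As a rough illustration of a multi-year request, the sketch below loads three survey years sequentially by setting parallel = FALSE explicitly; dropping that argument would let the function use furrr and future if they are installed. It uses only the arguments and variable names documented on this page. Inspecting names(df) is a simple way to find the added year column without assuming its exact name.

```r
library(healthbR)

# Request a range of years; parallel = FALSE forces sequential processing,
# which is what the function falls back to when furrr/future are missing.
df <- vigitel_data(
  2021:2023,
  vars = c("cidade", "sexo", "idade", "pesorake"),
  parallel = FALSE,
  cache_dir = tempdir()  # temporary cache, not kept between sessions
)

# List the columns, including the one that identifies the source year
names(df)
```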

When lazy = TRUE, the function returns an Arrow Dataset that supports dplyr operations (filter, select, mutate, etc.) without loading data into memory. This is useful for working with large datasets or when you only need a subset of the data. Call collect() to retrieve the results as a tibble.
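A minimal sketch of the lazy workflow described above, assuming the arrow and dplyr packages are installed. The object name subset_tbl is illustrative only. The filter is deliberately type-agnostic (it keeps rows with a non-missing weight), since the storage types of the individual columns are not stated here.

```r
library(healthbR)
library(dplyr)

# lazy = TRUE returns an Arrow Dataset; no rows are read into memory yet
ds <- vigitel_data(2023, lazy = TRUE, cache_dir = tempdir())

# Build the query with dplyr verbs, then collect() to materialise a tibble
subset_tbl <- ds %>%
  select(cidade, sexo, idade, pesorake) %>%
  filter(!is.na(pesorake)) %>%
  collect()
```

Only the selected columns and matching rows are read from the Parquet file when collect() runs.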

The VIGITEL survey uses complex sampling weights. For valid estimates and standard errors, analyze the data with a survey-analysis package such as survey or srvyr. The weight variable is named pesorake.
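One possible way to apply the pesorake weights with the survey package is sketched below. It rests on several assumptions that are not confirmed by this page: that idade and pesorake can be coerced to numeric, and that a design with weights only (ids = ~1, no strata or clusters) is an acceptable simplification. Check the VIGITEL methodology before relying on the resulting standard errors.

```r
library(healthbR)
library(survey)

df <- vigitel_data(2023, vars = c("cidade", "sexo", "idade", "pesorake"),
                   cache_dir = tempdir())

# Assumption: both columns hold numeric values (coercion is a no-op if so)
df$idade <- as.numeric(df$idade)
df$pesorake <- as.numeric(df$pesorake)

# svydesign() rejects missing weights, so drop those rows first
df <- df[!is.na(df$pesorake), ]

# Assumption: weights-only design, ignoring any strata or clustering
design <- svydesign(ids = ~1, weights = ~pesorake, data = df)

# Weighted mean age with its standard error
svymean(~idade, design, na.rm = TRUE)
```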

Examples

# \donttest{
# single year (uses tempdir to avoid leaving files on system)
df <- vigitel_data(2023, cache_dir = tempdir())

# specific variables
df <- vigitel_data(2023, vars = c("cidade", "sexo", "idade", "pesorake"),
                   cache_dir = tempdir())
# }
