panel_locf: Fill in missing (or other) values of a panel data set using known data

Description

This function looks for a list of values (usually, just NA) in a variable .var and overwrites those values with the most recent (or next-coming) values that are not from that list ("last observation carried forward").
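The idea described above can be sketched on a plain vector in base R. This is only an illustration of the carry-forward logic, not pmdplyr's implementation; `locf_sketch` is a hypothetical helper, and its `fill` argument merely mimics the role of .fill.

```r
# Hypothetical helper (not part of pmdplyr): carry the last kept value forward.
locf_sketch <- function(x, fill = NA) {
  # TRUE where the value is on the list of values to overwrite
  # (%in% also matches NA)
  overwrite <- x %in% fill
  # For each position, the index of the most recent kept position (0 if none yet)
  last_kept <- cummax(ifelse(overwrite, 0L, seq_along(x)))
  # Positions with no earlier kept value have nothing to pull from
  last_kept[last_kept == 0L] <- NA
  x[last_kept]
}

locf_sketch(c(NA, 5, NA, NA, 7, NA))
# NA 5 5 5 7 7

# Treat 0 as a value to overwrite as well
locf_sketch(c(0, 3, NA, 0, 4), fill = c(NA, 0))
# NA 3 3 3 4
```

Leading values stay missing because there is no earlier observation to carry forward.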

Usage

panel_locf(
  .var,
  .df = get(".", envir = parent.frame()),
  .fill = NA,
  .backwards = FALSE,
  .resolve = "error",
  .group_i = TRUE,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE
)

Arguments

.var

Vector to be modified.

.df

Data frame, pibble, or tibble (usually the one containing .var) that contains the panel structure variables either listed in .i and .t, or earlier declared with as_pibble(). If panel_locf() is called inside of a dplyr verb, this can be omitted and the data will be picked up automatically.

.fill

Vector of values to be overwritten. Just NA by default.

.backwards

By default, values to be overwritten are replaced with the value from the most recent available period. Set .backwards = TRUE to instead copy values from the closest *following* period.
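To make the direction of the fill concrete, here is a minimal base R sketch of the backwards case on a plain vector. `nocb_sketch` is a hypothetical name used only for illustration; it is not how pmdplyr implements .backwards = TRUE.

```r
# Hypothetical helper (not part of pmdplyr): pull each overwritten value
# from the closest following kept value.
nocb_sketch <- function(x, fill = NA) {
  overwrite <- x %in% fill
  n <- length(x)
  # For each position, the index of the nearest following kept position
  # (n + 1 if there is none)
  next_kept <- rev(cummin(rev(ifelse(overwrite, n + 1L, seq_along(x)))))
  # Positions with no later kept value have nothing to pull from
  next_kept[next_kept == n + 1L] <- NA
  x[next_kept]
}

nocb_sketch(c(NA, 5, NA, NA, 7, NA))
# 5 5 7 7 7 NA
```

In this direction it is the trailing values, not the leading ones, that remain missing.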

.resolve

If there is more than one observation per individual/period and the value of .var is identical for all of them, that's no problem. But what should panel_locf() do if they're not identical? Set .resolve = 'error' (or, really, any string) to throw an error in this circumstance. Or set .resolve to a function that can be used within dplyr::summarize() to select a single value per individual/period, for example .resolve = function(x) mean(x) to get the mean of all observations present for that individual/period. .resolve will also be used to fill in values if some values in a given individual/period are to be overwritten and others aren't. Using a function is quicker than .resolve = 'error', so if you're certain there's no issue, you can speed up execution by setting, say, .resolve = dplyr::first.
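What "resolving" means can be illustrated with a small base R sketch. The toy data frame `obs` and its column names are made up for this example, and `aggregate()` stands in for whatever pmdplyr does internally; the point is only to show conflicting duplicates being detected and then collapsed with a summary function.

```r
# Toy data (made up): individual 1 has two conflicting prices in period 1
obs <- data.frame(
  id    = c(1, 1, 1, 2),
  t     = c(1, 1, 2, 1),
  price = c(10, 20, 30, 5)
)

# Flag individual/period cells whose values disagree
# (the situation in which a string .resolve would raise an error)
conflicts <- aggregate(price ~ id + t, data = obs,
                       FUN = function(x) length(unique(x)) > 1)
any(conflicts$price)
# TRUE

# Collapse to one value per individual/period, in the spirit of
# .resolve = function(x) mean(x)
resolved <- aggregate(price ~ id + t, data = obs, FUN = mean)
resolved$price[resolved$id == 1 & resolved$t == 1]
# 15
```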

.group_i

By default, if .i is specified or found in the data, panel_locf() will group the data by .i, ignoring any grouping already implemented. Set .group_i = FALSE to avoid this.
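Why grouping by individual matters can be seen in a small base R sketch: filling within each individual keeps one individual's values from spilling into the next. The data frame `panel` and the helper `locf_na` are hypothetical, and `ave()` is used here only as a stand-in for grouping; this is not pmdplyr's code.

```r
# Toy panel (made up): two individuals observed over three periods
panel <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  t  = c(1, 2, 3, 1, 2, 3),
  x  = c(4, NA, NA, NA, 9, NA)
)

# Hypothetical helper: carry the previous value forward over NAs
locf_na <- function(x) {
  Reduce(function(prev, cur) if (is.na(cur)) prev else cur, x, accumulate = TRUE)
}

# Sort by individual and time, then fill within each individual
panel <- panel[order(panel$id, panel$t), ]
panel$x_grouped <- ave(panel$x, panel$id, FUN = locf_na)
panel$x_grouped
# 4 4 4 NA 9 9

# Without grouping, individual 1's value leaks into individual 2's first period
locf_na(panel$x)
# 4 4 4 4 9 9
```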

.i

Quoted or unquoted variables that identify the individual cases. Note that setting any one of .i, .t, or .d will override all three already applied to the data, and will return data that is as_pibble()d with all three, unless .setpanel=FALSE.

.t

Quoted or unquoted variable indicating the time. pmdplyr accepts two kinds of time variables: numeric variables where a fixed distance .d will take you from one observation to the next, or, if .d=0, any standard variable type with an order. Consider using the time_variable() function to create the necessary variable if your data uses a Date variable for time.

.d

Number indicating the gap in .t between one period and the next. For example, if .t indicates a single day but data is collected once a week, you might set .d=7. To ignore gap length and assume that "one period ago" is always the most recent prior observation in the data, set .d=0. By default, .d=1.

.uniqcheck

Logical parameter. Set to TRUE to always check whether .i and .t uniquely identify observations in the data. By default this is set to FALSE and the check is only performed once per session, and only if at least one of .i, .t, or .d is set.

Details

panel_locf() is unusual among last-observation-carried-forward functions (like zoo::na.locf()) in that it is usable even if observations are not uniquely identified by .t (and .i, if defined).

Examples


# NOT RUN {

# The SPrail data has some missing price values.
# Let's fill them in!
# Note .d=0 tells it to ignore how big the gaps are
# between one period and the next, just look for the most recent insert_date
# .resolve tells it what value to pick if there are multiple
# observed prices for that route/insert_date
# (.resolve is not necessary if .i and .t uniquely identify obs,
# or if .var is either NA or constant within them)
# Also note - this will fill in using CURRENT-period
# data first (if available) before looking for lagged data.
# Attach pmdplyr, which provides panel_locf() and the example data sets
library(pmdplyr)
data(SPrail)
sum(is.na(SPrail$price))
SPrail <- SPrail %>%
  dplyr::mutate(price = panel_locf(price,
    .i = c(origin, destination), .t = insert_date, .d = 0,
    .resolve = function(x) mean(x, na.rm = TRUE)
  ))

# The spec is a little easier with data like Scorecard where
# .i and .t uniquely identify observations
# so .resolve isn't needed.
data(Scorecard)
sum(is.na(Scorecard$earnings_med))
Scorecard <- Scorecard %>%
  # Let's speed this up by just doing four-year colleges in Colorado
  dplyr::filter(
    pred_degree_awarded_ipeds == 3,
    state_abbr == "CO"
  ) %>%
  # Now let's fill in NAs and also in case there are any erroneous 0s
  dplyr::mutate(earnings_med = panel_locf(earnings_med,
    .fill = c(NA, 0),
    .i = unitid, .t = year
  ))
# Note that there are still some missings - these are missings that come before the first
# non-missing value in that unitid, so there's nothing to pull from.
sum(is.na(Scorecard$earnings_med))
# }
