lexpand: Split case-level observations

Description

Given subject-level data, lexpand splits each observation by calendar time (per), age, and follow-up time (fot, from 0 to the end of follow-up) into subject-time-interval rows according to the given breaks, and processes the result further if requested.

Usage

lexpand(data, birth = bi_date, entry = dg_date, exit = ex_date,
  event = NULL, status = status != 0, entry.status = NULL,
  breaks = list(fot = c(0, Inf)), id = NULL, overlapping = TRUE,
  aggre = NULL, aggre.type = c("unique", "cross-product"), drop = TRUE,
  pophaz = NULL, pp = TRUE, subset = NULL, merge = TRUE,
  verbose = FALSE, ...)

Arguments

data

dataset of e.g. cancer cases as rows

birth

birth time in date format or fractional years; quoted or unquoted

entry

entry time in date format or fractional years; quoted or unquoted

exit

exit from follow-up time in date format or fractional years; quoted or unquoted

event

advanced: time of possible event differing from exit; typically only used in certain SIR/SMR calculations - see Details; keep NULL if exit is the time of the event; quoted or unquoted

status

variable indicating type of event at exit or event; e.g. status = status != 0; expression or quoted variable name

entry.status

input in the same way as status; status at entry; see Details

breaks

a named list of vectors of time breaks; e.g. breaks = list(fot=0:5, age=c(0,45,65,Inf)); see Details

id

an id variable; e.g. id = my_id; quoted or unquoted

overlapping

advanced, logical; if FALSE AND if data contains multiple rows per subject AND event is defined, ensures that the timelines of lex.id-specific rows do not overlap; this ensures e.g. that person-years are

aggre

e.g. aggre = list(sex, fot); a list of unquoted variables and/or expressions thereof, which are interpreted as factors; data events and person-years will be aggregated by the unique combinations of these; see Details

aggre.type

either "unique" or "cross-product"; can be abbreviated; state transitions and person-year will be calculated either for all existing levels of expressions in aggre, or for the cross-product of all possible existi

drop

logical; if TRUE, drops all resulting rows after splitting that reside outside the time window as defined by the given breaks (all time scales)

pophaz

a dataset of population hazards to merge with split data; see Details

pp

logical; if TRUE, computes Pohar-Perme weights using pophaz; adds variable with reserved name pp; see Details for computing method

subset

a logical vector or any logical condition; data is subset accordingly before splitting

merge

logical; if TRUE, retains all original variables from the data

verbose

logical; if TRUE, the function is chatty and returns some messages along the way

...

e.g. fot = 0:5; instead of specifying a breaks list, correctly named breaks vectors can be given for fot, age, and per; these override any breaks in the breaks list

Value

If aggre = NULL, returns a data.table or data.frame (depending on options("popEpi.datatable"); see ?popEpi) object expanded to accommodate split observations, with time scales as fractional years and pophaz merged in if given. Population hazard levels are stored in the new variable pop.haz, and Pohar-Perme weights in the new variable pp if requested.

If aggre is defined, returns a long-format data.table/data.frame with the variable pyrs (person-years) and variables for the counts of transitions in state or state at end of follow-up, named in the form fromXtoY, where X and Y are the states transitioned from and to, respectively.

Details

Basics

lexpand splits a given dataset (with e.g. cancer diagnoses as rows) into subintervals of time over calendar time, age, and follow-up time with given time breaks using splitMulti.

The dataset must contain appropriate Date / IDate / date format or other numeric variables that can be used as the time variables.

You may take a look at the simulated cohort sire as an example of the minimum required information for processing data with lexpand.

Breaks

You should define all breaks as left-inclusive and right-exclusive time points (e.g. [a,b)) for 1-3 time dimensions so that the last member of a breaks vector is a meaningful "final upper limit", e.g. per = c(2002,2007,2012) to create a last subinterval of the form [2007,2012).

All breaks are explicit, i.e. if drop = TRUE, any data beyond the outermost break points are dropped. If one wants to have unspecified upper / lower limits on one time scale, use Inf: e.g. breaks = list(fot = 0:5, age = c(0,45,Inf)). Breaks for per can also be given in Date/IDate/date format, whereupon they are converted to fractional years before being used in splitting.

Time variables

If any of the given time variables (birth, entry, exit, event) is in any kind of date format, they are first coerced to fractional years before splitting using get.yrs (with year.length = "actual").

Sometimes in e.g. SIR/SMR calculation one may want the event time to differ from the time of exit from follow-up, if the subject is still considered to be at risk of the event. If event is specified, the transition to status is moved to event from exit using cutLexis. See Examples.

The status variable

The statuses in the expanded output (lex.Cst and lex.Xst) are determined by using either only status or both status and entry.status.
If entry.status = NULL, the status at entry is guessed according to the type of variable supplied via status: for numeric variables it will be zero, for factors the first level (levels(status)[1]), and otherwise the first unique value in alphabetical order (sort(unique(status))[1]).

Using numeric or factor status variables is strongly recommended. Logical expressions are also allowed (e.g. status = my_status != 0L) and are converted to integer internally.

Merging population hazard information

To enable computing relative/net survivals with survtab and relpois, lexpand merges appropriate population hazard data (pophaz) to the expanded data before dropping rows outside the specified time window (if drop = TRUE). pophaz must therefore contain at a minimum the variables named agegroup, year, and haz. pophaz may contain additional variables to specify different population hazard levels in different strata; e.g. popmort includes sex. All the strata-defining variables must be present in the supplied data. lexpand will automatically detect variables with common names in the two datasets and merge using them.

Currently year must be an integer variable specifying the appropriate year. agegroup must currently also specify one-year age groups, e.g. popmort specifies 101 age groups of length 1 year. In both year and agegroup variables the values are interpreted as the lower bounds of intervals (and passed on to a cut call). The mandatory variable haz must specify the appropriate average rate at the person-year level; e.g. haz = -log(survProb), where survProb is a one-year conditional survival probability, will be the correct hazard specification.

The corresponding pophaz population hazard value is merged by using the mid-points of the records after splitting as reference values. E.g. if age = 89.9 at the start of a 1-year interval, then the reference age value is 90.4 for merging.
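The hazard specification and the mid-point rule can be illustrated with a small base-R sketch. It does not call lexpand and is not its internal code: the pophaz_demo table and its survival probabilities are made-up illustrative values, and findInterval is used here only as a stand-in for the cut-based lookup described above.

```r
# Hypothetical one-year conditional survival probabilities by age group
# (illustrative numbers only, not taken from popmort)
pophaz_demo <- data.frame(
  agegroup = 88:91,                       # lower bounds of one-year age groups
  survProb = c(0.86, 0.84, 0.82, 0.80)
)

# Average rate at the person-year level, as in haz = -log(survProb)
pophaz_demo$haz <- -log(pophaz_demo$survProb)

# A split record that starts at age 89.9 and lasts one year
age_start <- 89.9
duration  <- 1
age_mid   <- age_start + duration / 2     # reference age used for merging

# agegroup values are lower bounds, so pick the left-closed interval
# that contains the mid-point
idx <- findInterval(age_mid, pophaz_demo$agegroup)
merged <- pophaz_demo[idx, c("agegroup", "haz")]
print(merged)
```

In this sketch the record is matched to age group 90 rather than 89, because the mid-point of the interval, not its start, serves as the reference value.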
This way we get a "typical" population hazard level for each record.

Computing Pohar-Perme weights

If pp = TRUE, Pohar-Perme weights (the inverse of cumulative population survival) are computed. This will create the new pp variable in the expanded data. pp is a reserved name and lexpand throws an error if a variable with that name exists in data.

When a survival interval contains one or several rows per subject (e.g. due to splitting by the per scale), pp is cumulated from the beginning of the first record in a survival interval for each subject to the mid-point of the remaining time within that survival interval, and that value is given for every other record that a given person has within the same survival interval.

E.g. with 5 rows of duration 1/5 within a survival interval [0,1), pp is determined for all records by a cumulative population survival from 0 to 0.5. The existing accuracy is used, so that the weight is cumulated first up to the end of the second row and then over the remaining distance to the mid-point (first to 0.4, then to 0.5). This ensures that more accurately merged population hazards are fully used.

Aggregating

Certain analyses such as SIR/SMR calculations require tables of events and person-years by the unique combinations (interactions) of several variables. For this, aggre can be specified as a list of such variables (preferably factor variables, but this is not mandatory) and any arbitrary functions of the variables at one's disposal. E.g. aggre = list(sex, agegr = cut(dg_age, 0:100)) would tabulate events and person-years by sex and an ad-hoc age group variable. Every ad-hoc-created variable should be named.

fot, per, and age are special reserved variables which, when present in the aggre list, are output as categories of the corresponding time scale variables by using e.g. cut(fot, breaks$fot, right=FALSE). This only works if the corresponding breaks are defined in breaks or via .... E.g. aggre = list(sex, fot.int = fot) with breaks = list(fot=0:5). The output variable fot.int in the above example will have the lower limits of the appropriate intervals as values.

aggre as a named list will output numbers of events and person-years with the given new names as categorizing variable names, e.g. aggre = list(follow_up = fot, gender = sex, agegroup = age).

The output table has person-years (pyrs) and event (mutation) counts (e.g. from0to1) as columns. Event counts are the numbers of mutations (lex.Cst != lex.Xst) or the lex.Xst value at a subject's last record (subject possibly defined by id).

If aggre.type = "unique", the above results are computed for existing combinations of expressions given in aggre, but also for non-existing combinations if aggre.type = "cross-product". E.g. if a factor variable has levels "a", "b", "c" but the data is limited to only have levels "a", "b" present (more than zero rows have these level values), the former setting only computes results for "a", "b", and the latter also for "c" and any combination with other variables or expressions given in aggre.
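The difference between the two aggre.type settings can be imitated in base R. The following is a minimal sketch under stated assumptions, not lexpand's own implementation: the rows data frame is hypothetical, the factor is called grp, and its pyrs and from0to1 columns merely mimic the names described above.

```r
# Hypothetical split records: person-years and a 0/1 transition count per row.
# The factor grp has level "c", but no row uses it.
rows <- data.frame(
  grp      = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  fot      = factor(c(0, 1, 0, 0)),
  pyrs     = c(1.0, 0.5, 1.0, 0.3),
  from0to1 = c(0, 1, 0, 1)
)

# Analogue of "unique": sums only for combinations that occur in the data
uniq <- aggregate(cbind(pyrs, from0to1) ~ grp + fot, data = rows, FUN = sum)

# Analogue of "cross-product": every combination of factor levels,
# with zero person-years and zero events where no rows exist
grid  <- expand.grid(grp = levels(rows$grp), fot = levels(rows$fot))
cross <- merge(grid, uniq, by = c("grp", "fot"), all.x = TRUE)
cross$pyrs[is.na(cross$pyrs)]         <- 0
cross$from0to1[is.na(cross$from0to1)] <- 0

print(uniq)
print(cross)
```

Here uniq holds only the three combinations present in the data, while cross holds all six combinations of the three grp levels and two fot levels; totals of person-years and events are the same in both tables.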

Examples

## prepare data for e.g. 5-year cohort survival calculation
x <- lexpand(sire, breaks=list(fot=seq(0, 5, by = 1/12)),
             status =  status != 0, pophaz=popmort)

## prepare data for e.g. 5-year "period analysis" for 2008-2012
BL <- list(fot = seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01"))
x <- lexpand(sire, breaks = BL, pophaz=popmort, status =  status != 0)

## aggregating
BL <- list(fot = 0:5, per = c("2003-01-01","2008-01-01", "2013-01-01"))
ag <- lexpand(sire, breaks = BL, status = status != 0,
              aggre=list(sex, period = per, surv.int = fot))

## using "..."
x <- lexpand(sire, fot=0:5, pophaz=popmort, status =  status != 0)

x <- lexpand(sire, fot=0:5, status =  status != 0,
             aggre=list(sex, surv.int = fot))

## using the "event" argument: it just places the transition to given "status"
## at the "event" time instead of at the end, if possible using cutLexis
x <- lexpand(sire, status = status, event = dg_date, birth=bi_date, entry=bi_date, exit=ex_date)

## aggregating with custom "event" time

x <- lexpand(sire, status = status, event = dg_date, birth=bi_date, entry=bi_date, exit=ex_date,
             per = 1970:2014, age = c(0:100,Inf),
             aggre = list(sex, year = per, agegroup = age))
