seqformat: Conversion between sequence representation formats

Description

Convert a sequence data set from one representation format to another.

Usage

seqformat(data, var = NULL, from, to, compress = FALSE, nrep = NULL, tevent,
  stsep = NULL, covar = NULL, SPS.in = list(xfix = "()", sdsep = ","),
  SPS.out = list(xfix = "()", sdsep = ","), id = 1, begin = 2, end = 3,
  status = 4, process = TRUE, pdata = NULL, pvar = NULL, limit = 100,
  overwrite = TRUE, fillblanks = NULL, tmin = NULL, tmax = NULL, missing = "*",
  with.missing = TRUE, right = "DEL", compressed, nr)

Value

A data frame for SRS, TSE, and SPELL outcomes, otherwise a matrix.

When from = "SPELL", the outcome has an attribute issues containing the indexes of the sequences with issues (truncated sequences, missing start time, spells before birth year, ...).

Arguments

data

Data frame, matrix, stslist state sequence object, or character string vector. The data to use. (If data is a tibble, it is internally converted with as.data.frame.)

A data frame or a matrix with sequence data in one or more columns when from = "STS" or from = "SPS". If sequence data are in a single column or in a string vector, they are assumed to be in compressed form (see stsep).

A data frame with at least four columns when from = "SPELL". Unless specified with the var, or id / begin / end / status arguments, the first four columns are assumed to be individual ID, spell start time, spell end time, and spell state status.

A state sequence object when from = "STS" or from is not specified.

var

NULL, List of Integers or Strings. Default: NULL. Indexes or names of the columns containing the sequence information in data. If NULL, all columns are considered.

from

String. Format of the input sequence data. It can be "STS" (successive states), "SPS" (successive state-duration spells), or "SPELL" (vertical id-start-end-state spells). Ignored when data is a stslist state sequence object.

to

String. Format of the output data. It can be "STS" (successive states), "DSS" (distinct successive states), "SPS" (sequences of spells), "SRS" (shifted replicated sequences), "SPELL" (vertical spells), or "TSE" (time stamped events).

compress

Logical. Default: FALSE. When to = "STS", to = "DSS", or to = "SPS", should the sequences (row vector of states) be concatenated into strings? See seqconc.

nrep

Integer. Number of shifted replications when to = "SRS".

tevent

Matrix. The transition-definition matrix when to = "TSE". It should be of size \(d \times d\) where \(d\) is the number of distinct states appearing in the sequences. The cell \((i,j)\) lists the events associated with a transition from state \(i\) to state \(j\). It can be created with seqetm.

stsep

NULL, Character. Default: NULL. When from = "STS" or from = "SPS", separator token between states in the compressed form (strings). If NULL, seqfcheck is called to automatically detect a separator among "-" and ":". Other separators must be explicitly specified. See seqdecomp.

covar

List of Integers or Strings. When to = "SRS", indexes or names of data columns to include as covariates in the output. Ignored otherwise. Applies only when data is a data frame with both sequence and covariate data. Must be used in conjunction with var. Covariate values are replicated across the shifted replicated rows.

SPS.in

List. Default: list(xfix = "()", sdsep = ","). Specifications for the state-duration couples in the input data when from = "SPS". The first element, xfix, specifies the prefix/suffix character. If a single character, it is used as both prefix and suffix. If a two-character string, the first character is used as prefix and the second one as suffix. xfix = "" means no prefix/suffix. The second element, sdsep, specifies the separator token between state and duration.

SPS.out

List. Default: list(xfix = "()", sdsep = ","). The specifications for the state-duration couples in the output data when to = "SPS". See SPS.in above.

id

NULL, Integer, String, Vector of Integers or Strings. Default: 1.

When from = "SPELL", index or name of the column containing the individual IDs in data (after var filtering).

When to = "TSE", index or name of the data column containing the individual IDs (after var filtering), or vector of unique individual IDs. If NULL, indexes of the sequences in the input data are used as IDs. If no id is provided when calling the function and from is not "SPELL", id is set as NULL.

When from = "SPELL" and to = "TSE", id cannot be NULL and the IDs in the TSE output refer to the IDs in the id column of the SPELL data.

begin

Integer or String. Default: 2. Index or name of the data column containing the spell start times (after var filtering) when from = "SPELL". Start times must be positive integers.

end

Integer or String. Default: 3. Index or name of the data column containing the spell end times (after var filtering) when from = "SPELL". End times must be positive integers.

status

Integer or String. Default: 4. Index or name of the data column containing the spell statuses (after var filtering) when from = "SPELL".

process

Logical. Default: TRUE. When from = "SPELL": if TRUE, create sequences on a process time axis; if FALSE, create sequences on a calendar time axis.

The process argument and the associated pdata and pvar arguments are intended for spell data with calendar begin and end times. When those times are ages, use process = FALSE with pdata = NULL to use those ages as process times. Option process = TRUE currently does not work for age times.

pdata

NULL, "auto", or data frame. Default: NULL. (Tibbles are internally converted with as.data.frame.)

To be used only with from = "SPELL" or to = "SPELL".

If NULL, start and end times of each spell in the from data are assumed to be ages if process = TRUE, and years if process = FALSE.

If "auto" and process = TRUE, ages are computed using the start time of the first spell of each individual as their birthdate. For process = FALSE, "auto" is equivalent to NULL.

A data frame containing the ID and the birth time of the individuals when from = "SPELL" or to = "SPELL". Use pvar to specify the column names. The ID is used to match the birth time of each individual with the sequence data. The birth time should be an integer. It is the start time from which the positions on the time axis are computed. It also serves to compute tmin and to guess tmax when the latter are NULL, from = "SPELL", and process = FALSE.

pvar

List of Integers or Strings. The indexes or names of the columns of the data frame pdata that contain the ID and the birth time of the individuals in that order.

limit

Integer. Default: 100. The maximum age of age sequences when from = "SPELL" and process = TRUE. Age sequences will be considered to start at 1 and to end at limit.

overwrite

Logical. Default: TRUE. When from = "SPELL" and episodes overlap each other: if TRUE, the most recent episode overwrites the older one; if FALSE, the most recent episode starts after the end of the previous one.

fillblanks

Character. Token used to fill gaps between episodes when from = "SPELL".

tmin

NULL or Integer. Default: NULL. When from = "SPELL" and process = FALSE, start time of the axis. If NULL, tmin is set as the earliest spell start time (min of begin) or, when pdata is a data frame, as the earliest birth time of the individuals.

tmax

NULL or Integer. Default: NULL. When from = "SPELL" and process = FALSE, end time of the axis. If NULL, tmax is set as the latest spell end time (max of end) or, when pdata is a data frame, as the sum of the latest spell end time and the latest birth time of the individuals.

missing

String. Default: "*". Token used for missing states in data. It will be replaced by NA in the output data. Ignored when data is a state sequence object (see seqdef), in which case the attribute nr is used as missing value token.

with.missing

Logical. Default: TRUE. When to = "SPELL", should the spells of missing states be included?

right

One of "DEL" or NA. Default: "DEL". When to = "SPELL" and with.missing=TRUE, set right=NA to include ending spells of missing states.

compressed

Deprecated. Use compress instead.

nr

Deprecated. Use missing instead.

Author

Gilbert Ritschard, Alexis Gabadinho, Pierre-Alexandre Fonta, Nicolas S. Müller, Matthias Studer

Details

The seqformat function converts data from one format to another. The input data is first converted into STS format and then converted into the output format. Depending on input and output formats, some information can be lost during the conversion process. The output is a matrix or a data frame, NOT a sequence stslist object. To process, print, and plot the sequences with TraMineR functions, you will have to first transform the returned data frame into a stslist state sequence object with seqdef. See Gabadinho et al. (2009) and Ritschard et al. (2009) for more details on longitudinal data formats and conversion between them.
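As a minimal sketch of the seqdef step mentioned above, the following reuses the bfspell20 data from the Examples section. It assumes TraMineR is installed; the object names bf.sts and bf.seq are chosen here for illustration only.

```r
library(TraMineR)

## Spell data shipped with TraMineR (also used in the Examples section)
data(bfspell)

## seqformat returns plain STS data, not a state sequence object
bf.sts <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = FALSE)
class(bf.sts)

## Wrap the result with seqdef before using other TraMineR functions
bf.seq <- seqdef(bf.sts)
class(bf.seq)
```

Only bf.seq is an stslist object and can be passed to the TraMineR plotting and analysis functions.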

When data is in "SPELL" format (from = "SPELL"), begin and end times are expected to be positions in the sequences. Therefore, they should be strictly positive integers. With process=TRUE, the outcome sequences will be aligned on ages (process duration since birth), while with process=FALSE they will be aligned on dates (position on the calendar time). If process=TRUE, values in the begin and end columns of data are assumed to be ages when pdata is NULL and integer dates otherwise. If process=FALSE, begin and end values are assumed to be integer dates when pdata is NULL and ages otherwise.
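The expected layout of SPELL input can be illustrated with a small hand-made data set. This is only a sketch: the data frame spells and its values are hypothetical, and the times are treated as integer dates on a calendar axis (process = FALSE, pdata left as NULL).

```r
library(TraMineR)

## Hypothetical toy spell data: one row per spell,
## with strictly positive integer begin and end positions
spells <- data.frame(
  id     = c(1, 1, 2),
  begin  = c(1, 4, 1),
  end    = c(3, 6, 6),
  status = factor(c("A", "B", "A"))
)

## Align the sequences on the calendar axis
sts <- seqformat(spells, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "status",
  process = FALSE)
sts
```

Each individual should end up as one row of sts, with the time axis derived from the earliest begin and latest end values as described for tmin and tmax.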

To convert from person-period data use from = "SPELL" and set both begin and end as the index or name of the time (period) column. Alternatively, use the reshape command of stats, which is more efficient.
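The reshape route can be sketched with base R alone. The person-period data frame pp below is hypothetical and only serves to show the call.

```r
## Hypothetical person-period data: one row per individual and period
pp <- data.frame(
  id    = rep(1:2, each = 3),
  time  = rep(1:3, times = 2),
  state = c("A", "A", "B", "B", "B", "A")
)

## Wide layout: one row per individual, one state column per period
wide <- reshape(pp, idvar = "id", timevar = "time", v.names = "state",
  direction = "wide")
wide
```

The resulting state columns (here columns 2 to 4 of wide) hold the sequences in STS form and could then be passed to seqdef through its var argument.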

References

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Ritschard, G., A. Gabadinho, M. Studer and N. S. Müller (2009). Converting between various sequence representations. In Ras, Z. & Dardzinska, A. (eds.) Advances in Data Management, Springer, 223, 155-175.

Examples

## ========================================
## Examples with raw STS sequences as input
## ========================================

## Loading a data frame with sequence data in the columns 13 to 24
data(actcal)

## Converting to SPS format
actcal.SPS.A <- seqformat(actcal, 13:24, from = "STS", to = "SPS")
head(actcal.SPS.A)

## Converting to compressed SPS format with no
## prefix/suffix and with "/" as state/duration separator
actcal.SPS.B <- seqformat(actcal, 13:24, from = "STS", to = "SPS",
  compress = TRUE, SPS.out = list(xfix = "", sdsep = "/"))
head(actcal.SPS.B)

## Converting to compressed DSS format
actcal.DSS <- seqformat(actcal, 13:24, from = "STS", to = "DSS",
  compress = TRUE)
head(actcal.DSS)


## ==============================================
## Examples with a state sequence object as input
## ==============================================

## Loading a data frame with sequence data in the columns 10 to 25
data(biofam)

## Limiting the number of considered cases to the first 20
biofam <- biofam[1:20, ]

## Creating a state sequence object
biofam.labs <- c("Parent", "Left", "Married", "Left/Married",
  "Child", "Left/Child", "Left/Married/Child", "Divorced")
biofam.short.labs <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam, 10:25, alphabet = 0:7,
  states = biofam.short.labs, labels = biofam.labs)

## Converting to SPELL format
bf.spell <- seqformat(biofam.seq, from = "STS", to = "SPELL",
  pdata = biofam, pvar = c("idhous", "birthyr"))
head(bf.spell)

## Converting to shifted replicated sequences (SRS)
bf.srs <- seqformat(biofam, var = 10:25, from = "STS", to = "SRS",
  covar = c("sex", "plingu02"))
tail(bf.srs)


## ======================================
## Examples with SPELL sequences as input
## ======================================

## Loading two data frames: bfspell20 and bfpdata20
## bfspell20 contains the first 20 biofam sequences in SPELL format
## bfpdata20 contains the IDs and the years at which the
## considered individuals were aged 15
data(bfspell)

## Converting to STS format with alignment on calendar years
bf.sts.y <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = FALSE)
head(bf.sts.y)

## Converting to STS format with alignment on ages
bf.sts.a <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = TRUE, pdata = bfpdata20, pvar = c("id", "when15"),
  limit = 16)
names(bf.sts.a) <- paste0("a", 15:30)
head(bf.sts.a)


## ==================================
## Examples for TSE and SPELL output
## in presence of missing values
## ==================================

data(ex1) ## STS data with missing values
## Creating the state sequence object; by default,
## the end missings are coded as void ('%')
sqex1 <- seqdef(ex1[,1:13])
as.matrix(sqex1)

## Creating state-event transition matrices
ttrans <- seqetm(sqex1, method='transition')
tstate <- seqetm(sqex1, method='state')

## Converting into time stamped events
seqformat(sqex1, from = "STS", to = "TSE", tevent = ttrans)
seqformat(sqex1, from = "STS", to = "TSE", tevent = tstate)

## Converting into vertical spell data
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=TRUE)
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=TRUE, right=NA)
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=FALSE)
