seqformat: Conversion between sequence formats

Description

Convert a sequence data set from one format to another.

Usage

seqformat(data, var=NULL, id=NULL,
         from, to, compressed=FALSE,
         nrep=NULL, tevent, stsep=NULL, covar=NULL,
         SPS.in=list(xfix="()", sdsep=","),
         SPS.out=list(xfix="()", sdsep=","),
         begin=NULL, end=NULL, status=NULL,
         process=TRUE, pdata=NULL, pvar=NULL,
         limit=100, overwrite=TRUE,
         fillblanks=NULL, tmin=NULL, tmax=NULL, nr="*")

Arguments

data

a data frame or matrix containing sequence data.

var

List of columns with the sequence data. Default is NULL, i.e., all columns. Sequences are assumed to be in compressed form (character strings) when there is a single column and in extended form otherwise.

id

Column containing the 'id' of the sequences. Mandatory with from="SPELL" in order to identify the spells that belong to the same sequence.

from

Format of the input data. One of "STS", "SPS", "SPELL". If data is a sequence object, format is automatically set to "STS".

to

Format of the output data. One of "STS", "SPS", "SRS", "DSS", "TSE".

compressed

Logical. Should "STS", "SPS" or "DSS" output be compressed into character strings? Ignored for other output formats.

nrep

Number of shifted replications for output in "SRS" format.

tevent

Transition definition matrix for converting to time-stamped-event ("TSE") format. Should be a matrix of size $d * d$ where $d$ is the number of distinct states appearing in the sequences. In this matrix, the cell $(i,j)$ lists the events asso

stsep

Separator character between successive elements in compressed (character strings) input data. If NULL (default value), the seqfcheck function is called for detecting automatically a separator

covar

When from="STS" or from="SPS", additional column names to be included as covariates in the output data frame. When to="SRS", the covariates are replicated across the shifted replicated rows. Default is NULL.

SPS.in

List with the xfix= and sdsep= specifications for the state-duration couples in input data in SPS form. The first specification, xfix, specifies the prefix/suffix character (use a two-character string if

SPS.out

List with the xfix and sdsep specifications for output in SPS format. (see argument SPS.in above.)

nr

Symbol used for the missing state in input "SPS" format, which will be converted to NA in the "STS" representation.

begin

When converting from SPELL, the column with the beginning position of the spell.

end

When converting from SPELL, the column with the end position of the spell.

status

When converting from SPELL, the column with the status.

process

Logical. When converting from SPELL, should sequences be created on a process time axis? Default is TRUE. Set to FALSE to create sequences on a calendar time axis.

pdata

When converting from SPELL and process=TRUE, either NULL, "auto" or the name of the data frame containing the individual 'birth' time, that is, the initial time from which the process time will be comput

pvar

When pdata is a data frame, a vector of two names or numbers, the first one specifying the column with the individual 'id', and the second one the 'birth' time.

limit

When converting from SPELL, the size of the resulting data frame when creating age sequences. With the default value of 100, sequences range from age 1 to age 100.

overwrite

When converting from SPELL, if overwrite is set to TRUE, the most recent episode overwrites the older one when they overlap each other. If set to FALSE, the most recent episode starts in case of overl

fillblanks

When converting from SPELL, if fillblanks is not NULL, gaps between episodes are filled with the fillblanks character value.

tmin

Integer. When converting from SPELL with process=FALSE, defines the starting time of the axis. If NULL, the minimum time is taken from the begin column in the data.

tmax

Integer. When converting from SPELL with process=FALSE, defines the ending time. If set as NULL, the value is guessed from the data (not so accurately!).
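The SPELL-related arguments above (id, begin, end, status, tmin, tmax, overwrite) can be illustrated with a small base-R sketch of a spell-to-STS expansion on a calendar time axis. This is not the code used by seqformat; the data frame and its column names are hypothetical, and "most recent" is taken here to mean simply "later in row order".

```r
# Base-R sketch only; this is NOT TraMineR's implementation.
# Hypothetical SPELL data: one row per spell.
spells <- data.frame(
  id     = c(1, 1, 2),
  begin  = c(1, 4, 2),
  end    = c(3, 5, 4),
  status = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

tmin <- min(spells$begin)   # start of the calendar axis
tmax <- max(spells$end)     # end of the calendar axis
ids  <- unique(spells$id)

# One row per sequence, one column per time position; NA where no spell applies.
sts <- matrix(NA_character_, nrow = length(ids), ncol = tmax - tmin + 1,
              dimnames = list(ids, tmin:tmax))

# Spells are processed in row order, so a later spell overwrites an earlier
# one where they overlap (analogous in spirit to overwrite=TRUE).
for (k in seq_len(nrow(spells))) {
  cols <- (spells$begin[k]:spells$end[k]) - tmin + 1
  sts[as.character(spells$id[k]), cols] <- spells$status[k]
}

print(sts)
```

Positions not covered by any spell stay NA in this sketch; replacing them with a fixed value would mimic what fillblanks is described as doing.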

Value

A data frame

Details

The seqformat function converts data from one format to another. The input data is first converted into the STS format and then converted to the output format. Depending on the input and output formats, some information can be lost in the conversion process. The output is a matrix, not a sequence object: to plot and mine the sequences with TraMineR functions, create a sequence object with the seqdef function. See Gabadinho et al. (2009) and Ritschard et al. (2009) for more details on longitudinal data formats and converting between them.
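To make the formats concrete, the following base-R sketch shows how one STS sequence maps to DSS and SPS representations using run-length encoding. It is an illustration only, not TraMineR's implementation; the "-" separator used for the compressed strings is an assumption made for this example.

```r
# Base-R sketch only; this is NOT TraMineR's implementation.
# One sequence in STS (state-sequence) form: one state per time position.
sts <- c("A", "A", "A", "B", "B", "A")

# Run-length encoding collapses each run of identical states into a spell.
runs <- rle(sts)

# DSS: the distinct successive states, without durations.
dss <- runs$values

# SPS: state-duration couples, here written as with xfix="()" and sdsep=",".
sps <- paste0("(", runs$values, ",", runs$lengths, ")")

# Compressed forms: a single character string per sequence
# ("-" as element separator is an assumption of this sketch).
dss_compressed <- paste(dss, collapse = "-")
sps_compressed <- paste(sps, collapse = "-")

print(dss_compressed)
print(sps_compressed)
```

The DSS string no longer carries the spell durations, which is one instance of the information loss mentioned above: the original STS sequence can be rebuilt from the SPS form but not from the DSS form.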

References

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Ritschard, G., A. Gabadinho, M. Studer and N. S. Müller (2009). Converting between various sequence representations. In Ras, Z. & Dardzinska, A. (eds.) Advances in Data Management, Springer, 223, 155-175.

Examples

## Converting sequences into SPS format
data(actcal)
actcal.SPS.A <- seqformat(actcal,13:24, from="STS", to="SPS")
head(actcal.SPS.A)

## SPS (compressed) format with no prefix/suffix and "/" as state/duration separator
actcal.SPS.B <- seqformat(actcal,13:24,
	from="STS", to="SPS", compressed=TRUE,
	SPS.out=list(xfix="", sdsep="/"))
head(actcal.SPS.B)

## Converting sequences into DSS (compressed) format
actcal.DSS <- seqformat(actcal,13:24,
	from="STS", to="DSS", compressed=TRUE)
head(actcal.DSS)
