seqformat: Conversion between sequence formats

Description

Convert a sequence data set from one format to another.

Usage

seqformat(data, var=NULL, id=NULL,
     from, to, compressed=FALSE,
     nrep=NULL, tevent, stsep=NULL, covar=NULL,
     SPS.in=list(xfix="()", sdsep=","),
     SPS.out=list(xfix="()", sdsep=","),
     begin=NULL, end=NULL, status=NULL,
     process=TRUE, pdata=NULL, pvar=NULL,
     limit=100, overwrite=TRUE,
     fillblanks=NULL, tmin=NULL, tmax=NULL)

Arguments

data

a data frame or matrix containing sequence data.

var

the list of columns containing the sequences. Default is NULL, i.e. all the columns. Whether the sequences are in the compressed (character strings) or extended format is automatically detected by counting the number of columns.

id

column containing the identification numbers for the sequences. When using SPELL format as input, this identification number is mandatory, in order to identify all spells belonging to each individual in the data set.

from

format of the original data. Available formats are: "STS", "SPS", "SPELL". If data is a sequence object, format is automatically set to "STS".

to

format of the output data. Available formats are: "STS", "SPS", "SRS", "DSS", "TSE".

compressed

if TRUE and output format is one of "STS", "SPS" or "DSS", the output sequences are compressed into character strings

nrep

number of previous states replicated, for the "SRS" format

tevent

when converting to time-stamped-event ("TSE") format, a matrix of size $d \times d$, where $d$ is the number of distinct states appearing in the sequences, must be given. In this matrix, the cell $(i,j)$ contains all events associated with a tran

stsep

the character used as separator in the original data if input format is a vector of character strings. If NULL (default value), the seqfcheck function is called for detecting automatically

covar

the list of columns containing associated covariates to be included in the output data frame. If to="SRS" is chosen, the covariates are replicated across each row. Default is NULL.

SPS.in

a list with the characters used as prefix/suffix and state/duration separator for each state duration couple if input data contains sequences in SPS format. Set the xfix element of the list to "" if there are no p

SPS.out

a list with the characters used as prefix/suffix and state/duration separator to be used for each state duration couple if output is in SPS format. Set the xfix element of the list to "" if there are no pre-suf-fi

begin

when converting from SPELL, the column with the beginning position of the spell

end

when converting from SPELL, the column with the end position of the spell

status

when converting from SPELL, the column with the status

process

If TRUE (default) when converting from SPELL, sequences are created on a process time axis. If set to FALSE, they are created on a calendar time axis.

pdata

when converting from SPELL and process=TRUE, either NULL, "auto" or the name of the data frame containing the individual 'birth' time, that is, the entering time from which the process time will be co

pvar

names or numbers of the columns containing the individual identification number and the 'birth' time in pdata.

limit

when converting from SPELL, size of the resulting data frame when creating age sequences (by default, the sequences run from age 1 to age 100)

overwrite

when converting from SPELL, if overwrite is set to TRUE, the most recent episode overwrites the older one if they overlap each other. If set to FALSE, the most recent episode starts from the end of th

fillblanks

when converting from SPELL, if fillblanks is not NULL, gaps between episodes are filled with the character given as argument.

tmin

when converting from SPELL, if sequences are to be defined on a calendar time axis, it defines the starting time of the axis. If set to NULL, the minimum time is taken from the 'begin' column in the data.

tmax

when converting from SPELL, if year sequences are wanted, defines the ending year of the data frame. If set to NULL, it is guessed from the data (the guess may be inaccurate).
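
To make the format names used by the from, to and compressed arguments more concrete, here is a minimal base R sketch of how one STS sequence maps to compressed SPS and DSS strings. It does not call seqformat and is not TraMineR's implementation. The toy sequence and the "-" separator between elements are assumptions made for illustration.

```r
## One sequence in STS form: one state per time unit (hypothetical data)
sts <- c("A", "A", "B", "B", "B", "C")

## Run-length encoding yields the distinct successive states and their durations
runs <- rle(sts)

## Compressed STS: all states joined into one character string
## ("-" as separator is an assumption of this sketch)
sts.compressed <- paste(sts, collapse = "-")

## SPS: one (state,duration) couple per spell, using "()" as prefix/suffix
## and "," as state/duration separator, as in the SPS.out default
sps.compressed <- paste0("(", runs$values, ",", runs$lengths, ")",
                         collapse = "-")

## DSS: distinct successive states only, durations dropped
dss.compressed <- paste(runs$values, collapse = "-")

print(sts.compressed)
print(sps.compressed)
print(dss.compressed)
```

The DSS string keeps only the order of the states, which is one way information can be lost when converting between formats.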

Value

a data frame

Details

The seqformat function is used to convert data from one format to another. The input data is first converted into the STS format and then converted to the output format. Depending on input and output formats, some information can be lost in the conversion process. The output is a matrix, NOT a sequence object to be passed to TraMineR functions for plotting and mining sequences (use the seqdef function for that). See Gabadinho et al. (2009) and Ritschard et al. (2009) for more details on longitudinal data formats and converting between them.
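
The difference between converted data and a sequence object can be sketched as follows. This assumes the TraMineR package is installed and uses the actcal data set from the Examples section. The class name "stslist" checked here is an assumption about how TraMineR labels the objects returned by seqdef.

```r
library(TraMineR)
data(actcal)

## Converted data: plain tabular output, not a sequence object
actcal.sps <- seqformat(actcal, 13:24, from = "STS", to = "SPS")

## Sequence object built from the same columns with seqdef
actcal.seq <- seqdef(actcal, 13:24)

## Only the seqdef result is expected to carry the sequence class
## ("stslist" is an assumption, see the note above)
inherits(actcal.sps, "stslist")
inherits(actcal.seq, "stslist")
```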

References

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Ritschard, G., A. Gabadinho, M. Studer and N. S. Müller (2009). Converting between various sequence representations. In Ras, Z. & Dardzinska, A. (eds.), Advances in Data Management, Springer, 223, 155-175.

Examples

## Load the package that provides seqformat and the actcal data set
library(TraMineR)

## Converting sequences into SPS format
data(actcal)
actcal.SPS.A <- seqformat(actcal, 13:24, from="STS", to="SPS")
head(actcal.SPS.A)

## SPS (compressed) format with no prefix/suffix and "/" as state/duration separator
actcal.SPS.B <- seqformat(actcal,13:24,
	from="STS", to="SPS", compressed=TRUE,
	SPS.out=list(xfix="", sdsep="/"))
head(actcal.SPS.B)

## Converting sequences into DSS (compressed) format
actcal.DSS <- seqformat(actcal,13:24,
	from="STS", to="DSS", compressed=TRUE)
head(actcal.DSS)
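
The examples above all start from STS data. As a complement, the following sketch converts a small SPELL data set to STS on a calendar time axis, assuming TraMineR is installed. The spells data frame and its column names are hypothetical and made up for illustration; the arguments used are those listed in the Usage section, and the exact layout of the result may differ between package versions.

```r
library(TraMineR)

## Hypothetical spell data: one row per episode (not shipped with the package)
spells <- data.frame(
  id     = c(1, 1, 2),
  begin  = c(2000, 2003, 2001),
  end    = c(2002, 2005, 2005),
  status = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

## process=FALSE requests a calendar time axis; tmin and tmax are left
## NULL so that the axis limits are taken from the data
spells.STS <- seqformat(spells, from = "SPELL", to = "STS",
                        id = "id", begin = "begin", end = "end",
                        status = "status", process = FALSE)
spells.STS
```

Each individual identified by id should end up as one row of the result, which is why the id column is mandatory for SPELL input.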
