dissrep: Extracting sets of representative objects using a dissimilarity matrix

Description

The function extracts a set of representative objects that exhibits the key features of the whole data set, the goal being to make the latter easy to interpret soundly. The user can set either the desired coverage level (the proportion of objects having a representative in their neighborhood) or the desired number of representatives.

Usage

dissrep(diss, criterion="density",
    score=NULL, decreasing=TRUE,
    trep=0.25, nrep=NULL, tsim=0.1, dmax=NULL, weights=NULL)

Arguments

diss

A dissimilarity matrix or a dist object (see dist)

criterion

the representativeness criterion for sorting the candidate list. One of "freq" (frequency), "density" (neighborhood density) or "dist" (centrality). An optional vector containing the scores for sorting the candidate objects can be supplied instead through the score argument.

score

an optional vector containing the representativeness scores used for sorting the objects in the candidate list. The length of the vector must be equal to the number of rows/columns in the distance matrix, i.e., the number of objects.

decreasing

logical. If a score vector is provided, indicates whether the objects in the candidate list must be sorted in decreasing (TRUE, the default) or ascending (FALSE) order of this score. The first object in the sorted candidate list is taken to be the most representative.

trep

controls the size of the representative set by setting the desired coverage level, i.e., the proportion of objects having a representative in their neighborhood. The neighborhood radius is defined by tsim.

nrep

number of representatives. If NULL (default), the trep argument is used to control the size of the representative set.

tsim

neighborhood radius as a percentage of the maximum (theoretical) distance dmax. Defaults to 0.1 (10%). Object $y$ is redundant to object $x$ when it is in the neighborhood of $x$, i.e., within a distance tsim*dmax from $x$.

dmax

maximum theoretical distance. Used to derive the neighborhood radius as tsim*dmax. If NULL, the value of dmax is derived from the dissimilarity matrix.

weights

vector of weights of length equal to the number of rows of the dissimilarity matrix. If NULL, equal weights are assigned.

Value

An object of class diss.rep. This is a vector containing the indexes of the representative objects with the following additional attributes:

Scores: a vector with the representativeness score of each object given the chosen criterion.
Distances: a matrix with the distance of each object to its nearest representative.
Statistics: a data frame with quality measures for each representative: number of objects attributed to the representative, number of objects in the representative's neighborhood, mean distance to the representative.
Quality: overall quality measure.

Print and summary methods are available.

Details

The representative set is obtained by a heuristic. Objects are first sorted by their representativeness score; representatives are then selected by successively extracting from this sorted candidate list those objects that are not redundant with already retained representatives. The selection stops when either the desired coverage or the requested number of representatives is reached.

Objects are sorted either by the values provided as the score argument, or according to one of the following values of the criterion argument: "freq" (frequency), "density" (neighborhood density), "dist" (centrality).

The frequency criterion uses the frequencies as representativeness scores. The frequency of an object in the data is computed as the number of other objects whose dissimilarity to it is equal to 0. The more frequent an object, the more representative it is supposed to be; hence, objects are sorted in decreasing frequency order. This criterion is in fact the neighborhood density criterion (see below) with the neighborhood radius set to 0.

The neighborhood density is the number of objects in the neighborhood of the object. This requires setting the neighborhood radius tsim. Objects are sorted in decreasing density order.

The centrality criterion is the sum of distances to all other objects. The smaller the sum, the more representative the object.

Use criterion="dist" and nrep=1 to get the medoid, and criterion="density" and nrep=1 to get the densest object pattern. For more details, see Gabadinho et al. (2011).
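
The selection procedure described above can be illustrated with a short base R sketch. This is not the TraMineR implementation: it ignores weights, covers only the density criterion, and the function name greedy_rep and the toy data are hypothetical, introduced purely for illustration.

```r
## Illustrative sketch only (NOT the TraMineR code): greedy selection of
## representatives sorted by neighborhood density, without weights.
## greedy_rep and the toy data below are hypothetical.
greedy_rep <- function(d, tsim = 0.1, trep = 0.25, dmax = max(d)) {
  radius <- tsim * dmax
  ## Density score: number of objects within the neighborhood radius
  density <- rowSums(d <= radius)
  ## Candidate list sorted by decreasing density
  candidates <- order(density, decreasing = TRUE)
  reps <- integer(0)
  for (i in candidates) {
    ## Skip candidates that are redundant with a retained representative
    if (length(reps) > 0 && any(d[i, reps] <= radius)) next
    reps <- c(reps, i)
    ## Coverage: share of objects with a representative in their neighborhood
    covered <- mean(apply(d[, reps, drop = FALSE] <= radius, 1, any))
    if (covered >= trep) break
  }
  reps
}

## Toy distance matrix from 20 random points in the plane
set.seed(1)
toy <- as.matrix(dist(matrix(rnorm(40), ncol = 2)))
toy.rep <- greedy_rep(toy, tsim = 0.2, trep = 0.5)
print(toy.rep)
```

Raising trep or lowering tsim in this sketch generally yields more representatives, which mirrors the roles the two arguments play in dissrep.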

References

Gabadinho A, Ritschard G, Studer M, Müller NS (2011). "Extracting and Rendering Representative Sequences", In A Fred, JLG Dietz, K Liu, J Filipe (eds.), Knowledge Discovery, Knowledge Engineering and Knowledge Management, volume 128 of Communications in Computer and Information Science (CCIS), pp. 94-106. Springer-Verlag.

Examples

## Defining a sequence object with the data in columns 10 to 25
## (family status from age 15 to 30) in the biofam data set
library(TraMineR)
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
    "Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)

## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)

## Representative set using the neighborhood density criterion
biofam.rep <- dissrep(biofam.om)
biofam.rep
summary(biofam.rep)
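
## Further calls sketched from the argument descriptions and the Details
## section; the object names are illustrative and the values chosen for
## trep and tsim are arbitrary.

## Medoid: the most central object (criterion="dist", nrep=1)
biofam.medoid <- dissrep(biofam.om, criterion="dist", nrep=1)
biofam.medoid

## Densest object pattern (criterion="density", nrep=1)
biofam.dens <- dissrep(biofam.om, criterion="density", nrep=1)
biofam.dens

## Higher coverage (50%) with a wider neighborhood radius (20% of dmax)
biofam.rep50 <- dissrep(biofam.om, trep=0.5, tsim=0.2)
summary(biofam.rep50)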
