When confronted with a large number $p$ of variables measuring
different aspects of the same theme, the practitioner may wish to
summarize the information into a limited number $q$ of components. A
component is a linear combination of the original variables, and
the weights in this linear combination are called the loadings.
Thus, a system of components is defined by a $p \times q$
matrix of loadings. Among all systems of components, principal components (PCs) are
optimal in many ways. In particular, the first few PCs extract a
maximum of the variability of the original variables and they are
uncorrelated, so that the extracted information is organized in an
optimal way: each PC may be examined on its own, separately,
without taking the others into account.
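These properties can be illustrated with a short Python sketch (independent of the sca package; the correlation matrix S below is a made-up example): the PCs are the eigenvectors of the correlation matrix, ordered by decreasing eigenvalue, and they are mutually uncorrelated.

```python
import numpy as np

# Toy correlation matrix S for p = 4 variables (hypothetical values).
S = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])

# PCs are the eigenvectors of S, ordered by decreasing eigenvalue.
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
eigval, loadings = eigval[order], eigvec[:, order]

# The k-th PC extracts a fraction eigval[k] / p of the total variability,
# and distinct PCs are uncorrelated: loadings.T @ S @ loadings is diagonal.
cov_of_pcs = loadings.T @ S @ loadings
```

The diagonality of `cov_of_pcs` is exactly the statement that the extracted pieces of information do not overlap.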
Unfortunately, PCs are often difficult to interpret. The goal of Simple
Component Analysis is to replace (or supplement) the optimal but
non-interpretable PCs by suboptimal but interpretable simple
components. The proposal of Rousson and Gasser (2003) is to look for
an optimal system of components, but only among the simple ones,
according to some definition of optimality and simplicity. The outcome
of their method is a simple matrix of loadings calculated from the
correlation matrix $S$ of the original variables.
Simplicity is no guarantee of interpretability (though it helps in
this regard). Thus, the user may wish to partly modify an optimal
system of simple components in order to enhance
interpretability. While PCs are by definition 100% optimal, the
optimal system of simple components proposed by the procedure sca
may be, say, 95% optimal, whereas the simple system altered by the
user may be, say, 93% optimal. It is ultimately up to the user to decide
whether the gain in interpretability is worth the loss of optimality.
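To make such percentages concrete, the sketch below scores a hypothetical simple system of two block-components against the first two PCs, using the total variability extracted by unit-norm components as the criterion. This is a simplification for illustration, not necessarily the exact optimality criterion of Rousson and Gasser (2003):

```python
import numpy as np

S = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])

# Hypothetical simple system: block-components on variables {1,2} and {3,4}.
B = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])
B /= np.linalg.norm(B, axis=0)           # unit-norm loading columns

extracted = np.trace(B.T @ S @ B)        # variability picked up by the simple system
top2 = np.sort(np.linalg.eigvalsh(S))[-2:].sum()  # what the first two PCs extract
pct_optimal = 100 * extracted / top2
```

Since the top eigenvectors maximize extracted variability, `pct_optimal` can never exceed 100%; for a well-chosen simple system it typically stays close to it.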
The interactive procedure sca is intended to assist the user in
his/her choice of an interpretable system of simple components. The
algorithm consists of three distinct stages and proceeds in an
iterative way. At each step of the procedure, a simple matrix of
loadings is displayed in a window. The user may alter this matrix by
clicking on its entries, following the instructions given there. If
all the loadings of a component share the same sign, it is a
``block-component''. If some loadings are positive and some loadings
are negative, it is a ``difference-component''. Block-components are
arguably easier to interpret than
difference-components. Unfortunately, PCs almost always contain only
one block-component. In the procedure sca, the user may choose the
number of block-components in the system, the rationale being to take
as many block-components as possible while keeping the correlations
among them below some cut-off value (typically 0.3 or 0.4).
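The distinction between the two kinds of components can be stated in a single test on the signs of the loadings. The helper below is a hypothetical Python illustration, not a function of the sca package:

```python
import numpy as np

def component_type(loadings):
    """Classify a column of loadings: 'block' if all nonzero loadings
    share the same sign, 'difference' otherwise."""
    nz = loadings[loadings != 0]
    return "block" if (np.all(nz > 0) or np.all(nz < 0)) else "difference"

component_type(np.array([1., 1., 0., 1.]))   # 'block'
component_type(np.array([1., -1., 0., 0.]))  # 'difference'
```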
Simple block-components should define a partition of the original
variables. Such a partition is obtained in the first stage of the
procedure sca, using an agglomerative hierarchical clustering algorithm.
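This first stage can be mimicked outside sca with standard tools. The sketch below clusters variables by average-linkage hierarchical clustering on the dissimilarity 1 - correlation; both the dissimilarity and the linkage rule are illustrative choices, not necessarily those used internally by sca:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy correlation matrix: variables {1,2} and {3,4} form two groups.
S = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])

# Dissimilarity between variables: 1 - correlation, in condensed form.
D = squareform(1.0 - S, checks=False)
Z = linkage(D, method="average")

# Cut the tree into two clusters; each cluster becomes a block-component.
labels = fcluster(Z, t=2, criterion="maxclust")
```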
The second stage of the procedure sca consists of the definition of
simple difference-components. These are obtained as simplified
versions of appropriate ``residual components''. The idea is to
retain the large loadings (in absolute value) of these residual
components and to shrink the small ones to zero. For each
difference-component, the interactive procedure sca displays the
loadings of the corresponding residual component (at the right side of
the window), so that the user can see which variables are
especially important for the definition of this component.
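The retain-large, shrink-small step can be sketched as follows; the threshold rule (keep loadings above a fraction of the largest one) is a hypothetical example, not the exact rule implemented in sca:

```python
import numpy as np

def simplify(residual, cutoff=0.5):
    """Keep loadings whose absolute value is at least `cutoff` times
    the largest one, shrink the rest to zero, and renormalize
    (illustrative rule, not the one used by sca)."""
    keep = np.abs(residual) >= cutoff * np.abs(residual).max()
    simple = np.where(keep, np.sign(residual), 0.0)
    return simple / np.linalg.norm(simple)

# Residual component dominated by the first two variables:
simplify(np.array([0.70, -0.65, 0.10, -0.05]))
```

The surviving loadings are reduced to plus or minus one before renormalization, which is what makes the resulting difference-component simple.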
At the third stage of the interactive procedure sca, it is possible
to remove some of the difference-components from the system.
For many examples, it is possible to find a simple system which is 90%
or 95% optimal, and where correlations between components are below 0.3
or 0.4. When the structure of the correlation matrix is complicated, it
may be advantageous to invert the sign of some of the variables in
order to avoid negative correlations as much as possible. This can be
done using the option `invertsigns=TRUE'.
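A simple sign-flipping heuristic can be sketched in Python as follows; the decision rule here (flip a variable whose off-diagonal correlations sum to a negative value) is an illustrative choice and not necessarily the rule behind invertsigns=TRUE:

```python
import numpy as np

def invert_signs(S):
    """Flip variables whose off-diagonal correlations sum to a negative
    value (illustrative heuristic). Flipping variable j negates row j
    and column j of S."""
    signs = np.ones(len(S))
    for j in range(len(S)):
        if np.delete(S[j], j).sum() < 0:
            signs[j] = -1.0
    return np.outer(signs, signs) * S, signs

# A matrix where variable 3 correlates negatively with the others:
S = np.array([[ 1.0,  0.7, -0.6],
              [ 0.7,  1.0, -0.5],
              [-0.6, -0.5,  1.0]])
S_flipped, signs = invert_signs(S)
```

After the flip, all off-diagonal correlations are positive, which makes block-components easier to find.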
In principle, simple components can be calculated from a correlation
matrix or from a variance-covariance matrix. However, the definition
of simplicity used is not well adapted to the latter case, so it
will result in systems which are far from 100%
optimal. It is therefore advised to define simple components from a
correlation matrix, not from a variance-covariance matrix.
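If only a variance-covariance matrix is available, it can be rescaled to a correlation matrix beforehand. A short Python sketch of that standard conversion (the matrix C is a made-up example):

```python
import numpy as np

# Hypothetical 2x2 variance-covariance matrix.
C = np.array([[4.0, 1.2],
              [1.2, 9.0]])

# Correlation matrix: divide each entry by the product of the two
# standard deviations, i.e. S[i, j] = C[i, j] / (sd[i] * sd[j]).
sd = np.sqrt(np.diag(C))
S = C / np.outer(sd, sd)
```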