When confronted with a large number $p$ of variables measuring
different aspects of the same theme, the practitioner may wish to
summarize the information into a limited number $q$ of components. A
component is a linear combination of the original variables, and
the weights in this linear combination are called the loadings.
Thus, a system of components is defined by a $p \times q$
matrix of loadings. Among all systems of components, principal components (PCs) are
optimal in many ways. In particular, the first few PCs extract a
maximum of the variability of the original variables and they are
uncorrelated, so that the extracted information is organized in an
optimal way: each PC may be examined on its own, separately,
without taking the others into account.
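These properties can be illustrated with a short Python sketch (independent of the sca package; the correlation matrix S below is a made-up example): the PCs are the eigenvectors of the correlation matrix, ordered by decreasing eigenvalue, and they are mutually uncorrelated.

```python
import numpy as np

# Toy correlation matrix S for p = 4 variables (hypothetical values).
S = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])

# PCs are the eigenvectors of S, ordered by decreasing eigenvalue.
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
eigval, loadings = eigval[order], eigvec[:, order]

# The k-th PC extracts a fraction eigval[k] / p of the total variability,
# and distinct PCs are uncorrelated: loadings.T @ S @ loadings is diagonal.
cov_of_pcs = loadings.T @ S @ loadings
```

The diagonality of `cov_of_pcs` is exactly the statement that the extracted pieces of information do not overlap.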
Unfortunately, PCs are often difficult to interpret. The goal of Simple
Component Analysis is to replace (or supplement) the optimal but
non-interpretable PCs by suboptimal but interpretable simple
components. The proposal of Rousson and Gasser (2003) is to look for
an optimal system of components, but only among the simple ones,
according to some definition of optimality and simplicity. The outcome
of their method is a simple matrix of loadings calculated from the
correlation matrix $S$ of the original variables.
Simplicity is no guarantee of interpretability (though it helps in
this regard). Thus, the user may wish to partly modify an optimal
system of simple components in order to enhance
interpretability. While PCs are by definition 100% optimal, the
optimal system of simple components proposed by the procedure sca
may be, say, 95% optimal, whereas the simple system altered by the
user may be, say, 93% optimal. It is ultimately up to the user to decide
whether the gain in interpretability is worth the loss of optimality.
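To make such percentages concrete, the sketch below scores a hypothetical simple system of two block-components against the first two PCs, using the total variability extracted by unit-norm components as the criterion. This is a simplification for illustration, not necessarily the exact optimality criterion of Rousson and Gasser (2003):

```python
import numpy as np

S = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])

# Hypothetical simple system: block-components on variables {1,2} and {3,4}.
B = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])
B /= np.linalg.norm(B, axis=0)           # unit-norm loading columns

extracted = np.trace(B.T @ S @ B)        # variability picked up by the simple system
top2 = np.sort(np.linalg.eigvalsh(S))[-2:].sum()  # what the first two PCs extract
pct_optimal = 100 * extracted / top2
```

Since the top eigenvectors maximize extracted variability, `pct_optimal` can never exceed 100%; for a well-chosen simple system it typically stays close to it.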
The interactive procedure sca is intended to assist the user in
his/her choice of an interpretable system of simple components. The
algorithm consists of three distinct stages and proceeds in an
iterative way. At each step of the procedure, a simple matrix of
loadings is displayed in a window. The user may alter this matrix by
clicking on its entries, following the instructions given there. If
all the loadings of a component share the same sign, it is a
``block-component''. If some loadings are positive and some loadings
are negative, it is a ``difference-component''. Block-components are
arguably easier to interpret than
difference-components. Unfortunately, PCs almost always contain only
one block-component. In the procedure sca, the user may choose the
number of block-components in the system, the rationale being to take
as many block-components as possible while keeping the correlations
among them below some cut-off value (typically 0.3 or 0.4).
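The distinction between the two kinds of components can be stated in a single test on the signs of the loadings. The helper below is a hypothetical Python illustration, not a function of the sca package:

```python
import numpy as np

def component_type(loadings):
    """Classify a column of loadings: 'block' if all nonzero loadings
    share the same sign, 'difference' otherwise."""
    nz = loadings[loadings != 0]
    return "block" if (np.all(nz > 0) or np.all(nz < 0)) else "difference"

component_type(np.array([1., 1., 0., 1.]))   # 'block'
component_type(np.array([1., -1., 0., 0.]))  # 'difference'
```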
Simple block-components should define a partition of the original
variables. Such a partition is obtained in the first stage of the
procedure sca, using an agglomerative hierarchical clustering algorithm.
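This first stage can be mimicked outside sca with standard tools. The sketch below clusters variables by average-linkage hierarchical clustering on the dissimilarity 1 - correlation; both the dissimilarity and the linkage rule are illustrative choices, not necessarily those used internally by sca:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy correlation matrix: variables {1,2} and {3,4} form two groups.
S = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])

# Dissimilarity between variables: 1 - correlation, in condensed form.
D = squareform(1.0 - S, checks=False)
Z = linkage(D, method="average")

# Cut the tree into two clusters; each cluster becomes a block-component.
labels = fcluster(Z, t=2, criterion="maxclust")
```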
The second stage of the procedure sca consists of the definition of
simple difference-components. These are obtained as simplified
versions of appropriate ``residual components''. The idea is to
retain the large loadings (in absolute value) of these residual
components and to shrink the small ones to zero. For each
difference-component, the interactive procedure sca displays the
loadings of the corresponding residual component (at the right side of
the window), so that the user can see which variables are
especially important for the definition of this component.
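The retain-large, shrink-small step can be sketched as follows; the threshold rule (keep loadings above a fraction of the largest one) is a hypothetical example, not the exact rule implemented in sca:

```python
import numpy as np

def simplify(residual, cutoff=0.5):
    """Keep loadings whose absolute value is at least `cutoff` times
    the largest one, shrink the rest to zero, and renormalize
    (illustrative rule, not the one used by sca)."""
    keep = np.abs(residual) >= cutoff * np.abs(residual).max()
    simple = np.where(keep, np.sign(residual), 0.0)
    return simple / np.linalg.norm(simple)

# Residual component dominated by the first two variables:
simplify(np.array([0.70, -0.65, 0.10, -0.05]))
```

The surviving loadings are reduced to plus or minus one before renormalization, which is what makes the resulting difference-component simple.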
At the third stage of the interactive procedure sca, it is possible
to remove some of the difference-components from the system.
For many examples, it is possible to find a simple system which is 90%
or 95% optimal, and where correlations between components are below 0.3
or 0.4. When the structure of the correlation matrix is complicated, it
may be advantageous to invert the sign of some of the variables in
order to avoid negative correlations as much as possible. This can be
done using the option `invertsigns=TRUE'.
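A simple sign-flipping heuristic can be sketched in Python as follows; the decision rule here (flip a variable whose off-diagonal correlations sum to a negative value) is an illustrative choice and not necessarily the rule behind invertsigns=TRUE:

```python
import numpy as np

def invert_signs(S):
    """Flip variables whose off-diagonal correlations sum to a negative
    value (illustrative heuristic). Flipping variable j negates row j
    and column j of S."""
    signs = np.ones(len(S))
    for j in range(len(S)):
        if np.delete(S[j], j).sum() < 0:
            signs[j] = -1.0
    return np.outer(signs, signs) * S, signs

# A matrix where variable 3 correlates negatively with the others:
S = np.array([[ 1.0,  0.7, -0.6],
              [ 0.7,  1.0, -0.5],
              [-0.6, -0.5,  1.0]])
S_flipped, signs = invert_signs(S)
```

After the flip, all off-diagonal correlations are positive, which makes block-components easier to find.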
In principle, simple components can be calculated from a correlation
matrix or from a variance-covariance matrix. However, the definition
of simplicity used is not well adapted to the latter case, so it
will result in systems which are far from 100%
optimal. It is therefore advised to define simple components from a
correlation matrix, not from a variance-covariance matrix.
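If only a variance-covariance matrix is available, it can be rescaled to a correlation matrix beforehand. A short Python sketch of that standard conversion (the matrix C is a made-up example):

```python
import numpy as np

# Hypothetical 2x2 variance-covariance matrix.
C = np.array([[4.0, 1.2],
              [1.2, 9.0]])

# Correlation matrix: divide each entry by the product of the two
# standard deviations, i.e. S[i, j] = C[i, j] / (sd[i] * sd[j]).
sd = np.sqrt(np.diag(C))
S = C / np.outer(sd, sd)
```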