jackstraw: Statistical Inference for Unsupervised Learning

Description

Test for association between the observed data and their systematic patterns of variation, which are often extracted by unsupervised learning. Systematic patterns may be captured by latent variables using principal component analysis (PCA), factor analysis (FA), and related methods. This allows one to, for example, obtain principal components (PCs) and conduct rigorous statistical testing for association between observed variables and PCs. Similarly, unsupervised clustering algorithms, such as K-means clustering and partitioning around medoids (PAM), find subpopulations among the observed variables. The jackstraw test can estimate the statistical significance of cluster membership, so that one can evaluate the strength of membership assignments. This package also includes several related methods to support statistical inference and probabilistic feature selection for unsupervised learning.

Details

The jackstraw package provides a resampling strategy and testing scheme to estimate the statistical significance of association between the observed data and their latent variables. Depending on the data type and the analysis aim, the latent variables may be estimated by principal component analysis, K-means clustering, and related algorithms. The jackstraw methods learn the over-fitting characteristics inherent in this circular analysis, where the observed data are used first to estimate the latent variables and then again to test for association with those estimates.
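
The resampling idea can be sketched in base R for the simplest case of a single principal component. This is a simplified illustration under stated assumptions, not the package's implementation: the simulated data, the helper names top_pc and f_stat, and the values of s and B are all hypothetical choices made for the example.

```r
set.seed(1)

# Hypothetical simulated data: m features (rows) by n samples (columns).
# The first 20 features depend on one latent variable; the rest are pure noise.
m <- 200
n <- 20
latent <- rnorm(n)
effect <- c(rep(2, 20), rep(0, m - 20))
dat <- effect %*% t(latent) + matrix(rnorm(m * n), nrow = m)
dat <- dat - rowMeans(dat)  # center each feature

# First principal component across samples (top right singular vector).
top_pc <- function(x) svd(x, nu = 0, nv = 1)$v[, 1]

# F-statistic from regressing each row of x on one PC plus an intercept.
f_stat <- function(x, pc) {
  x <- x - rowMeans(x)
  pc <- pc - mean(pc)
  ss_fit <- as.vector(x %*% pc)^2 / sum(pc^2)
  ss_res <- rowSums(x^2) - ss_fit
  ss_fit / (ss_res / (ncol(x) - 2))
}

# Observed statistics: every feature tested against the PC estimated from all data.
obs <- f_stat(dat, top_pc(dat))

# Null statistics: permute s rows, re-estimate the PC, and test only those rows.
# The permuted rows are truly null, yet they still took part in estimating the PC,
# so their statistics carry the same over-fitting as the observed ones.
s <- 10
B <- 100
null_stats <- unlist(lapply(seq_len(B), function(b) {
  idx <- sample(m, s)
  jack <- dat
  jack[idx, ] <- t(apply(dat[idx, , drop = FALSE], 1, sample))
  f_stat(jack[idx, , drop = FALSE], top_pc(jack))
}))

# Empirical p-value: share of null statistics at least as large as the observed one.
p <- vapply(obs, function(f) mean(null_stats >= f), numeric(1))
```

Keeping s small relative to m leaves the estimated PC almost unchanged in each resample, while B controls how many null statistics are accumulated and therefore how finely the p-values are resolved.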

The jackstraw tests enable us to identify the data features (i.e., variables or observations) that are driving systematic variation, in a completely unsupervised manner. Using jackstraw_pca, we can find features that are significantly associated with the top r principal components. Alternatively, jackstraw_kmeans can identify the data features that are statistically significant members of the data-dependent clusters. Furthermore, this package includes more general algorithms, such as jackstraw_subspace for dimension reduction techniques and jackstraw_cluster for clustering algorithms.
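
A minimal usage sketch for the two main entry points follows. It assumes the jackstraw package is installed; the simulated data and the values of s and B are illustrative assumptions, and the output element names (p.value for jackstraw_pca, p.F for jackstraw_kmeans) are taken to match the package's help pages for those functions.

```r
library(jackstraw)
set.seed(1234)

# Hypothetical simulated data: 1000 features by 20 samples, where the first
# 100 features load on a single latent variable (50 positively, 50 negatively).
effect <- c(rep(1, 50), rep(-1, 50), rep(0, 900))
latent <- rnorm(20)
dat <- effect %*% t(latent) + matrix(rnorm(1000 * 20), nrow = 1000)

# Test each feature for association with the top r = 1 principal component.
# s and B trade speed for p-value resolution; the values here are illustrative.
pca_out <- jackstraw_pca(dat, r = 1, s = 10, B = 100, verbose = FALSE)
head(pca_out$p.value)

# Test each feature for membership in its assigned K-means cluster.
km <- kmeans(dat, centers = 2, nstart = 10, iter.max = 100)
km_out <- jackstraw_kmeans(dat, km, s = 10, B = 100, verbose = FALSE)
head(km_out$p.F)
```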

Overall, each jackstraw method computes m p-values of association between the m data features and their corresponding latent variables. From these m p-values, pip computes posterior inclusion probabilities, which are useful for feature selection and visualization.
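
The step from p-values to posterior inclusion probabilities might look like the sketch below. It assumes the jackstraw package is installed and that pip accepts a vector of p-values as its first argument; the simulated data and the choice of ranking the features afterwards are illustrative assumptions.

```r
library(jackstraw)
set.seed(1234)

# Hypothetical simulated data: 1000 features by 20 samples, 100 of them non-null.
effect <- c(rep(1, 50), rep(-1, 50), rep(0, 900))
latent <- rnorm(20)
dat <- effect %*% t(latent) + matrix(rnorm(1000 * 20), nrow = 1000)

# One p-value per feature from the jackstraw test for the first PC.
out <- jackstraw_pca(dat, r = 1, s = 10, B = 100, verbose = FALSE)

# Convert the m p-values into m posterior inclusion probabilities.
probs <- pip(out$p.value)

# Rank features by posterior inclusion probability, highest first.
ranked <- order(probs, decreasing = TRUE)
head(ranked)
```

Ranking or thresholding the resulting probabilities is one simple way to select features; the appropriate cutoff depends on the analysis.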

References

Chung and Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 http://bioinformatics.oxfordjournals.org/content/31/4/545

Chung (2018) Statistical significance for cluster membership. bioRxiv, doi:10.1101/248633 https://www.biorxiv.org/content/early/2018/01/16/248633
