Principal component analysis (PCA) is the tool of choice for summarising multivariate and high-dimensional data as features in a lower-dimensional space. PCA works well for Gaussian data, but may perform less well for high-dimensional, skewed or heavy-tailed data, or data with outliers, as encountered in practice. The growing availability of such complex data has accentuated these shortcomings and increased the demand for PC approaches that perform well for such data. The purpose of this paper is to critically appraise a class of interpretable PC candidates which can respond to this demand and to compare their performance to that of standard PCA. Among the large variety of nonlinear PCA approaches, we concentrate on the subclass based on spherical covariance matrices. This subclass includes the spatial sign, spatial rank, and Kendall’s τ covariance matrices. We focus on three key aspects: population concepts and their properties; sample-based estimators; and actual practice based on the analysis of real and simulated data. At the population level we consider relationships between the standard covariance matrix and spherical covariance matrices. For the random sample we consider natural estimators of the population eigenvectors, look at appropriate distributional models, highlight relationships between different estimators, and relate properties of the estimators to those of their population analogues. We complement the theory with new analyses of multivariate and high-dimensional real data, as well as simulated data from diverse distributions, which elucidate the behaviour of spherical PCA for elliptic and non-elliptic distributions. The latter are not captured in the theoretical framework, and their inclusion therefore offers fresh insight into the performance of spherical PCA. Together, the theory and the new analyses show that PCA of rank-based covariances clearly outperforms PCA based on the potentially unstable spatial sign covariance matrix. Further, the overall good performance of rank-based PCA, and its superior properties for data on which the sample covariance matrix is known to perform poorly, make rank-based PCA not only a desirable addition to standard PCA but also a serious competitor for dimension reduction and feature selection, while retaining the features valued in PCA.
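As a complement to the abstract, the following minimal sketch illustrates how the spherical covariance matrices it names (spatial sign, spatial rank, and the multivariate Kendall’s τ matrix) can be formed from a data matrix and used for spherical PCA. This is not the authors’ implementation: the componentwise median used as a location estimate, the function names, and the simulated heavy-tailed example are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): spherical PCA via the spatial sign,
# spatial rank, and multivariate Kendall's tau covariance matrices.
import numpy as np


def spatial_signs(X, center=None):
    """Map each centred observation to the unit sphere (zero rows stay zero)."""
    if center is None:
        # Componentwise median as a simple stand-in location (assumption);
        # a spatial (geometric) median could be used instead.
        center = np.median(X, axis=0)
    D = X - center
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return D / norms


def sign_covariance(X):
    """Spatial sign covariance matrix: average outer product of spatial signs."""
    S = spatial_signs(X)
    return S.T @ S / len(X)


def rank_covariance(X):
    """Spatial rank covariance matrix: average outer product of spatial rank
    vectors, where the rank of x_i is the mean spatial sign of (x_i - x_j)."""
    n = X.shape[0]
    ranks = np.zeros_like(X, dtype=float)
    for i in range(n):
        D = X[i] - X                      # differences to all observations
        norms = np.linalg.norm(D, axis=1, keepdims=True)
        norms[norms == 0] = 1.0           # the i = j difference contributes zero
        ranks[i] = (D / norms).mean(axis=0)
    return ranks.T @ ranks / n


def kendall_tau_covariance(X):
    """Multivariate Kendall's tau matrix: average outer product of the spatial
    signs of pairwise differences x_i - x_j, i < j."""
    n, p = X.shape
    T = np.zeros((p, p))
    for i in range(n - 1):
        D = X[i] - X[i + 1:]
        norms = np.linalg.norm(D, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        S = D / norms
        T += S.T @ S
    return 2.0 * T / (n * (n - 1))


def spherical_pca(X, cov_fn=rank_covariance, k=2):
    """Top-k eigenvalues and eigenvectors of a chosen spherical covariance matrix."""
    C = cov_fn(X)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]
    return evals[order][:k], evecs[:, order][:, :k]


# Example on simulated heavy-tailed data (illustrative only):
rng = np.random.default_rng(0)
X = rng.standard_t(df=2, size=(200, 5))
_, V_rank = spherical_pca(X, rank_covariance, k=2)   # rank-based PC directions
_, V_sign = spherical_pca(X, sign_covariance, k=2)   # sign-based PC directions
```

In this sketch the eigenvectors of the spherical covariance matrices estimate the same population eigenvector directions as standard PCA under elliptical models, which is what allows the rank-based and sign-based variants to be compared directly with PCA of the sample covariance matrix.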
Published in | American Journal of Theoretical and Applied Statistics (Volume 11, Issue 4) |
DOI | 10.11648/j.ajtas.20221104.13 |
Page(s) | 122-139 |
Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright | Copyright © The Author(s), 2022. Published by Science Publishing Group |
Keywords | Multivariate Ranks, Multivariate Spatial Signs, Nonlinear Covariance Matrices, Performance of Nonlinear PCA, Spherical PCA |
APA Style
Inge Koch, Lyron Winderbaum, Kanta Naito. (2022). Principal Component Analysis of Standard and Spherical Covariances from the Population and Random Samples to Real and Simulated Data. American Journal of Theoretical and Applied Statistics, 11(4), 122-139. https://doi.org/10.11648/j.ajtas.20221104.13