HDESND: High-Dimensional Estimation in Sparse and Non-sparse Data

In the new era of data science, advances in computing have made it possible to store and process massive, high-dimensional datasets. A dataset is massive when the number of observations (say, n), the number of variables (say, p), or both are huge. Two regimes arise: low-dimensional (n > p) and high-dimensional (p > n). Many traditional statistical methods break down in the high-dimensional regime; for example, ordinary least squares is not defined when p > n, because the design matrix cannot have full column rank. High-dimensional data appear in many fields, including, but not restricted to, genetics, brain imaging, the internet of things, pattern recognition, finance, and biochemical studies.

Many studies assume the high-dimensional model is sparse. Sparsity means that the true parameter/feature vector contains many zeros, that is, in statistical terms, many insignificant features. Nevertheless, much of the high-dimensional world remains unexplored, and although high-dimensional data are often assumed to be sparse, this assumption does not always hold. Understanding the structure (sparse or non-sparse) of high-dimensional data is therefore essential for analyzing and modeling such data, since different estimators and learners are appropriate for different structures. The main problem is that the structure of a given high-dimensional dataset is uncertain.

To address this problem, we propose shrinkage methods, which yield more stable estimates of the true population parameters, reduce sampling and non-sampling errors, and smooth spatial fluctuations. Shrinkage methods can also exploit additional sample or non-sample information to construct a new estimator that dominates the raw (primary) estimator in the sense of having a smaller mean prediction error. An essential property of shrinkage estimators is that if this additional information is inaccurate, the new estimator retains the properties of the original estimator and is guaranteed never to behave worse than the raw estimator. Our proposal develops shrinkage methods that blend the information contained in both sparse and non-sparse estimators to improve on the original estimators.

Briefly: since sparse regression is an umbrella term for any regression that penalizes large models and performs variable selection, applying variable selection to a high-dimensional dataset reduces the dimension, after which some traditional methods can be applied to the resulting lower-dimensional data. However, there is no guarantee that the sparsity assumption is valid. One unsolved problem, which forms part of our contribution, is the gap concerning non-sparse high-dimensional models: defining shrinkage estimators, and proposing estimators that avoid dimension reduction, under the high-dimensional regime. In a nutshell, the strategy is first to test whether the given data are sparse. Then, according to the outcome of this test, sparse or non-sparse shrinkage methods are employed for sparse or non-sparse data, respectively. Finally, we train a learner, or a method, that combines both sources of information to improve prediction.
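To make the strategy concrete, below is a minimal sketch in Python, assuming scikit-learn is available. Its ingredients are deliberately simple placeholders, not the project's methods: a lasso fit as the sparse candidate, a ridge fit as the non-sparse candidate, the fraction of nonzero lasso coefficients as a crude sparsity diagnostic (not a formal test), and an error-weighted average standing in for the Stein-type shrinkage estimators to be developed.

    # Illustrative sketch of the proposed pipeline (placeholders only):
    # 1) fit a sparse (lasso) and a non-sparse (ridge) candidate,
    # 2) apply a crude sparsity diagnostic to the lasso fit,
    # 3) shrink one candidate toward the other with a data-driven weight.
    import numpy as np
    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n, p = 100, 500                       # high-dimensional regime: p > n
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:10] = 2.0                       # a sparse ground truth, for the demo only
    y = X @ beta + rng.standard_normal(n)

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    sparse_fit = LassoCV(cv=5).fit(X_tr, y_tr)                           # sparse candidate
    dense_fit = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, y_tr)   # non-sparse candidate

    # Crude sparsity diagnostic: fraction of coefficients the lasso keeps.
    support_frac = np.mean(sparse_fit.coef_ != 0)
    print(f"estimated support fraction: {support_frac:.2%}")

    # Placeholder combination: weight each candidate by the other's validation
    # error, so the combined predictor never relies fully on a possibly
    # misspecified structure.
    err_s = np.mean((y_val - sparse_fit.predict(X_val)) ** 2)
    err_d = np.mean((y_val - dense_fit.predict(X_val)) ** 2)
    w = err_d / (err_s + err_d)           # more weight on the better candidate
    y_hat = w * sparse_fit.predict(X_val) + (1 - w) * dense_fit.predict(X_val)
    print(f"MSE sparse={err_s:.3f}, dense={err_d:.3f}, "
          f"combined={np.mean((y_val - y_hat) ** 2):.3f}")

The combination weight w is only illustrative; the point of the sketch is the architecture (diagnose the structure, fit both candidates, shrink one toward the other), not the specific combination rule, which is precisely what the project aims to derive.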

Team: Filipe Marques, Carlos Coelho, Mina Norouzirad (postdoc), and Mohammad Arashi (Ferdowsi University of Mashhad (FUM), Iran)