
Sub-Sampling Model Averaging Theory for Large Scale Data

ZONG Xianpeng (宗先鹏)1, WANG Tongtong (王彤彤)2

  1. Faculty of Science, Beijing University of Technology, Beijing 100124; 2. School of Mathematical Sciences, Capital Normal University, Beijing 100048
  • Online: 2021-12-28 Published: 2021-12-28

ZONG Xianpeng, WANG Tongtong. Sub-Sampling Model Averaging Theory for Large Scale Data[J]. Journal of Systems Science and Mathematical Sciences (系统科学与数学), 2022, 42(1): 109-132.

With the arrival of the information age, mining useful information from massive data quickly and effectively has become a new challenge. Sub-sampling, an effective tool for large-scale data analysis, has attracted extensive attention from researchers in China and abroad. However, traditional sub-sampling methods usually do not account for model uncertainty. When the assumed model is incorrect, subsequent statistical inference will be biased and may even lead to wrong conclusions. To address this problem, we use frequentist model averaging to construct a sub-sampling model averaging estimator (SSMA estimator) from the sampled data. Theoretically, we prove that the SSMA estimator is an asymptotically unbiased and consistent estimator of the model averaging estimator based on the full data. In addition, we propose a weight choice criterion for the SSMA estimator, based on the Mallows model averaging method of Hansen (2007), and prove the asymptotic optimality of the weight estimator both when the variance is known and when it is unknown. In the proofs of these theoretical properties, we account for the double randomness introduced by the model and by the sampling design. Finally, numerical analysis further illustrates the effectiveness of the proposed method.
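For orientation, the Mallows model averaging criterion of Hansen (2007), on which the weight choice criterion is based, can be written for the full data roughly as follows. This is a sketch of the standard full-data form under assumed notation that is not taken from the article: a response vector $y$ of length $n$, $M$ candidate linear models with projection matrices $P_m$ and $k_m$ regressors each, and error variance $\sigma^2$. The sub-sampled criterion that the article actually proposes is not reproduced here.

```latex
\hat{\mu}(w) = \sum_{m=1}^{M} w_m P_m y,
\qquad
C_n(w) = \bigl\| y - \hat{\mu}(w) \bigr\|^2 + 2\sigma^2 \sum_{m=1}^{M} w_m k_m,
\qquad
\hat{w} = \arg\min_{w \in \mathcal{H}} C_n(w),
\quad
\mathcal{H} = \Bigl\{ w \in [0,1]^M : \sum_{m=1}^{M} w_m = 1 \Bigr\}.
```

When $\sigma^2$ is unknown, it is commonly replaced by an estimate computed from a large candidate model, which is the sense in which a weight criterion can be studied under both known and unknown variance.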