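Two-Stage Stepwise Variable Selection Based on Random Forests

FENG Panfeng, WEN Yongxian

  1. College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002
  • Online: 2018-01-25 Published: 2018-03-06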

FENG Panfeng, WEN Yongxian. Two-Stage Stepwise Variable Selection Based on Random Forests[J]. Journal of Systems Science and Mathematical Sciences, 2018, 38(1): 119-130.

Variable selection is particularly important in high-dimensional data analysis, and measuring variable importance is a key problem. This paper proposes a two-stage stepwise variable selection algorithm based on random forests (abbreviated as TSRF). The first stage introduces an improved variable importance ranking method that aims to better separate important variables from noise variables. The second stage performs stepwise variable selection based on random forests. The feasibility and effectiveness of the method are verified by Monte Carlo simulations. In an empirical study, TSRF is applied to quantitative trait locus (QTL) mapping of grains-per-panicle data in rice, and its results are compared with those of SCAD penalized regression, Elastic Net regression, and the conventional QTL mapping software WinQTLcart2.5. The results show that TSRF can effectively select variables.
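
The abstract does not spell out the improved importance measure or the exact stepwise rule, so the following is only a rough sketch of the general two-stage idea, not the authors' TSRF algorithm. It assumes scikit-learn and NumPy are available. It swaps in a generic stage 1 (impurity-based importances averaged over several forests) and a generic stage 2 (forward selection that keeps a variable only if the out-of-bag R^2 improves). The function names, the `min_gain` threshold, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rank_by_importance(X, y, n_repeats=5, seed=0):
    """Stage 1 (generic stand-in): rank variables by impurity-based
    importance summed over several independently seeded forests."""
    importances = np.zeros(X.shape[1])
    for r in range(n_repeats):
        forest = RandomForestRegressor(n_estimators=200, random_state=seed + r)
        forest.fit(X, y)
        importances += forest.feature_importances_
    return np.argsort(importances)[::-1]  # most important first


def oob_r2(X, y, columns, seed=0):
    """Out-of-bag R^2 of a forest trained on the given columns only."""
    forest = RandomForestRegressor(
        n_estimators=200, oob_score=True, random_state=seed
    )
    forest.fit(X[:, columns], y)
    return forest.oob_score_


def forward_stepwise(X, y, ranking, min_gain=0.01, seed=0):
    """Stage 2 (generic stand-in): walk down the ranking and keep a
    variable only if it raises the out-of-bag R^2 by more than min_gain."""
    selected, best = [], -np.inf
    for j in ranking:
        candidate = selected + [int(j)]
        score = oob_r2(X, y, candidate, seed)
        if score - best > min_gain:
            selected, best = candidate, score
    return selected


# Synthetic example (assumption): variables 0 and 1 carry signal,
# the remaining eight are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ranking = rank_by_importance(X, y)
selected = forward_stepwise(X, y, ranking)
print("ranking:", ranking.tolist())
print("selected:", selected)
```

Averaging importances over several forests is one simple way to stabilize the ranking before the stepwise stage. The out-of-bag score is used here because it gives an error estimate without a separate validation split.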
