
Two-Stage Stepwise Variable Selection Based on Random Forests
Variable selection is particularly important in high-dimensional data processing, and measuring variable importance is a key problem. In this paper, we propose a two-stage stepwise variable selection algorithm based on random forests (abbreviated as TSRF). The first stage introduces an improved variable importance ranking method that aims to increase the separation between important variables and noise variables. The second stage performs stepwise variable selection based on random forests. The effectiveness and feasibility of the method are verified by Monte Carlo simulations. In an empirical study of QTL mapping for grains-per-panicle data in rice, we compare TSRF with SCAD penalized regression, Elastic Net regression, and the conventional QTL mapping software WinQTLcart2.5. The results show that TSRF can select variables effectively.
Random forests / variable selection / variable importance / QTL mapping