
Analysis of Influence Factors and Prediction for Employee Turnover

WANG Guanpeng, QIN Shuangyan, CUI Hengjian

  1. School of Mathematical Sciences, Capital Normal University, Beijing 100048
  • Received: 2021-06-11 Revised: 2022-01-28 Published: 2022-07-29
  • Corresponding author: CUI Hengjian, Email: hjcui@bnu.edu.cn
  • Funding:
    Supported by the Key Program of the National Natural Science Foundation of China (12031016) and the National Natural Science Foundation of China (11971324, 11471223).

WANG Guanpeng, QIN Shuangyan, CUI Hengjian. Analysis of Influence Factors and Prediction for Employee Turnover[J]. Journal of Systems Science and Mathematical Sciences, 2022, 42(6): 1616-1632.

This article applies high-dimensional variable screening to analyze the factors that influence employee turnover and to predict whether an employee will leave. Two methods, the MV (mean of variance) method proposed by Cui et al. (2015) and the LASSO, are used to select the variables most closely related to employee turnover, and the selected variables are then entered into a classification model. To ensure the accuracy of the predictions, four machine learning models are used to predict turnover: support vector machine, random forest, XGBoost and the logistic model. Over 100 experiments, the random forest model combined with MV variable selection achieves the highest average classification accuracy, 95.43%, compared with the other 7 combinations of selection method and classifier. These results are checked in three ways: changing the ratio of training set to validation set, drawing 80% of the sample data, and adding random perturbations. In each case the random forest with MV selection still has the highest average classification accuracy, which indicates that this combined model is robust.
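The abstract does not spell out how MV screening scores a variable. As a rough illustration, the sketch below assumes the sample MV index of Cui, Li and Zhong (2015): for a feature X and class label Y, it averages over the sample points the class-weighted squared gap between the within-class empirical distribution function and the overall one. The function names `mv_index` and `mv_screen`, the synthetic data, and the cutoff `d` are hypothetical choices made for this example; they are not taken from the paper.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def mv_index(x, y):
    """Sample MV index of feature x given class labels y (assumed form):
    (1/n) * sum over classes r and points x_j of p_r * (F_r(x_j) - F(x_j))**2,
    where F is the overall empirical CDF and F_r the CDF within class r."""
    n = len(x)
    all_sorted = sorted(x)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    total = 0.0
    for values in groups.values():
        values.sort()
        n_r = len(values)
        p_r = n_r / n  # estimated class proportion
        for xj in x:
            f_all = bisect_right(all_sorted, xj) / n  # overall ECDF at x_j
            f_r = bisect_right(values, xj) / n_r      # class ECDF at x_j
            total += p_r * (f_r - f_all) ** 2
    return total / n


def mv_screen(columns, y, d):
    """Rank features by MV index (largest first) and keep the top d."""
    scores = [mv_index(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda k: scores[k], reverse=True)
    return ranked[:d], scores


# Synthetic demo: column 0 shifts with the label, columns 1 to 5 are noise.
rng = random.Random(0)
n = 300
y = [rng.randint(0, 1) for _ in range(n)]
informative = [rng.gauss(2.0 * yi, 1.0) for yi in y]
noise = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(5)]
columns = [informative] + noise

selected, scores = mv_screen(columns, y, d=2)
print("selected columns:", selected)
print("MV scores:", [round(s, 4) for s in scores])
```

In a pipeline like the one the abstract describes, the indices returned by `mv_screen` would pick the columns passed on to a classifier such as a random forest, and the split, fit and score steps would be repeated (100 times in the paper) to obtain an average accuracy.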

[1] Cui H, Li R, Zhong W. Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 2015, 110(510): 630-641.
[2] Akaike H. Information theory and an extension of the maximum likelihood principle. 2nd International Symposium on Information Theory, Springer, 1973.
[3] Schwarz G. Estimating the dimension of a model. Annals of Statistics, 1978, 6: 461-464.
[4] Fitzgerald M A, Bergman C J, Resurreccion A P, et al. Partial least squares regression: A tutorial. Analytica Chimica Acta, 1986, 186: 1-17.
[5] Anderson T W. An Introduction to Multivariate Statistical Analysis, 3rd Ed. Wiley Series in Probability and Mathematical Statistics, 2003.
[6] Hoerl A E, Kennard R W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970, 12: 55-67.
[7] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 1996, 58(1): 267-288.
[8] Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 2006, 101(476): 1418-1429.
[9] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 2001, 96(456): 1348-1360.
[10] Cui H, Zhong W. A distribution-free test of independence and its application to variable selection. Computational Statistics and Data Analysis, 2019, 139: 117-133.
[11] Dai C Q, Zhao L W, Cui H J. Mobile media gender marketing based on MV scanning and logistic regression. Statistics and Management, 2018, 6: 60-66. (in Chinese)
[12] Mao S S, Cheng Y M, Pu X L. Probability Theory and Mathematical Statistics Course (Second Edition). Beijing: Higher Education Press, 2011. (in Chinese)
[13] Huang D, Li R, Wang H. Feature screening for ultrahigh dimensional categorical data with applications. Journal of Business & Economic Statistics, 2014, 32(2): 237-244.
[14] Breiman L. Random forests. Machine Learning, 2001, 45: 5-32.
[15] Li Y, Hu K, Dong X Y, et al. Research on early warning of enterprise employee turnover based on SVM algorithm. China Business Review, 2020, 6: 20-22. (in Chinese)
[16] Chen T, Guestrin C. XGBoost: A scalable tree boosting system. The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[17] Eugene C. Introduction to Deep Learning. Cambridge, MA: MIT Press, 2015.