
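Machine Learning Methods Investigate Liver Cancer Prediction Problem

HU Xuemei1,2, LI Jiali1,2, JIANG Huifeng3

  1. School of Mathematics and Statistics, Chongqing Technology and Business University, Chongqing 400067;
  2. Chongqing Key Laboratory of Social Economy and Applied Statistics, Chongqing 400067;
  3. Research Center for Economy of Upper Reaches of the Yangtze River, Chongqing 400067
  • Received: 2021-04-09 Revised: 2021-09-21 Online: 2022-02-25 Published: 2022-03-21
  • Corresponding author: LI Jiali, Email: lijiali7777@163.com.
  • Funding:
    Supported by the Fifth Batch of the Chongqing Program for Excellent Talents in Higher Education Institutions ("Predicting Stock Price Trend Movements Based on Classification Methods"), the General Project of Basic Research and Frontier Exploration of the Chongqing Science and Technology Commission (cstc.2018jcyjA2073), the Chongqing "Statistics" Graduate Supervisor Team (yds183002), the Chongqing Social Science Planning Project (2019WT59), the Open Project of the Chongqing Key Laboratory of Social Economy and Applied Statistics (KFJJ2018066), the Major Project of the Science and Technology Research Program of the Chongqing Municipal Education Commission (KJZD-M202100801), and the Mathematical Statistics Team of Chongqing Technology and Business University (ZDPTTD201906).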

HU Xuemei, LI Jiali, JIANG Huifeng. Machine Learning Methods Investigate Liver Cancer Prediction Problem[J]. Journal of Systems Science and Mathematical Sciences, 2022, 42(2): 417-433.

Liver cancer has the second-highest fatality rate among all cancers. Because machine learning methods can improve the accuracy of disease prediction, this paper applies them to the early diagnosis of liver cancer in order to improve prediction accuracy. First, 10 indicators affecting liver cancer are selected as predictors, and 579 liver cancer patients are divided into two groups: 492 randomly selected patients form the training sample, and the remaining 87 form the testing sample. Next, the training sample is used to build six classifiers: logistic regression, $L_{2}$ penalized logistic regression, Support Vector Machine (SVM), Gradient Boosting Decision Tree (GBDT), Artificial Neural Network (ANN), and eXtreme Gradient Boosting (XGBoost). For logistic regression and $L_{2}$ penalized logistic regression, the Newton-Raphson algorithm yields iteratively reweighted least squares estimates of the model parameters; these give estimated probabilities that a patient's tumor cells are malignant or benign, and an optimal threshold is then chosen to predict tumor status. Finally, the testing sample is used to compute the confusion matrix, sensitivity, and specificity, and ROC curves are drawn to evaluate prediction accuracy. The results show that $L_{2}$ penalized logistic regression has the highest prediction accuracy, followed by SVM, XGBoost, logistic regression, and GBDT, while ANN and random forest have the lowest.
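
The pipeline described in the abstract (random 492/87 split, Newton-Raphson fitting of $L_{2}$ penalized logistic regression, threshold selection, then confusion matrix, sensitivity, and specificity on the testing sample) can be illustrated with a minimal Python sketch. It is not the authors' code: the data are synthetic stand-ins for the 10 clinical indicators, the penalty weight `LAM = 1.0` is arbitrary, the intercept is left unpenalized, and the threshold is chosen by maximizing Youden's J on the training sample, which is only one possible reading of "optimal threshold". All function names are hypothetical.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def predict(beta, X):
    """Estimated probability of class 1 for each row of X."""
    return [sigmoid(sum(b * v for b, v in zip(beta, xi))) for xi in X]

def fit_l2_logistic(X, y, lam, max_iter=50, tol=1e-8):
    """Newton-Raphson for log-likelihood minus (lam/2)*||beta||^2.
    Column 0 of X is the intercept and is not penalized (an assumption)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi, pi in zip(X, y, predict(beta, X)):
            w = pi * (1.0 - pi)  # IRLS weight for this observation
            for j in range(p):
                grad[j] += xi[j] * (yi - pi)
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        for j in range(1, p):  # ridge terms, skipping the intercept
            grad[j] -= lam * beta[j]
            hess[j][j] += lam
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def confusion(y, prob, thr):
    """Return (TP, FN, FP, TN), predicting class 1 when prob >= thr."""
    tp = sum(1 for t, q in zip(y, prob) if t == 1 and q >= thr)
    fn = sum(1 for t, q in zip(y, prob) if t == 1 and q < thr)
    fp = sum(1 for t, q in zip(y, prob) if t == 0 and q >= thr)
    tn = sum(1 for t, q in zip(y, prob) if t == 0 and q < thr)
    return tp, fn, fp, tn

def youden_threshold(y, prob):
    """Grid search for the threshold maximizing sensitivity + specificity - 1."""
    best_j, best_thr = -1.0, 0.5
    for i in range(1, 100):
        tp, fn, fp, tn = confusion(y, prob, i / 100)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_j, best_thr = j, i / 100
    return best_thr

# Synthetic stand-in data: 579 cases, intercept plus 10 predictors.
random.seed(0)
true_beta = [0.2] + [random.uniform(-1, 1) for _ in range(10)]
rows = []
for _ in range(579):
    x = [1.0] + [random.gauss(0, 1) for _ in range(10)]
    p1 = sigmoid(sum(b * v for b, v in zip(true_beta, x)))
    rows.append((x, 1 if random.random() < p1 else 0))
random.shuffle(rows)
train_rows, test_rows = rows[:492], rows[492:]
X_train, y_train = [r[0] for r in train_rows], [r[1] for r in train_rows]
X_test, y_test = [r[0] for r in test_rows], [r[1] for r in test_rows]

LAM = 1.0  # arbitrary penalty weight; in practice tuned, e.g. by cross-validation
beta = fit_l2_logistic(X_train, y_train, LAM)
thr = youden_threshold(y_train, predict(beta, X_train))
tp, fn, fp, tn = confusion(y_test, predict(beta, X_test), thr)
sens, spec = tp / (tp + fn), tn / (tn + fp)
print(f"threshold={thr:.2f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Sweeping the threshold over a grid and recording the (1 - specificity, sensitivity) pairs from `confusion` on the testing sample would trace out the ROC curve mentioned in the abstract.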
