Abstract
Liver cancer has the second-highest fatality rate among all cancers. Because machine learning methods can improve the accuracy of disease prediction, this paper applies them to the early diagnosis of liver cancer in order to improve prediction accuracy. First, 10 indicators affecting liver cancer are selected as predictors, and 579 liver cancer patients are divided into two groups: 492 randomly selected patients form the training sample, and the remaining 87 patients form the testing sample. Next, the training sample is used to build six classifiers: logistic regression, penalized logistic regression, Support Vector Machine (SVM), Gradient Boosting Decision Tree (GBDT), Artificial Neural Network (ANN), and eXtreme Gradient Boosting (XGBoost). Logistic regression and penalized logistic regression use the Newton-Raphson algorithm to obtain iteratively reweighted least squares estimates of the model parameters, compute the estimated probabilities that a patient's tumor cells are malignant or benign, and determine the optimal threshold for predicting tumor status. Finally, the testing sample is used to compute the confusion matrix, sensitivity, and specificity, and to draw ROC curves for evaluating prediction accuracy. The results show that penalized logistic regression achieves the highest prediction accuracy, followed in order by SVM, XGBoost, logistic regression, and GBDT, while ANN and random forest have the lowest prediction accuracy.
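The fitting and evaluation steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the data are synthetic stand-ins with the same sample sizes (579 patients, 10 predictors, 492/87 split), the penalty is assumed to be a ridge (L2) penalty, and the "optimal threshold" is assumed to be the one maximizing Youden's J (sensitivity + specificity - 1), since the abstract does not specify either choice. All function names are hypothetical.

```python
import numpy as np

def fit_logistic_irls(X, y, ridge=0.0, n_iter=50, tol=1e-8):
    """Newton-Raphson (IRLS) fit of logistic regression.

    ridge > 0 adds an assumed L2 penalty (intercept left unpenalized);
    ridge = 0 gives ordinary maximum likelihood.
    """
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])      # prepend intercept column
    penalty = ridge * np.eye(p + 1)
    penalty[0, 0] = 0.0                        # do not shrink the intercept
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-Xd @ beta))
        w = np.clip(prob * (1.0 - prob), 1e-10, None)   # IRLS weights
        grad = Xd.T @ (y - prob) - penalty @ beta       # penalized score
        hess = Xd.T @ (Xd * w[:, None]) + penalty       # penalized information
        step = np.linalg.solve(hess, grad)              # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def predict_proba(X, beta):
    """Estimated probability of the positive (malignant) class."""
    Xd = np.column_stack([np.ones(X.shape[0]), X])
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

def sensitivity_specificity(y_true, prob, threshold):
    """Confusion-matrix counts at a threshold, reduced to (sensitivity, specificity)."""
    pred = (prob >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    fp = np.sum((pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold(y_true, prob):
    """Threshold maximizing Youden's J (an assumed criterion)."""
    candidates = np.unique(prob)
    scores = [sum(sensitivity_specificity(y_true, prob, t)) - 1.0
              for t in candidates]
    return candidates[int(np.argmax(scores))]

# Synthetic stand-in data; the real study uses clinical indicators.
rng = np.random.default_rng(0)
n, p = 579, 10
X = rng.normal(size=(n, p))
true_beta = rng.normal(size=p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(int)

# Random 492/87 train/test split, as in the abstract.
idx = rng.permutation(n)
train, test = idx[:492], idx[492:]

beta_hat = fit_logistic_irls(X[train], y[train], ridge=1.0)
thr = best_threshold(y[train], predict_proba(X[train], beta_hat))
sens, spec = sensitivity_specificity(
    y[test], predict_proba(X[test], beta_hat), thr)
print(f"threshold={thr:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

The threshold is chosen on the training sample and then applied unchanged to the testing sample, so the reported sensitivity and specificity are out-of-sample. Sweeping the threshold over all candidate values and plotting sensitivity against 1 - specificity would trace the ROC curve.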
Key words
penalized logistic regression /
support vector machine /
gradient boosting decision tree /
artificial neural network /
eXtreme gradient boosting
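Funding
This work was supported by the Fifth Batch of the Chongqing Excellent Talents Support Program for Institutions of Higher Education ("Predicting Trend Movements of Stock Prices Based on Classification Methods"), the General Project of Basic Research and Frontier Exploration of the Chongqing Science and Technology Commission (cstc.2018jcyjA2073), the Chongqing "Statistics" Graduate Supervisor Team (yds183002), the Chongqing Social Science Planning Project (2019WT59), the Open Project of the Chongqing Key Laboratory of Social Economy and Applied Statistics (KFJJ2018066), the Major Project of the Science and Technology Research Program of the Chongqing Municipal Education Commission (KJZD-M202100801), and the Mathematical Statistics Team of Chongqing Technology and Business University (ZDPTTD201906).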