HU Xuemei, LI Jiali, JIANG Huifeng
Liver cancer has the second highest fatality rate among all cancers, and machine learning methods can improve the accuracy of disease prediction. In this paper we therefore apply machine learning methods to the pre-diagnosis of liver cancer, with the aim of improving its prediction accuracy. First, 10 indicators affecting liver cancer are selected as predictors, and 579 liver cancer patients are divided into two groups: a training sample of 492 randomly selected patients and a testing sample of the remaining 87 patients. Then, we use the training sample to build six classifiers: logistic regression, $L_{2}$ penalized logistic regression, Support Vector Machine (SVM), Gradient Boosting Decision Tree (GBDT), Artificial Neural Network (ANN) and eXtreme Gradient Boosting (XGBoost). For logistic regression and $L_{2}$ penalized logistic regression, we adopt the Newton-Raphson algorithm to obtain the iteratively reweighted least squares estimators of the model parameters, calculate the estimated probabilities that a patient's tumor cells are malignant or benign, and determine the optimal threshold for predicting tumor traits. Finally, the confusion matrix, sensitivity and specificity are calculated on the testing sample, and the ROC curve is drawn to evaluate the prediction accuracy. The results show that, in terms of prediction accuracy, $L_{2}$ penalized logistic regression ranks first, SVM second, XGBoost third, logistic regression fourth and GBDT fifth, while ANN and random forest perform worst.
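To make the fitting and evaluation steps named in the abstract concrete, the following is a minimal, dependency-free sketch of Newton-Raphson estimation for (optionally $L_{2}$ penalized) logistic regression, followed by threshold selection and the sensitivity and specificity computation. It is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, all function names are hypothetical, the intercept is left unpenalized, and the "optimal" threshold is assumed to maximize Youden's index (sensitivity + specificity − 1), a criterion the abstract does not specify.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def predict_proba(X, beta):
    """Estimated probability of the positive class for each row of X."""
    return [sigmoid(sum(b * v for b, v in zip(beta, row))) for row in X]

def fit_logistic_newton(X, y, lam=0.0, max_iter=50, tol=1e-8):
    """Newton-Raphson for logistic regression with penalty (lam/2)*||beta||^2.
    Each row of X starts with 1.0 for the intercept (assumed unpenalized).
    Each Newton step is equivalent to one weighted least squares update."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        probs = predict_proba(X, beta)
        # Gradient of the negative log-likelihood: X^T (p - y)
        grad = [sum(r[j] * (pi - yi) for r, pi, yi in zip(X, probs, y))
                for j in range(p)]
        # Hessian: X^T W X with W = diag(p_i * (1 - p_i))
        hess = [[sum(r[j] * r[k] * pi * (1.0 - pi) for r, pi in zip(X, probs))
                 for k in range(p)] for j in range(p)]
        for j in range(1, p):  # add the ridge terms, skipping the intercept
            grad[j] += lam * beta[j]
            hess[j][j] += lam
        step = solve(hess, grad)
        beta = [b - s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def sensitivity_specificity(y_true, probs, threshold):
    """Confusion-matrix counts at a threshold; assumes both classes occur."""
    tp = sum(1 for t, q in zip(y_true, probs) if t == 1 and q >= threshold)
    fn = sum(1 for t, q in zip(y_true, probs) if t == 1 and q < threshold)
    tn = sum(1 for t, q in zip(y_true, probs) if t == 0 and q < threshold)
    fp = sum(1 for t, q in zip(y_true, probs) if t == 0 and q >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold(y_true, probs):
    """Assumed criterion: maximize Youden's index over observed probabilities."""
    return max(sorted(set(probs)),
               key=lambda t: sum(sensitivity_specificity(y_true, probs, t)))

# Synthetic stand-in data (NOT the study's 579 patients or its 10 indicators).
random.seed(0)
X, y = [], []
for _ in range(300):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    X.append([1.0, x1, x2])
    y.append(1 if random.random() < sigmoid(0.5 + 2.0 * x1 - 1.0 * x2) else 0)
X_train, y_train, X_test, y_test = X[:240], y[:240], X[240:], y[240:]

beta = fit_logistic_newton(X_train, y_train, lam=1.0)
thr = best_threshold(y_train, predict_proba(X_train, beta))
sens, spec = sensitivity_specificity(y_test, predict_proba(X_test, beta), thr)
print("beta:", beta, "threshold:", thr, "sensitivity:", sens, "specificity:", spec)
```

Setting `lam=0.0` gives ordinary logistic regression, while `lam > 0` gives the ridge-penalized variant. Sweeping the threshold over all observed probabilities and plotting sensitivity against 1 − specificity would trace out the ROC curve mentioned above.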