When K-means is used for data clustering, the statistical distances and cluster centers vary with the processing method applied, which affects the clustering result; the effect becomes more pronounced as the data dimensionality increases. To address this, this paper proposes a multivariate statistical distance algorithm based on sample variance, and introduces an improved artificial bee colony algorithm together with an evaluation criterion function to determine the cluster centers and the optimal number of clusters, thereby optimizing the K-means algorithm. In theory, the method can overcome the original algorithm's tendency to become trapped in local optima and its reliance on a fixed number of clusters. Finally, the performance of the optimized algorithm is verified through outlier detection and through tests on artificial datasets and real UCI datasets.
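To make the idea of a variance-based distance concrete, the following is a minimal sketch only. The abstract does not give the paper's distance formula, so this assumes the common standardized Euclidean form, in which each squared coordinate difference is divided by that dimension's sample variance. The function names `variance_scaled_distance` and `assign_to_nearest` and the toy data are hypothetical, and the artificial bee colony search for the centers and the cluster number is not shown.

```python
from statistics import variance


def variance_scaled_distance(x, y, variances):
    """Euclidean distance with each squared difference divided by
    the sample variance of that dimension (assumed form, see lead-in)."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)) ** 0.5


def assign_to_nearest(points, centers, variances):
    """One K-means assignment step: index of the closest center per point."""
    return [
        min(range(len(centers)),
            key=lambda k: variance_scaled_distance(p, centers[k], variances))
        for p in points
    ]


# Toy data: dimension 0 separates the groups, dimension 1 is large-scale noise.
points = [(0.0, 0.0), (0.0, 100.0), (1.0, 0.0), (1.0, 100.0)]
centers = [(0.0, 60.0), (1.0, 40.0)]

# Per-dimension sample variance; a constant dimension falls back to 1.0
# so that the division is always defined.
sample_variances = [variance(dim) or 1.0 for dim in zip(*points)]

scaled_labels = assign_to_nearest(points, centers, sample_variances)
# Unit variances reduce the distance to the plain Euclidean distance.
plain_labels = assign_to_nearest(points, centers, [1.0, 1.0])

print(scaled_labels)  # [0, 0, 1, 1]
print(plain_labels)   # [1, 0, 1, 0]
```

In this toy case the plain Euclidean assignment is driven by the large-scale second coordinate, whereas the variance-scaled assignment groups the points by the first coordinate. This is the kind of scale sensitivity that a variance-based distance is meant to reduce.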