

Subspace Clustering of Heterogeneous-attribute Data Based on a New Distance Metric


Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:: 44

Issue:: No. 02, 2023

Pages:: 53-60

Publication date:: 2023-02-27

Article Info

Title:: Subspace Clustering of Heterogeneous-attribute Data Based on a New Distance Metric

Author(s):: DENG Xiuqin¹; ZHENG Liping¹; ZHANG Yiqun²; LIU Dongdong¹; 1. School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou 510520, Guangdong; 2. School of Computer Science, Guangdong University of Technology, Guangzhou 510006, Guangdong

Keywords:: heterogeneous-attribute data; ordinal attribute; distance metric; subspace clustering algorithm; dynamic weighting

CLC number:: O235; TP311.13

DOI:: 10.13705/j.issn.1671-6833.2023.02.007

Document code:: A

Abstract:: Real datasets often contain both categorical and numerical attributes, and categorical attributes can be further divided into ordinal and nominal attributes. Datasets with both categorical and numerical attributes are called heterogeneous-attribute data. Existing distance metrics for heterogeneous-attribute data do not distinguish the ordinal attributes among the categorical attributes, which loses information and degrades clustering performance. To address this problem, a subspace clustering algorithm for heterogeneous-attribute data based on a new distance metric is proposed. First, existing approaches to distance metrics for heterogeneous-attribute data and existing solutions for distinguishing ordinal attributes are summarized. Second, distance formulas between attribute values are defined separately for ordinal, nominal, and numerical attributes according to the characteristics of each attribute type. Third, dynamic weighting schemes for the different attributes during clustering are given based on two factors: inter-cluster difference and intra-cluster distance. Finally, the distance formulas and the weighting mechanism are combined into a distance metric applicable to heterogeneous-attribute data, from which the subspace clustering algorithm is designed. Because the algorithm both unifies the distance metric for heterogeneous-attribute data and searches for clusters in subspaces, it can achieve good clustering results on heterogeneous-attribute datasets. Comparative experiments on 11 real datasets verify its effectiveness.
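The idea of giving each attribute type its own distance and then combining them with per-attribute weights can be illustrated with a minimal sketch. It is not the formulas from the paper, which this page does not reproduce. As placeholder assumptions, ordinal distance is the normalized rank difference, nominal distance is a 0/1 mismatch, and numerical distance is the range-normalized absolute difference. The names `Attribute`, `attribute_distance`, and `weighted_distance` are hypothetical, and the weights are fixed here, whereas the paper updates them dynamically during clustering.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Attribute:
    """Hypothetical attribute descriptor, for illustration only."""
    kind: str                    # "ordinal", "nominal", or "numerical"
    levels: Tuple = ()           # ordered category levels (ordinal only)
    value_range: float = 1.0     # max - min over the dataset (numerical only)


def attribute_distance(attr: Attribute, a, b) -> float:
    """Distance in [0, 1] between two values of one attribute.

    These are placeholder formulas, not the ones defined in the paper.
    """
    if attr.kind == "nominal":
        # Unordered categories: only equality carries information.
        return 0.0 if a == b else 1.0
    if attr.kind == "ordinal":
        # Ordered categories: use the gap between ranks, scaled to [0, 1].
        if len(attr.levels) < 2:
            return 0.0
        gap = abs(attr.levels.index(a) - attr.levels.index(b))
        return gap / (len(attr.levels) - 1)
    if attr.kind == "numerical":
        # Numerical values: absolute difference scaled by the observed range.
        if attr.value_range == 0:
            return 0.0
        return abs(a - b) / attr.value_range
    raise ValueError(f"unknown attribute kind: {attr.kind!r}")


def weighted_distance(attrs: Sequence[Attribute],
                      weights: Sequence[float],
                      x: Sequence, y: Sequence) -> float:
    """Weighted sum of per-attribute distances between objects x and y."""
    return sum(w * attribute_distance(attr, a, b)
               for attr, w, a, b in zip(attrs, weights, x, y))


# Example with one attribute of each type (all values are made up).
attrs = [
    Attribute("ordinal", levels=("low", "medium", "high")),
    Attribute("nominal"),
    Attribute("numerical", value_range=50.0),
]
weights = [0.5, 0.25, 0.25]      # fixed placeholders, not learned weights
x = ("low", "red", 20.0)
y = ("high", "blue", 30.0)
print(round(weighted_distance(attrs, weights, x, y), 3))
```

In a subspace clustering setting, the weight vector would typically be kept per cluster, so that each cluster emphasizes the attributes along which it is compact and well separated.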


Last Update: 2023-02-25
