[1]温柳英,郑天浩.基于特征转换和少数类聚类的微生物数据扩增算法[J].郑州大学学报(工学版),2025,46(06):23-31.[doi:10.13705/j.issn.1671-6833.2025.06.006]
 WEN Liuying, ZHENG Tianhao. Microbial Data Augmentation Algorithm Based on Feature Transformation and Minority Clustering[J]. Journal of Zhengzhou University (Engineering Science), 2025, 46(06): 23-31. [doi: 10.13705/j.issn.1671-6833.2025.06.006]

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:
46
Issue:
2025(06)
Pages:
23-31
Publication Date:
2025-10-25

文章信息/Info

Title:
Microbial Data Augmentation Algorithm Based on Feature Transformation and Minority Clustering
Article Number:
1671-6833(2025)06-0023-09
作者:
温柳英 郑天浩
西南石油大学 计算机与软件学院,四川 成都 610500
Author(s):
WEN Liuying ZHENG Tianhao
School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu 610500, China
关键词:
微生物数据 高维 稀疏 类别不平衡 聚类 数据扩增
Keywords:
microbial data; high-dimensional; sparsity; class imbalance; cluster; data augmentation
CLC Number:
TP391; Q939.9; TP311.13
DOI:
10.13705/j.issn.1671-6833.2025.06.006
Document Code:
A
摘要:
微生物数据的高维、高零值率特性及少数类样本稀缺导致的类别不平衡,显著降低了分类器的少数类识别能力,而现有扩增算法对高不平衡比(IR)敏感且难以有效合成样本。针对此问题,提出了一种基于特征转换和少数类聚类的微生物数据扩增算法(FTMC)。首先,该算法在特征转换阶段采用主成分分析算法对高维数据进行降维,以缓解数据强稀疏性问题;其次,在少数类聚类阶段,使用K-means算法捕捉少数类的局部特征,获得多个聚类;再次,在聚类筛选阶段,基于每个聚类的密度和难度,结合IR和权重比来计算其权重值,并以此筛选出核心聚类子集,用于后续样本生成;最后,在样本扩增过滤阶段,利用线性插值算法,对筛选后的每个核心聚类进行样本扩增,并使用局部异常因子算法过滤异常点,确保扩增样本的质量。在12个微生物数据集上进行实验,并在3个分类器下对比8个同类型采样算法的性能,结果表明:FTMC生成的样本更具多样性,在Recall指标上平均提高了26.42%,证明该算法能正确识别更多的阳性样本。
Abstract:
The high dimensionality and high zero-value rate of microbial data, together with the scarcity of minority-class samples that causes class imbalance, significantly weaken a classifier's ability to identify the minority class, and existing augmentation algorithms are sensitive to high imbalance ratios (IR) and struggle to synthesize effective samples. To address this problem, a microbial data augmentation algorithm based on feature transformation and minority-class clustering (FTMC) was proposed. Firstly, in the feature transformation stage, principal component analysis was applied to reduce the dimensionality of the high-dimensional data and alleviate its strong sparsity. Secondly, in the minority-class clustering stage, the K-means algorithm was used to capture the local features of the minority class and obtain multiple clusters. Thirdly, in the cluster screening stage, a weight value was calculated for each cluster from its density and difficulty, combined with the IR and a weight ratio, and used to screen out a subset of core clusters for subsequent sample generation. Finally, in the sample augmentation and filtering stage, linear interpolation was used to augment samples within each selected core cluster, and the local outlier factor algorithm was used to filter out anomalous points to ensure the quality of the augmented samples. Experiments were conducted on 12 microbial datasets, and the performance was compared with that of 8 sampling algorithms of the same type under 3 classifiers. The results showed that the samples generated by FTMC were more diverse, with an average improvement of 26.42% in the Recall metric, demonstrating that the algorithm could correctly identify more positive samples.
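
To make the four-stage pipeline summarized in the abstract easier to follow, the sketch below assembles it from standard scikit-learn components (PCA, KMeans, LocalOutlierFactor) plus NumPy linear interpolation. It is a minimal illustration under stated assumptions, not the authors' implementation: the paper's cluster-weighting formula combining density, difficulty, IR, and weight ratio is not detailed in the abstract, so the density-only score, the top_ratio cutoff, and the balancing target used here are placeholders.

# Illustrative sketch of the FTMC pipeline (PCA -> K-means on the minority
# class -> cluster screening -> interpolation + LOF filtering). The cluster
# score below is a simplified density proxy, not the paper's weight formula.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

def ftmc_sketch(X, y, minority_label=1, n_components=10, n_clusters=5,
                top_ratio=0.5, random_state=0):
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)

    # Stage 1: feature transformation with PCA to reduce dimensionality/sparsity.
    Z = PCA(n_components=n_components, random_state=random_state).fit_transform(X)
    Z_min = Z[y == minority_label]
    ir = (y != minority_label).sum() / max((y == minority_label).sum(), 1)

    # Stage 2: K-means clustering of the minority class only.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(Z_min)

    # Stage 3: score each cluster (here: inverse mean distance to its centroid)
    # and keep the top fraction as "core" clusters.
    scores = []
    for c in range(n_clusters):
        members = Z_min[km.labels_ == c]
        if len(members) < 2:
            scores.append(0.0)
            continue
        mean_dist = np.linalg.norm(members - km.cluster_centers_[c], axis=1).mean()
        scores.append(1.0 / (mean_dist + 1e-9))
    core = np.argsort(scores)[::-1][:max(1, int(top_ratio * n_clusters))]

    # Stage 4: linear interpolation inside each core cluster, then LOF filtering.
    n_new = max(int((ir - 1) * (y == minority_label).sum()), 0)  # rough balancing target
    synthetic = []
    for c in core:
        members = Z_min[km.labels_ == c]
        if len(members) < 2:
            continue
        for _ in range(n_new // len(core)):
            a, b = members[rng.choice(len(members), size=2, replace=False)]
            synthetic.append(a + rng.random() * (b - a))
    if not synthetic:
        return Z, y
    synthetic = np.asarray(synthetic)
    if len(synthetic) > 2:
        keep = LocalOutlierFactor(n_neighbors=min(20, len(synthetic) - 1)).fit_predict(synthetic) == 1
        synthetic = synthetic[keep]
    X_aug = np.vstack([Z, synthetic])
    y_aug = np.concatenate([y, np.full(len(synthetic), minority_label)])
    return X_aug, y_aug  # note: returned features live in the PCA-transformed space

Restricting interpolation to the screened core clusters keeps synthetic points inside dense minority regions, and the final LOF pass then removes any interpolated points that still fall in low-density areas, which mirrors the quality-control role described in the abstract.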

参考文献/References:

[1]MEGAHED F M, CHEN Y J, MEGAHED A, et al. The class imbalance problem[J]. Nature Methods, 2021, 18(11): 1270-1272. 

[2]田鸿朋,张震,张思源,等.复合可靠性分析下的不平衡数据证据分类[J].郑州大学学报(工学版),2023, 44(4):22-28. 
TIAN H P, ZHANG Z, ZHANG S Y, et al. Imbalanced data evidential classification with composite reliability[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(4): 22-28. 
[3]THABTAH F, HAMMOUD S, KAMALOV F, et al. Data imbalance in classification: experimental evaluation[J]. Information Sciences, 2020, 513: 429-441. 
[4]CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357. 
[5]HE H B, BAI Y, GARCIA E A, et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning[C]∥2008 IEEE International Joint Conference on Neural Networks. Piscataway: IEEE, 2008: 1322-1328. 
[6]MOREO A, ESULI A, SEBASTIANI F. Distributional random oversampling for imbalanced text classification[C]∥Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2016: 805-808. 
[7]王曦, 温柳英, 闵帆. 融合矩阵分解和代价敏感的微生物数据扩增算法[J]. 数据采集与处理, 2023, 38 (2): 401-412. 
WANG X, WEN L Y, MIN F. Fusing matrix factorization and cost-sensitive microbial data augmentation algorithm[J]. Journal of Data Acquisition and Processing, 2023, 38(2): 401-412. 
[8]温柳英, 吴俊, 闵帆. 融合矩阵分解和空间划分的微生物数据扩增方法[J]. 山东大学学报(理学版), 2025, 60(1): 14-28,44. 
WEN L Y, WU J, MIN F. Fusing matrix factorization and space partition microbial data augmentation algorithm [J]. Journal of Shandong University (Natural Science), 2025, 60(1): 14-28, 44. 
[9]LI Y, HUANG C, DING L Z, et al. Deep learning in bioinformatics: introduction, application, and perspective in the big data era[J]. Methods, 2019, 166: 4-21. 
[10] LI Y X, CHAI Y, YIN H P, et al. A novel feature learning framework for high-dimensional data classification [J]. International Journal of Machine Learning and Cybernetics, 2021, 12(2): 555-569. 
[11]WEN L Y, ZHANG X M, LI Q F, et al. KGA: integrating KPCA and GAN for microbial data augmentation [J]. International Journal of Machine Learning and Cybernetics, 2023, 14(4): 1427-1444. 
[12] FENG W, HUANG W J, REN J C. Class imbalance ensemble learning based on the margin theory[J]. Applied Sciences, 2018, 8(5): 815. 
[13] ABDI H, WILLIAMS L J. Principal component analysis [J]. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(4): 433-459. 
[14] AHMED M, SERAJ R, ISLAM S M S. The k-means algorithm: a comprehensive survey and performance evaluation[J]. Electronics, 2020, 9(8): 1295.
[15] BREUNIG M M, KRIEGEL H P, NG R T, et al. LOF: identifying density-based local outliers[C]∥Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2000: 93-104. 
[16] HE X, ZHAO K Y, CHU X W. AutoML: a survey of the state-of-the-art[J]. Knowledge-Based Systems, 2021, 212: 106622. 
[17] DOUZAS G, BACAO F, LAST F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE[J]. Information Sciences, 2018, 465: 1-20. 
[18] DENG D S. DBSCAN clustering algorithm based on density[C]∥2020 7th International Forum on Electrical Engineering and Automation (IFEEA). Piscataway: IEEE, 2020: 949-953. 
[19] ANKERST M, BREUNIG M M, KRIEGEL H P, et al. OPTICS: ordering points to identify the clustering structure[C]∥Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. New York: ACM, 1999: 49-60.
[20] SCHILT K. The importance of being Agnes[J]. Symbolic Interaction, 2016, 39(2): 287-294.

更新日期/Last Update: 2025-10-21