CAO Yangjie, WANG Weiping, LI Zhenqiang, et al. Multimodal Scene Editing Algorithm Integrating CLIP and 3D Gaussian[J]. Journal of Zhengzhou University (Engineering Science), 2025, 46(05): 35-42. [doi:10.13705/j.issn.1671-6833.2025.05.016]

Multimodal Scene Editing Algorithm Integrating CLIP and 3D Gaussian

Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume:
46
Issue:
2025, No. 05
Pages:
35-42
Publication date:
2025-08-10

Article Info

Title:
Multimodal Scene Editing Algorithm Integrating CLIP and 3D Gaussian
Article number:
1671-6833(2025)05-0035-08
Author(s):
CAO Yangjie, WANG Weiping, LI Zhenqiang, XIE Jun, LYU Runfeng
School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China
Keywords:
3D reconstruction; zero-shot learning; scene understanding; scene editing; 3D Gaussian
CLC number:
TP391; TP751.1
DOI:
10.13705/j.issn.1671-6833.2025.05.016
Document code:
A
Abstract:
To address the heavy reliance on annotated data and the high computational complexity of existing 3D scene editing algorithms, a multimodal scene editing algorithm integrating CLIP and 3D Gaussian splatting, named CLIP2Gaussian, is proposed. First, SAM is used to extract target masks from multi-view images, and a bidirectional propagation strategy is introduced to keep the masks consistent across views. Second, the extracted masks are assigned semantic labels by CLIP and mapped onto the 3D Gaussian points, embedding semantics into the 3D scene. Finally, a differentiable rendering mechanism is adopted to optimize the 3D Gaussian parameters, and a clustering-based spatial consistency regularization strategy is introduced to enhance the consistency and stability of the semantic labels in 3D space. Experimental results show that CLIP2Gaussian achieves an IoU of 61.23% on the LERF dataset and a per-text-query response time of 0.57 s in semantic segmentation, roughly 54 times faster than LERF, outperforming it in both accuracy and efficiency. Ablation studies further verify that the proposed algorithm precisely edits target regions while minimally perturbing the original scene.
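As an illustration of the second step, the minimal sketch below assigns a CLIP semantic label to a single SAM mask by cropping the mask's bounding box, blanking background pixels, and scoring the crop against candidate text prompts. The crop-and-score heuristic, the ViT-B/32 backbone, and the prompt template are assumptions made for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch: label one SAM mask with CLIP (not the paper's exact code).
# Assumes OpenAI's reference CLIP package (pip install git+https://github.com/openai/CLIP).
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def label_mask(image: np.ndarray, mask: np.ndarray, candidate_labels: list[str]) -> str:
    """image: (H, W, 3) uint8 RGB; mask: (H, W) bool from SAM."""
    ys, xs = np.nonzero(mask)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
    # Blank background pixels so CLIP scores only the masked object.
    crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0
    image_input = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
    text_input = clip.tokenize([f"a photo of a {c}" for c in candidate_labels]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)
    # Cosine similarity between the crop and each candidate label.
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)
    return candidate_labels[int(sims.argmax())]
```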
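The clustering-based spatial consistency regularization from the final step can be sketched in the same spirit: cluster the Gaussian centers and smooth the per-Gaussian semantic labels by majority vote within each cluster. DBSCAN and the eps/min_samples values below are assumptions; the abstract states only that clustering is used to stabilize labels in 3D space.

```python
# Hypothetical sketch: smooth per-Gaussian semantic labels within spatial clusters.
import numpy as np
from sklearn.cluster import DBSCAN

def smooth_labels(centers: np.ndarray, labels: np.ndarray,
                  eps: float = 0.05, min_samples: int = 10) -> np.ndarray:
    """centers: (N, 3) Gaussian means; labels: (N,) non-negative integer label ids."""
    clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    smoothed = labels.copy()
    for c in np.unique(clusters):
        if c == -1:                      # DBSCAN noise points keep their own labels
            continue
        members = clusters == c
        majority = np.bincount(labels[members]).argmax()
        smoothed[members] = majority     # enforce one label per spatial cluster
    return smoothed
```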

Last Update: 2025-09-19