[1] Li Lin, Li Yuze, Zhang Yujia, et al. Mean of Multiple Estimators-based Deep Deterministic Policy Gradient Algorithm[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(02): 15-21. [doi:10.13705/j.issn.1671-6833.2022.02.013]

Mean of Multiple Estimators-based Deep Deterministic Policy Gradient Algorithm

Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume: 43
Issue: 2022(02)
Pages: 15-21
Publication Date: 2022-02-27

Article Info

Title:
Mean of Multiple Estimators-based Deep Deterministic Policy Gradient Algorithm
Author(s):
Li Lin; Li Yuze; Zhang Yujia; Wei Wei
School of Computer and Information Technology, Shanxi University; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University

Keywords:
DOI:
10.13705/j.issn.1671-6833.2022.02.013
Document Code:
A
Abstract:
In deep reinforcement learning, an algorithm's performance is closely tied to its stability and to the accuracy of its value estimates. The overestimation problem of traditional deep reinforcement learning, together with the suboptimal policies it induces, persists even under the Actor-Critic framework. The recent Twin Delayed Deep Deterministic policy gradient algorithm (TD3) addresses overestimation by taking the smaller of the values produced by a pair of estimator networks. However, this minimization also introduces underestimation: the estimated Q-values fall below the true values, which degrades the overall performance of the model. Building on TD3, this paper proposes the Mean of Multiple Estimators-based Deterministic Policy Gradient Algorithm (MME-DDPG). On top of taking the smaller output of a pair of estimators, MME-DDPG adds the mean output of several independently trained estimators and averages the two terms, thereby alleviating the underestimation problem and reducing the estimation variance. We analyze the superiority and stability of MME-DDPG theoretically, and experiments in four MuJoCo continuous control environments show that MME-DDPG outperforms TD3 and DDPG.
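The target-value rule sketched in the abstract can be made concrete. Below is a minimal, illustrative PyTorch reading of it, not the authors' implementation: the function name mme_target_q, its argument layout, and the convention that the first two entries of target_critics form the TD3-style clipped pair are all assumptions made for this example.

import torch

def mme_target_q(target_critics, reward, next_state, next_action,
                 done, gamma=0.99):
    """Illustrative MME-DDPG-style target value (all names are assumptions).

    target_critics: list of >= 3 target Q-networks; the first two act as
    the TD3 clipped pair, the rest are the extra, independently trained
    estimators described in the abstract.
    """
    with torch.no_grad():
        q_values = [q(next_state, next_action) for q in target_critics]
        # TD3 term: elementwise minimum over the clipped pair
        # (suppresses overestimation but is biased low).
        clipped_pair = torch.min(q_values[0], q_values[1])
        # MME term: mean over the remaining, separately trained estimators.
        extra_mean = torch.stack(q_values[2:]).mean(dim=0)
        # Average the two terms to offset the underestimation the min
        # operator introduces and to reduce estimation variance.
        q_next = 0.5 * (clipped_pair + extra_mean)
        return reward + gamma * (1.0 - done) * q_next

Averaging the clipped-pair term with a plain ensemble mean trades the two biases against each other: the minimum is pessimistic while the mean of independently trained estimators is not, so their average sits between the two, and the ensemble lowers the variance, matching the motivation given in the abstract.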
Last Update: 2022-02-25