LI Lin, LI Yuze, ZHANG Yujia, et al. Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(02): 15-21. [doi:10.13705/j.issn.1671-6833.2022.02.013]

Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume: 43
Issue: No. 02, 2022
Pages: 15-21
Publication Date: 2022-02-27

Article Info

Title:
Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators
Author(s):
LI Lin¹,², LI Yuze¹, ZHANG Yujia¹, WEI Wei¹,²
1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China;
2. Key Laboratory of Computational Intelligence and Chinese Information Processing, Ministry of Education, Shanxi University, Taiyuan 030006, China
Keywords:
reinforcement learning; actor-critic; underestimation; multiple estimators; policy gradient
CLC Number:
TP391
DOI:
10.13705/j.issn.1671-6833.2022.02.013
Document Code:
A
Abstract (Chinese):
In deep reinforcement learning, an algorithm's performance is closely tied to its stability and the accuracy of its value estimates. The overestimation problem of traditional deep reinforcement learning, and the suboptimal policies it induces, persist even under the actor-critic framework. The recent Twin Delayed Deep Deterministic policy gradient (TD3) algorithm addresses overestimation by taking the smaller of the values produced by a pair of estimator networks. However, this minimization also introduces underestimation: the estimated Q-values fall below the true values, which in turn degrades the overall performance of the model. Building on TD3, this paper proposes a deterministic policy gradient algorithm based on the mean of multiple estimators (Mean of Multiple Estimators-based Deterministic Policy Gradient Algorithm, MME-DDPG). On top of taking the smaller output of a pair of estimators, MME-DDPG adds the mean output of several independently trained estimators and averages the two quantities, thereby alleviating underestimation and reducing estimation variance. The superiority and stability of MME-DDPG are analyzed theoretically, and experiments in four MuJoCo continuous control environments show that MME-DDPG outperforms TD3 and DDPG.
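A worked form of the target value described above, written as a minimal sketch: it assumes k target critic networks Q'_1, ..., Q'_k and standard TD3 notation for the reward r, discount factor γ, next state s′, and next action ã, none of which are spelled out in the abstract itself.

y = r + \gamma \cdot \tfrac{1}{2}\Big[ \min\big(Q'_{1}(s',\tilde{a}),\, Q'_{2}(s',\tilde{a})\big) + \tfrac{1}{k-2}\sum_{i=3}^{k} Q'_{i}(s',\tilde{a}) \Big]

Each critic is then trained toward this common target y via the TD error, which is the averaging step the abstract describes.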
Abstract:
To address the underestimation problem of the twin delayed deep deterministic policy gradient (TD3) algorithm within the reinforcement learning actor-critic framework, a deep deterministic policy gradient algorithm based on the mean of multiple estimators (DDPG-MME) was proposed. The method contained one actor and k (k > 3) critics. First, the minimum of the outputs of two critics and the mean of the remaining (k-2) critics were computed; the average of these two values was then taken as the final value for calculating the TD error. Finally, the critic networks were updated from the TD error, and the actor network was updated using the value of the first critic. This weighting operation could alleviate the underestimation problem of TD3 and reduce the estimation variance to a certain extent, yielding more accurate Q-value estimation. The expectation and variance of the estimation error of the proposed method, deep deterministic policy gradient (DDPG), and TD3 were analyzed theoretically, demonstrating the accuracy and stability of the method. Experimental results in four MuJoCo continuous control environments (Reacher-v2, HalfCheetah-v2, InvertedPendulum-v2, and InvertedDoublePendulum-v2) showed that, under the same hyperparameter settings as the comparison algorithms (network structure, reward function, environment parameters, batch size, learning rate, optimizer, and discount factor), the final performance and stability of the proposed algorithm were significantly better than those of TD3 and DDPG.
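For concreteness, the target-value computation described in the abstract can be sketched in a few lines of PyTorch; the function name mme_td_target, the critic_targets list, and all argument names below are illustrative assumptions, not the authors' released implementation.

import torch

def mme_td_target(reward, next_state, next_action, critic_targets, gamma=0.99):
    """Sketch of the target value described in the abstract.

    critic_targets : list of k (k > 3) target critic networks mapping
                     (state, action) -> Q-value tensor; names are assumptions.
    """
    # Evaluate every target critic at the next state-action pair.
    qs = [q(next_state, next_action) for q in critic_targets]

    q_min = torch.min(qs[0], qs[1])            # clipped double-Q term over the first two critics
    q_mean = torch.stack(qs[2:]).mean(dim=0)   # mean of the remaining k - 2 critics

    # Average the two quantities and form the one-step TD target.
    return reward + gamma * 0.5 * (q_min + q_mean)

Each of the k critics would then be regressed toward this single target, with the actor updated from the value of the first critic, as the abstract states.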

Last Update: 2022-02-25