[1] Li Lin, Li Yuze, Zhang Yujia, et al. Mean of Multiple Estimators-based Deep Deterministic Policy Gradient Algorithm[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(02): 15-21. [doi:10.13705/j.issn.1671-6833.2022.02.013]

Mean of Multiple Estimators-based Deep Deterministic Policy Gradient Algorithm

Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume: 43
Issue: 2022(02)
Pages: 15-21
Publication Date: 2022-02-27

Article Info

Title:
Mean of Multiple Estimators-based Deep Deterministic Policy Gradient Algorithm
Author(s):
Li Lin; Li Yuze; Zhang Yujia; Wei Wei
School of Computer and Information Technology, Shanxi University; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University

Keywords:
DOI:
10.13705/j.issn.1671-6833.2022.02.013
Document Code:
A
Abstract:
In deep reinforcement learning, an algorithm's performance is closely tied to its stability and to the accuracy of its value estimates. The overestimation problem of traditional deep reinforcement learning, together with the suboptimal policies it induces, persists even under the Actor-Critic framework. The recent Twin Delayed Deep Deterministic policy gradient algorithm (TD3) addresses overestimation by taking the smaller of the values produced by a pair of estimator networks. However, this minimization also introduces underestimation: the estimated Q-values fall below the true values, which degrades the overall performance of the model. Building on TD3, this paper proposes the Mean of Multiple Estimators-based Deterministic Policy Gradient Algorithm (MME-DDPG). On top of taking the smaller output of a pair of estimators, MME-DDPG adds the mean output of several independently trained estimators and averages the two terms, thereby alleviating the underestimation problem and reducing the estimation variance. We analyze the superiority and stability of MME-DDPG theoretically, and experiments in four MuJoCo continuous control environments show that MME-DDPG outperforms TD3 and DDPG.
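The target-value rule sketched in the abstract can be made concrete. Below is a minimal, illustrative PyTorch reading of it, not the authors' implementation: the function name mme_target_q, its argument layout, and the convention that the first two entries of target_critics form the TD3-style clipped pair are all assumptions made for this example.

import torch

def mme_target_q(target_critics, reward, next_state, next_action,
                 done, gamma=0.99):
    """Illustrative MME-DDPG-style target value (all names are assumptions).

    target_critics: list of >= 3 target Q-networks; the first two act as
    the TD3 clipped pair, the rest are the extra, independently trained
    estimators described in the abstract.
    """
    with torch.no_grad():
        q_values = [q(next_state, next_action) for q in target_critics]
        # TD3 term: elementwise minimum over the clipped pair
        # (suppresses overestimation but is biased low).
        clipped_pair = torch.min(q_values[0], q_values[1])
        # MME term: mean over the remaining, separately trained estimators.
        extra_mean = torch.stack(q_values[2:]).mean(dim=0)
        # Average the two terms to offset the underestimation the min
        # operator introduces and to reduce estimation variance.
        q_next = 0.5 * (clipped_pair + extra_mean)
        return reward + gamma * (1.0 - done) * q_next

Averaging the clipped-pair term with a plain ensemble mean trades the two biases against each other: the minimum is pessimistic while the mean of independently trained estimators is not, so their average sits between the two, and the ensemble lowers the variance, matching the motivation given in the abstract.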
Last Update: 2022-02-25