[1] LI Lin, LI Yuze, ZHANG Yujia, et al. Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(02): 15-21. [doi:10.13705/j.issn.1671-6833.2022.02.013]
Journal of Zhengzhou University (Engineering Science) [ISSN 1671-6833 / CN 41-1339/T]
Volume: 43
Issue: 2022(02)
Pages: 15-21
Publication date: 2022-02-27
Title: Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators
Author(s): LI Lin 1,2; LI Yuze 1; ZHANG Yujia 1; WEI Wei 1,2
1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China;
2. Key Laboratory of Computational Intelligence and Chinese Information Processing, Ministry of Education, Shanxi University, Taiyuan 030006, China
Keywords: reinforcement learning; actor-critic; underestimation; multiple estimators; policy gradient
CLC: TP391
DOI: 10.13705/j.issn.1671-6833.2022.02.013
Abstract: To address the underestimation problem of the twin delayed deep deterministic policy gradient (TD3) algorithm within the reinforcement learning actor-critic framework, a deep deterministic policy gradient algorithm based on the mean of multiple estimators (DDPG-MME) was proposed. The method contains one actor and k (k > 3) critics. It first computes the minimum of the outputs of two critics and the mean of the outputs of the remaining (k-2) critics, and then takes the average of these two values as the target for computing the TD error. The critic networks are updated according to the TD error, and the actor network is updated according to the value of the first critic. This weighting operation alleviates the underestimation of TD3 and reduces the estimation variance to a certain extent, yielding more accurate Q-value estimation. The expectation and variance of the estimation error of the proposed method, DDPG, and TD3 were analyzed theoretically, demonstrating the accuracy and stability of the method. Experiments in four MuJoCo continuous control environments (Reacher-v2, HalfCheetah-v2, InvertedPendulum-v2, and InvertedDoublePendulum-v2), conducted under the same hyperparameter settings as the comparison algorithms (network structure, reward function, environment parameters, batch size, learning rate, optimizer, and discount factor), showed that the final performance and stability of DDPG-MME were significantly better than those of TD3 and DDPG.
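As a rough illustration of the target-value rule described in the abstract, the sketch below shows how the k critic outputs could be combined. This is not the authors' code: it assumes PyTorch, and the network sizes, the state/action dimensions, and all names (critics, target_critics, actor_target, mme_target, losses) are illustrative choices in the style of common TD3-like implementations.

# Minimal sketch of the DDPG-MME target-value rule (assumptions noted above).
import torch
import torch.nn as nn

K = 4  # number of critics; the abstract requires k > 3

def mlp(in_dim, out_dim):
    # Small fully connected network; the architecture is an assumption.
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

state_dim, action_dim = 17, 6  # HalfCheetah-v2-like sizes (assumption)
critics = [mlp(state_dim + action_dim, 1) for _ in range(K)]
target_critics = [mlp(state_dim + action_dim, 1) for _ in range(K)]
actor_target = mlp(state_dim, action_dim)

def mme_target(next_state, reward, done, gamma=0.99):
    # Target value: average of (i) the min over two target critics and
    # (ii) the mean over the remaining K-2 target critics.
    with torch.no_grad():
        next_action = torch.tanh(actor_target(next_state))
        sa = torch.cat([next_state, next_action], dim=-1)
        qs = torch.stack([c(sa) for c in target_critics])  # shape (K, batch, 1)
        q_min = torch.min(qs[0], qs[1])    # pessimistic term, as in TD3
        q_mean = qs[2:].mean(dim=0)        # mean over the other K-2 critics
        q_final = 0.5 * (q_min + q_mean)   # average of the two values
        return reward + gamma * (1.0 - done) * q_final

def losses(state, action, next_state, reward, done, actor):
    # Every critic regresses onto the shared target (the TD error); the actor
    # ascends the first critic's value, as the abstract specifies.
    y = mme_target(next_state, reward, done)
    sa = torch.cat([state, action], dim=-1)
    critic_loss = sum(nn.functional.mse_loss(c(sa), y) for c in critics)
    new_sa = torch.cat([state, torch.tanh(actor(state))], dim=-1)
    actor_loss = -critics[0](new_sa).mean()
    return critic_loss, actor_loss

# Example with a dummy batch (shapes only; values are random):
s, a = torch.randn(32, state_dim), torch.randn(32, action_dim)
s2, r, d = torch.randn(32, state_dim), torch.randn(32, 1), torch.zeros(32, 1)
c_loss, a_loss = losses(s, a, s2, r, d, mlp(state_dim, action_dim))

Averaging the pessimistic min term (as in TD3) with the mean over the remaining k-2 critics is, per the abstract, what offsets TD3's underestimation while still reducing estimation variance through ensembling.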