[1] LI Lin, LI Yuze, ZHANG Yujia, et al. Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(02): 15-21. [doi:10.13705/j.issn.1671-6833.2022.02.013]
Journal of Zhengzhou University (Engineering Science) [ISSN 1671-6833 / CN 41-1339/T]
Volume: 43
Issue: 2022(02)
Pages: 15-21
Publication date: 2022-02-27
Title: Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators
Author(s): LI Lin 1,2; LI Yuze 1; ZHANG Yujia 1; WEI Wei 1,2
1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China;
2. Key Laboratory of Computational Intelligence and Chinese Information Processing, Ministry of Education, Shanxi University, Taiyuan 030006, China
Keywords: reinforcement learning; actor-critic; underestimation; multiple estimators; policy gradient
CLC: TP391
DOI: 10.13705/j.issn.1671-6833.2022.02.013
Abstract: To address the underestimation problem of the twin delayed deep deterministic policy gradient (TD3) algorithm within the reinforcement learning actor-critic framework, a deep deterministic policy gradient algorithm based on the mean of multiple estimators (DDPG-MME) was proposed. The method contains one actor and k (k > 3) critics. It first computes the minimum of the outputs of two critics and the mean of the outputs of the remaining (k-2) critics, and then takes the average of these two values as the target for computing the TD error. The critic networks are updated according to the TD error, and the actor network is updated according to the value of the first critic. This weighting operation alleviates the underestimation of TD3 and reduces the estimation variance to a certain extent, yielding more accurate Q-value estimation. The expectation and variance of the estimation error of the proposed method, DDPG, and TD3 were analyzed theoretically, demonstrating the accuracy and stability of the method. Experiments in four MuJoCo continuous control environments (Reacher-v2, HalfCheetah-v2, InvertedPendulum-v2, and InvertedDoublePendulum-v2), conducted under the same hyperparameter settings as the comparison algorithms (network structure, reward function, environment parameters, batch size, learning rate, optimizer, and discount factor), showed that the final performance and stability of DDPG-MME were significantly better than those of TD3 and DDPG.
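As a rough illustration of the target-value rule described in the abstract, the sketch below shows how the k critic outputs could be combined. This is not the authors' code: it assumes PyTorch, and the network sizes, the state/action dimensions, and all names (critics, target_critics, actor_target, mme_target, losses) are illustrative choices in the style of common TD3-like implementations.

# Minimal sketch of the DDPG-MME target-value rule (assumptions noted above).
import torch
import torch.nn as nn

K = 4  # number of critics; the abstract requires k > 3

def mlp(in_dim, out_dim):
    # Small fully connected network; the architecture is an assumption.
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

state_dim, action_dim = 17, 6  # HalfCheetah-v2-like sizes (assumption)
critics = [mlp(state_dim + action_dim, 1) for _ in range(K)]
target_critics = [mlp(state_dim + action_dim, 1) for _ in range(K)]
actor_target = mlp(state_dim, action_dim)

def mme_target(next_state, reward, done, gamma=0.99):
    # Target value: average of (i) the min over two target critics and
    # (ii) the mean over the remaining K-2 target critics.
    with torch.no_grad():
        next_action = torch.tanh(actor_target(next_state))
        sa = torch.cat([next_state, next_action], dim=-1)
        qs = torch.stack([c(sa) for c in target_critics])  # shape (K, batch, 1)
        q_min = torch.min(qs[0], qs[1])    # pessimistic term, as in TD3
        q_mean = qs[2:].mean(dim=0)        # mean over the other K-2 critics
        q_final = 0.5 * (q_min + q_mean)   # average of the two values
        return reward + gamma * (1.0 - done) * q_final

def losses(state, action, next_state, reward, done, actor):
    # Every critic regresses onto the shared target (the TD error); the actor
    # ascends the first critic's value, as the abstract specifies.
    y = mme_target(next_state, reward, done)
    sa = torch.cat([state, action], dim=-1)
    critic_loss = sum(nn.functional.mse_loss(c(sa), y) for c in critics)
    new_sa = torch.cat([state, torch.tanh(actor(state))], dim=-1)
    actor_loss = -critics[0](new_sa).mean()
    return critic_loss, actor_loss

# Example with a dummy batch (shapes only; values are random):
s, a = torch.randn(32, state_dim), torch.randn(32, action_dim)
s2, r, d = torch.randn(32, state_dim), torch.randn(32, 1), torch.zeros(32, 1)
c_loss, a_loss = losses(s, a, s2, r, d, mlp(state_dim, action_dim))

Averaging the pessimistic min term (as in TD3) with the mean over the remaining k-2 critics is, per the abstract, what offsets TD3's underestimation while still reducing estimation variance through ensembling.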