LI Lin, LI Yuze, ZHANG Yujia, et al. Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(02): 15-21. [doi:10.13705/j.issn.1671-6833.2022.02.013]

Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume: 43
Issue: No. 02, 2022
Pages: 15-21
Publication Date: 2022-02-27

Article Info

Title:
Deep Deterministic Policy Gradient Algorithm Based on Mean of Multiple Estimators
Author(s):
LI Lin¹,², LI Yuze¹, ZHANG Yujia¹, WEI Wei¹,²
1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China;
2. Key Laboratory of Computational Intelligence and Chinese Information Processing, Ministry of Education, Shanxi University, Taiyuan 030006, China
Keywords:
reinforcement learning; actor-critic; underestimation; multiple estimators; policy gradient
CLC Number:
TP391
DOI:
10.13705/j.issn.1671-6833.2022.02.013
Document Code:
A
Abstract (Chinese):
In deep reinforcement learning, an algorithm's performance is closely tied to its stability and the accuracy of its value estimates. The overestimation problem of traditional deep reinforcement learning, and the suboptimal policies it induces, persist even under the actor-critic framework. The recent Twin Delayed Deep Deterministic policy gradient (TD3) algorithm addresses overestimation by taking the smaller of the values produced by a pair of estimator networks. However, this minimization also introduces underestimation: the estimated Q-values fall below the true values, which in turn degrades the overall performance of the model. Building on TD3, this paper proposes a deterministic policy gradient algorithm based on the mean of multiple estimators (Mean of Multiple Estimators-based Deterministic Policy Gradient Algorithm, MME-DDPG). On top of taking the smaller output of a pair of estimators, MME-DDPG adds the mean output of several independently trained estimators and averages the two quantities, thereby alleviating underestimation and reducing estimation variance. The superiority and stability of MME-DDPG are analyzed theoretically, and experiments in four MuJoCo continuous control environments show that MME-DDPG outperforms TD3 and DDPG.
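A worked form of the target value described above, written as a minimal sketch: it assumes k target critic networks Q'_1, ..., Q'_k and standard TD3 notation for the reward r, discount factor γ, next state s′, and next action ã, none of which are spelled out in the abstract itself.

y = r + \gamma \cdot \tfrac{1}{2}\Big[ \min\big(Q'_{1}(s',\tilde{a}),\, Q'_{2}(s',\tilde{a})\big) + \tfrac{1}{k-2}\sum_{i=3}^{k} Q'_{i}(s',\tilde{a}) \Big]

Each critic is then trained toward this common target y via the TD error, which is the averaging step the abstract describes.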
Abstract:
To address the underestimation problem of the twin delayed deep deterministic policy gradient (TD3) algorithm within the reinforcement learning actor-critic framework, a deep deterministic policy gradient algorithm based on the mean of multiple estimators (DDPG-MME) was proposed. The method contained one actor and k (k > 3) critics. First, the minimum of the outputs of two critics and the mean of the remaining (k-2) critics were computed; the average of these two values was then taken as the final value for calculating the TD error. Finally, the critic networks were updated from the TD error, and the actor network was updated using the value of the first critic. This weighting operation could alleviate the underestimation problem of TD3 and reduce the estimation variance to a certain extent, yielding more accurate Q-value estimation. The expectation and variance of the estimation error of the proposed method, deep deterministic policy gradient (DDPG), and TD3 were analyzed theoretically, demonstrating the accuracy and stability of the method. Experimental results in four MuJoCo continuous control environments (Reacher-v2, HalfCheetah-v2, InvertedPendulum-v2, and InvertedDoublePendulum-v2) showed that, under the same hyperparameter settings as the comparison algorithms (network structure, reward function, environment parameters, batch size, learning rate, optimizer, and discount factor), the final performance and stability of the proposed algorithm were significantly better than those of TD3 and DDPG.
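For concreteness, the target-value computation described in the abstract can be sketched in a few lines of PyTorch; the function name mme_td_target, the critic_targets list, and all argument names below are illustrative assumptions, not the authors' released implementation.

import torch

def mme_td_target(reward, next_state, next_action, critic_targets, gamma=0.99):
    """Sketch of the target value described in the abstract.

    critic_targets : list of k (k > 3) target critic networks mapping
                     (state, action) -> Q-value tensor; names are assumptions.
    """
    # Evaluate every target critic at the next state-action pair.
    qs = [q(next_state, next_action) for q in critic_targets]

    q_min = torch.min(qs[0], qs[1])            # clipped double-Q term over the first two critics
    q_mean = torch.stack(qs[2:]).mean(dim=0)   # mean of the remaining k - 2 critics

    # Average the two quantities and form the one-step TD target.
    return reward + gamma * 0.5 * (q_min + q_mean)

Each of the k critics would then be regressed toward this single target, with the actor updated from the value of the first critic, as the abstract states.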

Last Update: 2022-02-25