[1]李志辉,马莹,尚志刚,等.鸽子序贯决策中动态强化学习建模与策略演变[J].郑州大学学报(工学版),2026,47(XX):1-7.[doi:10.13705/j.issn.1671-6833.2025.05.026]
 LI Zhihui,MA Ying,SHANG Zhigang,et al.Dynamic Reinforcement Learning Modeling and Strategy Evolution in Pigeon Sequential Decision-making[J].Journal of Zhengzhou University (Engineering Science),2026,47(XX):1-7.[doi:10.13705/j.issn.1671-6833.2025.05.026]

鸽子序贯决策中动态强化学习建模与策略演变

《郑州大学学报(工学版)》[ISSN:1671-6833/CN:41-1339/T]

卷/Volume:
47
期数/Issue:
2026, XX
页码/Pages:
1-7
出版日期/Publication date:
2026-09-10

文章信息/Info

Title:
Dynamic Reinforcement Learning Modeling and Strategy Evolution in Pigeon Sequential Decision-making
作者:
李志辉1,2 马莹1,2 尚志刚1,2 杨莉芳1,2,3*
1.郑州大学 电气与信息工程学院,河南 郑州 450001;2.河南省脑科学与脑机接口技术重点实验室,河南 郑州 450001; 3.郑州大学附属脑病医院,河南 驻马店 463000
Author(s):
LI Zhihui1,2 MA Ying1,2 SHANG Zhigang1,2 YANG Lifang1,2,3*
1. School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China; 2. Henan Key Laboratory of Brain Science and Brain Computer Interface Technology, Zhengzhou 450001, China; 3. The Affiliated Encephalopathy Hospital of Zhengzhou University, Zhumadian 463000, China
关键词:
序贯决策; 强化学习; 鸽子; 学习策略; Model-Based; Model-Free
Keywords:
sequential decision-making; reinforcement learning; pigeons; learning strategies; Model-Based; Model-Free
分类号/CLC number:
Q811.211
DOI:
10.13705/j.issn.1671-6833.2025.05.026
文献标志码/Document code:
B
摘要:
生物体为了最大化未来回报,在复杂环境中需灵活调整学习策略。为探究生物体在序贯决策学习过程中学习策略的动态演变规律,以鸽子为模式动物,设计了两步序贯决策实验范式,记录了鸽子从初始探索到习得整个阶段的行为学数据,分别构建了基于奖励预测误差驱动的Model-Free(MF)与状态间关系驱动的Model-Based(MB)两类动态强化学习(reinforcement learning,RL)模型,利用实验数据对上述模型进行拟合,并系统分析了模型中关键参数—学习率(表征学习新信息的速度)、折扣率(表征对未来奖励的重视程度)和“逆温度”参数(表征决策的确定性)的动态变化特征。结果表明:鸽子在学习的早期阶段主要采用MB策略,侧重于掌握状态之间的关系并形成价值表征;随着经验的积累,逐渐转向MF策略,更直接地利用已经获得的价值信息。此外,模型参数分析显示学习过程中学习率逐渐降低,折扣率逐渐升高,“逆温度”参数也逐渐增大,表明鸽子对未来奖励的关注和决策确定性均随经验的增加而显著提升,体现了生物体在序贯决策学习中从探索环境到利用已有经验的自然转变过程。本研究不仅有助于揭示生物体在复杂环境中如何灵活调整学习策略,还为机器强化学习模型中参数的设置提供了有益的启示。
Abstract:
To maximize future rewards, organisms must flexibly adjust their learning strategies within complex environments. To investigate how learning strategies dynamically evolve during sequential decision-making, we used pigeons—a model species with robust cognitive capabilities—in a two-step sequential decision-making task. Behavioral data were collected throughout the entire learning process, from initial exploration to proficient performance. We developed two dynamic reinforcement learning (RL) models: a reward prediction error-driven Model-Free (MF) model and a state-transition relationship-driven Model-Based (MB) model. Using experimental data, we fitted these models and systematically analyzed the dynamic changes in key learning parameters, including learning rate (reflecting the speed of new information acquisition), discount factor (indicating the valuation of future rewards), and the inverse temperature parameter (representing choice certainty). Model comparisons revealed that pigeons predominantly utilized an MB strategy in early learning stages, focusing on acquiring relationships between states to form accurate value representations. With accumulated experience, pigeons progressively shifted toward the MF strategy, directly utilizing established value predictions for decision-making. Furthermore, analysis of model parameters showed that the learning rate gradually decreased, while both discount factor and inverse temperature increased over the learning period. These changes indicate that pigeons progressively place greater emphasis on future rewards and decision certainty, illustrating a natural shift from environmental exploration to exploitation of acquired knowledge. This study not only elucidates the mechanisms underlying adaptive strategy adjustments in biological systems during sequential decision-making but also provides valuable biological insights for parameter optimization in artificial reinforcement learning models.
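The abstract above refers to three recurring ingredients: Model-Free values updated by reward prediction errors, Model-Based values derived from a learned state-transition structure, and a softmax choice rule governed by a learning rate, a discount factor, and an inverse temperature. The sketch below is a minimal, self-contained illustration of these ingredients for a generic two-step task; the task structure, parameter values, and all names in the code are illustrative assumptions and do not reproduce the authors' fitted models or data.

```python
import numpy as np

# Minimal illustrative sketch (not the authors' fitted model): a generic two-step
# task with one first-stage state (2 actions), each action leading probabilistically
# to one of two second-stage states that offer 2 rewarded actions.
rng = np.random.default_rng(0)

N_TRIALS = 200
ALPHA = 0.3   # learning rate: how fast new information overwrites old estimates
GAMMA = 0.9   # discount factor: weight placed on future (second-stage) reward
BETA = 3.0    # inverse temperature: higher values give more deterministic choices

# Hypothetical task structure (assumed for illustration only).
TRANSITION = np.array([[0.8, 0.2],    # P(second-stage state | first-stage action 0)
                       [0.2, 0.8]])   # P(second-stage state | first-stage action 1)
REWARD_PROB = np.array([[0.9, 0.1],   # P(reward | second-stage state 1, action)
                        [0.1, 0.9]])  # P(reward | second-stage state 2, action)

def softmax(q, beta):
    """Map action values to choice probabilities; beta controls choice determinism."""
    p = np.exp(beta * (q - q.max()))
    return p / p.sum()

# Model-Free: cached Q-values updated by temporal-difference reward prediction errors.
q_mf = np.zeros((3, 2))        # row 0: first-stage state; rows 1-2: second-stage states
# Model-Based: learn the transition structure, then compute first-stage values by look-ahead.
t_hat = np.full((2, 2), 0.5)   # estimated P(second-stage state | first-stage action)

for _ in range(N_TRIALS):
    # First stage: Model-Based values combine learned transitions with second-stage values.
    q_mb = GAMMA * t_hat @ q_mf[1:].max(axis=1)
    a1 = rng.choice(2, p=softmax(q_mb, BETA))      # here the agent acts on its MB values
    s2 = 1 + rng.choice(2, p=TRANSITION[a1])       # second-stage state (1 or 2)

    # Second stage: choose an action and observe the (stochastic) reward.
    a2 = rng.choice(2, p=softmax(q_mf[s2], BETA))
    r = float(rng.random() < REWARD_PROB[s2 - 1, a2])

    # Model-Free updates: reward prediction errors at both stages.
    q_mf[s2, a2] += ALPHA * (r - q_mf[s2, a2])
    q_mf[0, a1] += ALPHA * (GAMMA * q_mf[s2].max() - q_mf[0, a1])

    # Model-Based update: refine the estimated state-transition probabilities.
    t_hat[a1] += ALPHA * (np.eye(2)[s2 - 1] - t_hat[a1])

print("first-stage MF values:", q_mf[0])
print("learned transition estimates:\n", t_hat)
```

In this toy setup, raising BETA makes choices more deterministic and raising GAMMA makes first-stage values track the better second-stage prospect more strongly, loosely mirroring the parameter trends described in the abstract; the actual experimental paradigm and fitting procedure are detailed in the full paper.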

参考文献/References:

[1] DAYAN P, DAW N D. Decision theory, reinforcement learning, and the brain[J]. Cognitive, Affective, & Behavioral Neuroscience, 2008, 8(4): 429-453.
[2] GUPTA N, AHIRWAL M K, ATULKAR M. Development of human decision making model with consideration of human factors through reinforcement learning and prospect utility theory[J]. Journal of Experimental & Theoretical Artificial Intelligence, 2024, 36(7): 1003-1019.
[3] SUTTON R S, BARTO A G. Reinforcement learning: an introduction [M]. London, England: The MIT Press, 2018.
[4] MILLER K J, VENDITTO S J C. Multi-step planning in the brain[J]. Current Opinion in Behavioral Sciences, 2021, 38: 29-39.
[5] DEHAENE S, SIGMAN M. From a single decision to a multi-step algorithm[J]. Current Opinion in Neurobiology, 2020, 62: 155-166.
[6] 张倩倩. 面向人机序贯决策的混合智能方法研究[D]. 合肥: 中国科学技术大学, 2021.
ZHANG Q Q. Research on hybrid intelligent method for man-machine sequential decision-making[D]. Hefei: University of Science and Technology of China, 2021.
[7] MATTAR M G, THOMPSON-SCHILL L S, BASSETT D S. The network architecture of value learning[J]. Network Neuroscience, 2018, 2(2): 128-149.
[8] 王东署, 杨凯. 基于状态转移学习的机器人行为决策认知模型[J]. 郑州大学学报(工学版), 2021,42(6): 7-13.
WANG D S, YANG K. Behavior decision-making cognitive model of mobile robot based on state transfer learning[J]. Journal of Zhengzhou University (Engineering Science), 2021,42(6): 7-13.
[9] 蒲慕明. 跨学科开启头脑风暴 促进学科交叉与融合[J]. 科学通报, 2023, 68(35): 4749-4750.
PU M M. Initiate interdisciplinary brainstorming, promote cross-disciplinary integration in neuroscience[J]. Chinese Science Bulletin, 2023, 68(35): 4749-4750.
[10] HUANG J, ZHANG Z, RUAN X. An improved Dyna-Q algorithm inspired by the forward prediction mechanism in the rat brain for mobile robot path planning[J]. Biomimetics, 2024, 9(6): 315.
[11] RESCORLA R A, WAGNER A R. A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement[M]//Classical conditioning II: current research and theory. New York: Appleton-Century-Crofts, 1972: 64-99.
[12] SUTTON R S. Learning to predict by the methods of temporal differences[J]. Machine Learning, 1988, 3(1): 9-44.
[13] 李琳, 李玉泽, 张钰嘉, 等. 基于多估计器平均值的深度确定性策略梯度算法[J]. 郑州大学学报(工学版), 2022, 43(2): 15-21.
LI L, LI Y Z, ZHANG Y J, et al. Deep deterministic policy gradient algorithm based on mean of multiple estimators[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(2): 15-21.
[14] 师黎, 陶梦妍, 李志辉. 鸽子强化学习过程中内部学习状态的动态建模研究[J]. 科学技术与工程, 2017, 17(13): 120-125.
SHI L, TAO M Y, LI Z H. Dynamic modeling of internal cognitive status of pigeon in the process of reinforcement learning[J]. Science Technology and Engineering, 2017, 17(13): 120-125.
[15] DAW N D, NIV Y, DAYAN P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control[J]. Nature Neuroscience, 2005, 8(12): 1704-1711.
[16] DOLL B B, DUNCAN K D, SIMON D A, et al. Model-based choices involve prospective neural activity[J]. Nature Neuroscience, 2015, 18(5): 767-772.
[17] MOMENNEJAD I. Learning structures: predictive representations, replay, and generalization[J]. Current Opinion in Behavioral Sciences, 2020, 32: 155-166.
[18] ESBER G R, SCHOENBAUM G, IORDANOVA M D. The Rescorla-Wagner model: it is not what you think it is[J]. Neurobiology of Learning and Memory, 2025, 217: 108021.
[19] YANG L F, JIN F L, YANG L, et al. The hippocampus in pigeons contributes to the model-based valuation and the relationship between temporal context states[J]. Animals, 2024, 14(3): 431.
[20] VENDITTO S J C, MILLER K J, BRODY C D, et al. Dynamic reinforcement learning reveals time-dependent shifts in strategy during reward learning[J]. bioRxiv, 2024: 2024.02.28.582617.

备注/Memo

收稿日期/Received: 2025-06-10; 修订日期/Revised: 2025-07-16
基金项目/Funding: National Natural Science Foundation of China (62301496); Henan Provincial Science and Technology Research Project (232102210072, 252102210008)
作者简介/About the author: LI Zhihui (1978—), female, from Zhengzhou, Henan; associate professor and Ph.D. at Zhengzhou University; her research focuses on the brain mechanisms of cognitive behavior. E-mail: lizhrain@zzu.edu.cn
通讯作者/Corresponding author: YANG Lifang (1992—), female, from Linzhou, Henan; postdoctoral researcher at Zhengzhou University; her research covers modeling of biological cognitive behavior, reinforcement learning, and neural signal detection and processing. E-mail: lifang_yang1014@zzu.edu.cn
更新日期/Last Update: 2026-01-15