Journal of Transportation Systems Engineering and Information Technology ›› 2024, Vol. 24 ›› Issue (3): 114-126. DOI: 10.16097/j.cnki.1009-6744.2024.03.012

• Intelligent Transportation Systems and Information Technology •

• About the author: CHEN Xiqun (1986- ), male, born in Heilongjiang, China; professor; Ph.D.

Coordinated Sequential Optimization for Network-wide Traffic Signal Control Based on Heterogeneous Multi-agent Transformer

CHEN Xiqun^(*a), ZHU Yizhang^b, XIE Ningke^a, GENG Maosi^a, LV Chaofeng^c

a. Institute of Intelligent Transportation Systems, College of Civil Engineering and Architecture; b. Institute of Intelligent Transportation Systems, Polytechnic Institute; c. College of Civil Engineering and Architecture, Zhejiang University, Hangzhou 310058, China
  • Received: 2024-02-26  Revised: 2024-04-03  Accepted: 2024-04-08  Online: 2024-06-25  Published: 2024-06-23
  • Supported by:
    National Natural Science Foundation of China (72171210)


Abstract: Focusing on the complexity of network-wide traffic signal control, this study proposes a coordinated sequential optimization method based on a Heterogeneous Multi-Agent Transformer (HMATLight) to improve the performance of multi-intersection signal control policies across an urban road network. First, considering the spatial correlation of multi-intersection traffic flow, a value encoder based on the self-attention mechanism is designed to learn traffic observation representations and realize network-level communication. Second, to cope with the non-stationary environment induced by multi-agent policy updates, a policy decoder based on multi-agent advantage decomposition is constructed, which sequentially outputs each agent's optimal response action conditioned on the joint actions of the preceding agents. Finally, an action-masking mechanism based on effectively moving vehicles is designed to adapt the decision frequency within a time-adequate interval, and a spatio-temporal pressure reward function accounting for waiting fairness is proposed, further improving policy performance and practicality. Experiments on Hangzhou road network datasets validate the effectiveness of the proposed method: HMATLight outperforms all baselines on both datasets across five metrics. Compared with the best-performing baseline, HMATLight reduces the average travel time by 10.89%, the average queue length by 18.84%, and the average waiting time by 22.21%. Moreover, HMATLight generalizes better and significantly reduces instances of excessively long vehicle waiting times.
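The sequential decision scheme based on multi-agent advantage decomposition can be illustrated with a minimal sketch. All names below are hypothetical, and the toy advantage function merely stands in for the paper's Transformer-based policy decoder: each agent greedily picks the signal phase that maximizes its advantage conditioned on the joint action already fixed for the preceding agents.

```python
# Minimal sketch of sequential best-response selection via multi-agent
# advantage decomposition (illustrative only; HMATLight's decoder is a
# Transformer conditioning on preceding agents' actions).

def sequential_decode(agents, phases, advantage):
    """Pick each agent's phase greedily, conditioned on the joint
    action already chosen by the preceding agents."""
    joint_action = []
    for i in agents:
        # Best response of agent i given the prefix a_1, ..., a_{i-1}.
        best = max(phases, key=lambda a: advantage(i, tuple(joint_action), a))
        joint_action.append(best)
    return joint_action

# Toy advantage: agent i prefers the phase equal to the number of
# preceding agents that chose phase 0 (purely illustrative).
def toy_adv(i, prefix, a):
    target = sum(1 for x in prefix if x == 0) % 4
    return -abs(a - target)

print(sequential_decode(range(3), range(4), toy_adv))  # → [0, 1, 1]
```

Because each agent conditions on the actions already committed by its predecessors, the joint action is built up one best response at a time rather than selected independently, which is what mitigates the non-stationarity of simultaneous policy updates.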

Key words: intelligent transportation, deep reinforcement learning, network-wide traffic signal control, heterogeneous multi-agent, spatio-temporal pressure reward
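The "spatio-temporal pressure reward" keyword builds on the classic notion of intersection pressure, i.e., the imbalance between queues on incoming and outgoing lanes. The sketch below is purely illustrative of that idea with an added waiting-fairness penalty; the paper's exact functional form may differ, and `alpha` is a hypothetical weight.

```python
# Illustrative pressure-style reward with a waiting-fairness penalty
# (an assumption for exposition, not the paper's exact formulation).

def pressure_reward(in_queues, out_queues, wait_times, alpha=0.1):
    """Negative pressure plus a penalty on the longest individual wait,
    so the agent is discouraged from starving any single approach."""
    pressure = sum(in_queues) - sum(out_queues)
    fairness_penalty = alpha * max(wait_times, default=0.0)
    return -(pressure + fairness_penalty)

print(pressure_reward([5, 3], [2, 1], [10.0, 40.0]))  # → -9.0
```

Minimizing pressure alone balances aggregate queues; the max-wait term adds the fairness dimension the abstract describes, trading a small amount of throughput for bounded individual waiting times.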
