Journal of Transportation Systems Engineering and Information Technology ›› 2026, Vol. 26 ›› Issue (1): 283-294. DOI: 10.16097/j.cnki.1009-6744.2026.01.026

• Systems Engineering Theory and Methods •

Two-stage Reinforcement Learning Method for Stacking Decisions of Import Containers

SONG Liying*1a,1b, DENG Kunqi1a,1b, NING Wu2, SONG Haitao3, LI Siwei1a,1b

  1. Beijing Jiaotong University: a. School of Traffic and Transportation; b. Integrated Transport Research Center of China, Beijing 100044, China; 2. Guangxi Beibu Gulf International Port Group Co., Ltd., Nanning 530022, China; 3. Guangxi Qinzhou Bonded Port Area Honggang Wharf Co., Ltd., Qinzhou 535008, Guangxi, China
  • Received: 2025-09-25; Revised: 2025-11-30; Accepted: 2025-12-17; Online: 2026-02-25; Published: 2026-02-17
  • About the authors: SONG Liying (1978— ), female, born in Beijing; professor, Ph.D.
  • Supported by:
    Guangxi Science and Technology Major Special Project: Research on Key Technologies for Transportation Organization Schemes of Key Cargo Types in River-Sea Intermodal Transport (桂科AA23062021-2).

Two-stage Reinforcement Learning Method for Stacking Decisions of Import Containers

SONG Liying*1a,1b, DENG Kunqi1a,1b, NING Wu2, SONG Haitao3, LI Siwei1a,1b   

  1. a. School of Traffic and Transportation; b. Integrated Transport Research Center of China, Beijing Jiaotong University, Beijing 100044, China; 2. Guangxi Beibu Gulf International Port Group Co., Ltd., Nanning 530022, China; 3. Guangxi Qinzhou Bonded Port Honggang Wharf Co., Ltd., Qinzhou 535008, Guangxi, China
  • Received: 2025-09-25; Revised: 2025-11-30; Accepted: 2025-12-17; Online: 2026-02-25; Published: 2026-02-17
  • Supported by:
    Guangxi Science and Technology Major Special Project: Research on Key Technologies for the Transportation Organization Scheme of Key Cargo Types in River-Sea Intermodal Transport (桂科AA23062021-2).

Abstract: The import container stacking problem is highly complex because of the conflict between the unloading sequence and the retrieval sequence, together with yard resource constraints. To address this challenge, this paper proposes a two-stage stacking decision method based on deep reinforcement learning for automated yards with a perpendicular layout. The method models the stacking process as a Markov decision process and introduces a phased "block decision then slot decision" structure that effectively reduces the dimensionality of the state and action spaces. Combined with differentiated reward functions, it takes balanced block utilization, the number of relocations, and the retrieval travel distance as optimization objectives. In the algorithm design, the first stage uses a Deep Q-Network (DQN) for block selection, and the second stage uses a Dueling DQN to improve slot-selection efficiency under complex states. Experimental results show that the method produces a balanced stacking strategy across the entire yard and adapts stably to different yard densities and container batch scenarios: the average relocation rate is kept within 15% to 27%, and the maximum average number of bays moved is 3.84 bays per container, about 61.5% and 38.7% lower, respectively, than actual operational data. Compared with a single-stage DQN, two-stage Proximal Policy Optimization (PPO), and a heuristic algorithm, the proposed method shows clear advantages in convergence efficiency, decision quality, and robustness. This work not only validates the effectiveness of phased modeling and differentiated reward mechanisms for complex stacking problems, but also provides a generalizable solution for scheduling and resource optimization in large-scale automated yards.

Keywords: logistics engineering, stacking decision, reinforcement learning, import containers, two-stage method

Abstract: The stacking problem of import containers is highly complex due to conflicts between unloading and retrieval sequences and yard resource constraints. This study focuses on automated vertical yards and proposes a two-stage stacking decision method based on deep reinforcement learning. The process is modeled as a Markov decision process with a phased "block selection then slot selection" structure to reduce the dimensionality of the state and action spaces. A differentiated reward function is designed: block-level decisions promote balanced yard utilization, while slot-level decisions minimize relocations and retrieval distances. In the algorithm design, a Deep Q-Network (DQN) is used for block selection and a Dueling DQN for slot selection. Simulation results show that the proposed method produces balanced strategies across the yard and adapts well to different yard densities and container batch scenarios. The average relocation rate is controlled within 15% to 27%, and the maximum average retrieval distance is 3.84 bays per container, representing reductions of about 61.5% and 38.7% compared with historical yard data. Compared with a single-stage DQN, two-stage Proximal Policy Optimization (PPO), and heuristic optimization, the proposed method achieves faster convergence, fewer relocations, and shorter retrieval paths. These results confirm the effectiveness of phased modeling and differentiated rewards in complex stacking problems and provide a practical solution for intelligent scheduling and resource optimization in large-scale automated yards.
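The two-stage decision structure and the Dueling value decomposition described in the abstract can be sketched in a few lines of Python. This is an illustrative toy only, not the paper's implementation: greedy tabular scores stand in for the trained networks, and all names and numbers (`YARD_BLOCKS`, `SLOTS_PER_BLOCK`, `choose_block`, `choose_slot`, the fill fractions) are assumptions made for the sketch.

```python
YARD_BLOCKS = 4          # stage-1 action space: which block to stack in (assumed size)
SLOTS_PER_BLOCK = 6      # stage-2 action space: which slot within that block (assumed size)

def dueling_q(value, advantages):
    """Dueling decomposition: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    Splitting Q into a state value and per-action advantages is what lets
    a Dueling DQN rank actions more stably in complex states.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def choose_block(block_fill):
    """Stage 1 (block selection): the balanced-utilization reward means
    emptier blocks get higher advantage, so a greedy policy picks them."""
    advantages = [-fill for fill in block_fill]  # emptier block -> larger advantage
    q_values = dueling_q(0.0, advantages)
    return max(range(YARD_BLOCKS), key=q_values.__getitem__)

def choose_slot(slot_scores):
    """Stage 2 (slot selection): scores proxy for fewer expected relocations
    and shorter retrieval distance; pick the highest-Q slot greedily."""
    q_values = dueling_q(0.0, slot_scores)
    return max(range(SLOTS_PER_BLOCK), key=q_values.__getitem__)

# Toy example: fractions of each block already occupied.
block_fill = [0.9, 0.4, 0.7, 0.2]
block = choose_block(block_fill)   # greedy stage 1 picks the emptiest block, index 3
slot = choose_slot([0.1, 0.5, 0.3, 0.2, 0.4, 0.0])  # stage 2 within the chosen block
```

Decomposing the decision this way is what shrinks the action space: instead of scoring every slot in the whole yard at once, stage 1 ranks only blocks and stage 2 ranks only the slots of the chosen block.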

Key words: logistics engineering, stacking decision, reinforcement learning, import containers, two-stage method

CLC Number: