Modeling and Causation Analysis Towards Road Traffic Accidents Considering Data Balance

doi:10.16097/j.cnki.1009-6744.2026.01.024

Journal of Transportation Systems Engineering and Information Technology, 2026, Vol. 26, Issue (1): 261-269. DOI: 10.16097/j.cnki.1009-6744.2026.01.024

YANG Yang^*, CHEN Guanhua, WANG Mingtao, HUANG Haibo

School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China

Received: 2025-10-17 Revised: 2025-11-14 Accepted: 2025-11-21 Online: 2026-02-25 Published: 2026-02-15
About the author: YANG Yang (b. 1988), male, from Xingtai, Hebei; associate professor, Ph.D.
Supported by:
National Natural Science Foundation of China (52572336); Beijing Natural Science Foundation (E2024210149).

Abstract

Abstract: To address the sample imbalance that is prevalent in road traffic accident data and to accurately identify the key causal factors of accident severity, this study proposes an analytical framework that combines hybrid sampling with interpretable machine learning. Because traditional models predict the minority class (severe accidents) poorly on imbalanced data, the US traffic accident dataset from Kaggle was first balanced with a hybrid method that combines ADASYN (Adaptive Synthetic Sampling) oversampling with Tomek Links undersampling. Four machine learning prediction models were then built: Logistic Regression, K-Nearest Neighbors, Decision Tree, and Random Forest. Confusion matrices were used as the evaluation framework to assess how well each model predicts the different accident severity levels. Finally, the SHAP (SHapley Additive exPlanations) algorithm was applied to analyze the key influencing factors of the best-performing model. The results show that hybrid sampling significantly improved model performance. The Random Forest model performed best, with an F1-score of 0.798, an increase of 25.7% compared with the model trained on the imbalanced data. Feature importance analysis of the Random Forest model indicates that the main influencing factors of traffic accidents are day/night condition, temperature, humidity, visibility, and wind speed, and that low visibility and high humidity tend to lead to more severe accidents. The conclusions indicate that the hybrid sampling method effectively improves the model's accuracy in identifying severe accidents. The SHAP analysis further reveals that combinations of environmental conditions such as nighttime, low visibility, and high humidity constitute the key risk scenarios that induce severe accidents, providing scientific support for precise traffic safety warning and intervention.

Key words: traffic engineering, accident causation analysis, hybrid sampling, road traffic accident, data imbalance, machine learning
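The undersampling and evaluation steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not state which software was used, the toy points, labels, and confusion matrix below are hypothetical, and the ADASYN oversampling step is omitted. A Tomek link is taken here to be a pair of samples from different classes that are each other's nearest neighbor, and the sketch removes the majority-class member of each link. In practice, the `ADASYN` and `TomekLinks` classes of the imbalanced-learn library are a common way to run both steps.

```python
import math
from collections import Counter


def nearest_neighbor(points, i):
    """Index of the point closest to points[i] (Euclidean), excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(points, labels):
    """Drop the majority-class member of every Tomek link."""
    majority = Counter(labels).most_common(1)[0][0]
    drop = {
        k
        for pair in tomek_links(points, labels)
        for k in pair
        if labels[k] == majority
    }
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


def f1_per_class(confusion):
    """Per-class F1 from a square confusion matrix (rows = true, columns = predicted)."""
    scores = []
    for c in range(len(confusion)):
        tp = confusion[c][c]
        fp = sum(row[c] for row in confusion) - tp  # predicted c, true class differs
        fn = sum(confusion[c]) - tp                 # true c, predicted class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Hypothetical toy data: label 0 = non-severe (majority), 1 = severe (minority).
points = [(0.0, 0.0), (0.0, 1.0), (1.5, 0.0), (5.0, 5.0),
          (5.0, 6.0), (2.0, 2.0), (2.2, 2.1)]
labels = [0, 0, 0, 1, 1, 0, 1]

cleaned_points, cleaned_labels = remove_majority_in_links(points, labels)
print(Counter(labels), "->", Counter(cleaned_labels))

# Hypothetical 2x2 confusion matrix, used only to show the F1 arithmetic.
print(f1_per_class([[8, 2], [1, 9]]))
```

Removing only the majority-class member of each link keeps every minority (severe accident) sample while cleaning the class boundary, which is the usual motivation for pairing Tomek Links with an oversampler.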

CLC number: U491.31

YANG Yang, CHEN Guanhua, WANG Mingtao, HUANG Haibo. Modeling and Causation Analysis Towards Road Traffic Accidents Considering Data Balance[J]. Journal of Transportation Systems Engineering and Information Technology, 2026, 26(1): 261-269.

Link to this article: http://www.tseit.org.cn/CN/10.16097/j.cnki.1009-6744.2026.01.024

http://www.tseit.org.cn/CN/Y2026/V26/I1/261
