Journal of Transportation Systems Engineering and Information Technology ›› 2026, Vol. 26 ›› Issue (1): 261-269. DOI: 10.16097/j.cnki.1009-6744.2026.01.024

• Systems Engineering Theory and Methods •

Modeling and Causation Analysis Towards Road Traffic Accidents Considering Data Balance

YANG Yang*, CHEN Guanhua, WANG Mingtao, HUANG Haibo

  1. School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2025-10-17 Revised: 2025-11-14 Accepted: 2025-11-21 Online: 2026-02-25 Published: 2026-02-15
  • About the author: YANG Yang (born 1988), male, from Xingtai, Hebei; associate professor, Ph.D.
  • Supported by:
    National Natural Science Foundation of China (52572336); Beijing Natural Science Foundation (E2024210149).

Abstract: Road traffic accident data are typically imbalanced, and traditional models trained on such data predict the minority class (severe accidents) poorly. To address this problem and to identify the key causal factors of accident severity, this study proposes an analytical framework that integrates hybrid sampling with interpretable machine learning. First, a hybrid method combining ADASYN (Adaptive Synthetic Sampling) oversampling with Tomek Links undersampling is used to balance a US traffic accident dataset from Kaggle. Next, four machine learning models (Logistic Regression, K-Nearest Neighbors, Decision Tree, and Random Forest) are built, and their ability to predict each accident severity level is evaluated with a confusion matrix. Finally, the SHAP (SHapley Additive exPlanations) algorithm is applied to analyze the key influencing factors of the best-performing model. The results show that hybrid sampling significantly improves model performance. The Random Forest model performs best, with an F1-score of 0.798, an increase of 25.7% compared with the model trained on the imbalanced data. Feature importance analysis of the Random Forest model indicates that the primary influencing factors are day/night condition, temperature, humidity, visibility, and wind speed, and that low visibility and high humidity tend to lead to more severe accidents. The conclusions indicate that the proposed hybrid sampling method effectively improves the model's identification accuracy for severe accidents. The SHAP analysis further reveals that combinations of environmental conditions such as nighttime, low visibility, and high humidity constitute the key risk scenarios for severe accidents, providing scientific support for precise traffic safety warning and intervention.

Key words: traffic engineering, accident causation analysis, hybrid sampling, road traffic accident, data imbalance, machine learning
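
The hybrid sampling step named in the abstract (ADASYN oversampling followed by Tomek Links undersampling) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes a binary label, Euclidean distance, a 1:1 balancing target, and a made-up two-dimensional toy dataset, and the function names `knn`, `adasyn`, and `remove_tomek_links` are hypothetical helpers introduced only for illustration.

```python
import math
import random

def knn(idx, points, k, candidates):
    """Indices of the k candidates nearest to points[idx], excluding idx itself."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]

def adasyn(X, y, minority=1, k=3, seed=0):
    """Simplified ADASYN: synthesize more points near hard-to-learn minority samples."""
    rng = random.Random(seed)
    all_idx = range(len(X))
    min_idx = [i for i in all_idx if y[i] == minority]
    n_needed = (len(X) - len(min_idx)) - len(min_idx)  # samples missing for 1:1 balance
    if n_needed <= 0 or len(min_idx) < 2:
        return list(X), list(y)
    # Difficulty r_i: share of majority points among the k nearest neighbours.
    r = [sum(y[j] != minority for j in knn(i, X, k, all_idx)) / k for i in min_idx]
    total = sum(r)
    if total == 0:  # no minority sample borders the majority class; nothing to adapt to
        return list(X), list(y)
    X_new, y_new = list(X), list(y)
    for i, r_i in zip(min_idx, r):
        g_i = round(r_i / total * n_needed)   # synthetic samples allotted to sample i
        neighbours = knn(i, X, k, min_idx)    # interpolate only towards minority points
        for _ in range(g_i):
            j = rng.choice(neighbours)
            lam = rng.random()
            X_new.append(tuple(a + lam * (b - a) for a, b in zip(X[i], X[j])))
            y_new.append(minority)
    return X_new, y_new

def remove_tomek_links(X, y, minority=1):
    """Drop the majority-class member of every Tomek link.

    A Tomek link is a pair of opposite-class points that are each other's
    nearest neighbour.
    """
    all_idx = range(len(X))
    nearest = [knn(i, X, 1, all_idx)[0] for i in all_idx]
    drop = set()
    for i in all_idx:
        j = nearest[i]
        if nearest[j] == i and y[i] != y[j]:
            drop.add(i if y[i] != minority else j)
    keep = [i for i in all_idx if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (illustrative only): 8 majority points (label 0), 4 minority points (label 1).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (0, 2), (1, 2),
     (2.3, 1.2), (3, 3), (3.5, 3), (3, 3.6)]
y = [0] * 8 + [1] * 4

X_over, y_over = adasyn(X, y)
X_bal, y_bal = remove_tomek_links(X_over, y_over)
print("original :", y.count(0), "majority /", y.count(1), "minority")
print("ADASYN   :", y_over.count(0), "majority /", y_over.count(1), "minority")
print("+ Tomek  :", y_bal.count(0), "majority /", y_bal.count(1), "minority")
```

In this toy case only the minority point that sits next to the majority cluster receives synthetic neighbours, which is the adaptive behaviour that distinguishes ADASYN from uniform oversampling. The Tomek Links pass then removes only majority points on the class boundary. A real multi-class severity label would need the procedure applied per class, which this sketch does not attempt.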

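
The attribution idea behind SHAP can be sketched by computing exact Shapley values for a tiny model by brute force. The sketch below rests on several assumptions: the scoring function `toy_risk` and its coefficients are invented for illustration and are not taken from the paper, "absent" features are replaced by a fixed baseline rather than averaged over background data as the SHAP library does, and the exhaustive subset enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition take their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_risk(features):
    """Hypothetical severity score with a night x low-visibility interaction."""
    night, low_visibility, high_humidity = features
    return (0.2 * night + 0.3 * low_visibility + 0.1 * high_humidity
            + 0.4 * night * low_visibility)

names = ["night", "low_visibility", "high_humidity"]
x = [1, 1, 1]          # scenario to explain: all three conditions present
baseline = [0, 0, 0]   # reference scenario: none present
phi = shapley_values(toy_risk, x, baseline)
for name, contribution in zip(names, phi):
    print(f"{name}: {contribution:.3f}")
```

The contributions sum to the difference between the score for the explained scenario and the baseline score, and the interaction term is split evenly between the two features that form it. This is how an attribution method of this kind can surface a combination of conditions, rather than a single factor, as a risk scenario.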