Journal of Transportation Systems Engineering and Information Technology ›› 2024, Vol. 24 ›› Issue (1): 132-137. DOI: 10.16097/j.cnki.1009-6744.2024.01.013

• Intelligent Transportation Systems and Information Technology •

Rear-end Crash Data Imputation Methods Using Generative Adversarial Networks

ZHOU Bei*1, ZHANG Ying2, ZHANG Shengrui1, ZHOU Qianxi1, WANG Qin1

  1. College of Transportation Engineering, Chang'an University, Xi'an 710064, China; 2. Beijing Tsinghua Tongheng Planning and Design Institute Co. Ltd, Beijing 100085, China
  • Received: 2023-09-26 Revised: 2023-10-17 Accepted: 2023-10-23 Online: 2024-02-25 Published: 2024-02-12
  • About the author: ZHOU Bei (1986- ), male, from Jiyuan, Henan Province; associate professor, Ph.D.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (52102404); Fundamental Research Funds for the Central Universities of Ministry of Education of China, CHD (300102343204)

Abstract: In-depth analysis of traffic crash data provides an important theoretical basis for preventing crashes and reducing their severity. However, data are often lost during collection, transmission, and storage, which lowers the accuracy of statistical analyses and raises the risk of model misjudgment. This study examines 101,452 rear-end crashes recorded in Chicago from 2016 to 2021, randomly split into training and test sets at a ratio of 7:3. Missing values in the training set were imputed with a Generative Adversarial Imputation Network (GAIN). For comparison, four other methods were applied to the same dataset: Multiple Imputation by Chained Equations (MICE), Expectation Maximization (EM) imputation, MissForest, and K-Nearest Neighbor (KNN). The change in each variable's variance before and after imputation was used to compare how the methods affect data variability. A three-class LightGBM model was then built to analyze the factors influencing crash severity. The model was trained separately on the original training set and on each imputed training set, and every trained model was evaluated on the test set, which was left unimputed. The results show that imputing missing values improved model performance to some extent. Compared with the model trained on the original data, the model trained on the GAIN-imputed data improved accuracy by 6.84%, the F1 score by 4.61%, and the AUC (Area Under the Curve) by 10.09%, exceeding the improvements achieved with the other four imputation methods.

Key words: urban traffic, data imputation, generative adversarial networks, rear-end crashes, LightGBM model
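
The variance comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic rather than the Chicago crash records, the KNN fill is a simplified from-scratch version, mean imputation is added only as a naive baseline that the study does not use, and GAIN, MICE, EM, and MissForest are not reproduced. All function names (`make_data`, `mask_column`, `mean_impute`, `knn_impute`, `variance_change`) are hypothetical.

```python
import random
import statistics

def make_data(n, seed=0):
    """Synthetic two-column rows; column 1 depends on column 0 (not real crash data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.gauss(0.0, 1.0)
        rows.append([x0, 2.0 * x0 + rng.gauss(0.0, 0.5)])
    return rows

def mask_column(rows, col, rate, seed=1):
    """Copy rows, setting roughly `rate` of column `col` to None at random."""
    rng = random.Random(seed)
    masked = []
    for r in rows:
        r = list(r)
        if rng.random() < rate:
            r[col] = None
        masked.append(r)
    return masked

def mean_impute(rows, col):
    """Naive baseline: fill every gap in `col` with the mean of the observed values."""
    mu = statistics.fmean([r[col] for r in rows if r[col] is not None])
    return [[mu if (j == col and v is None) else v for j, v in enumerate(r)]
            for r in rows]

def knn_impute(rows, col, k=5):
    """Fill each gap in `col` with the mean over the k nearest complete rows.

    Distance is squared Euclidean over the other columns, assumed complete here.
    """
    donors = [r for r in rows if r[col] is not None]
    others = [j for j in range(len(rows[0])) if j != col]
    filled = []
    for r in rows:
        r = list(r)
        if r[col] is None:
            nearest = sorted(
                donors, key=lambda d: sum((d[j] - r[j]) ** 2 for j in others)
            )[:k]
            r[col] = statistics.fmean([d[col] for d in nearest])
        filled.append(r)
    return filled

def variance_change(before, after, col):
    """Relative variance change: observed values before vs. all values after."""
    v_before = statistics.pvariance([r[col] for r in before if r[col] is not None])
    v_after = statistics.pvariance([r[col] for r in after])
    return (v_after - v_before) / v_before

# 7:3 random split; only the training part is masked and imputed.
rows = make_data(1000)
random.Random(42).shuffle(rows)
cut = len(rows) * 7 // 10
train, holdout = rows[:cut], rows[cut:]

train_missing = mask_column(train, col=1, rate=0.3)
mean_change = variance_change(train_missing, mean_impute(train_missing, 1), 1)
knn_change = variance_change(train_missing, knn_impute(train_missing, 1), 1)
print(f"mean imputation variance change: {mean_change:+.1%}")
print(f"KNN imputation variance change:  {knn_change:+.1%}")
```

In this toy setup, mean imputation shrinks the variance by the fraction of values that were missing, whereas the neighbor-based fill should stay much closer to the observed variance. That kind of difference is what a before/after variance comparison is meant to expose.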
