full screen

DrivingGen
A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Overview

DrivingGen is a comprehensive benchmark for generative world models in the driving domain with a diverse data distribution and novel evaluation metrics. DrivingGen evaluates models from both a visual perspective (the realism and overall quality of generated videos) and a robotics perspective (the physical plausibility of generated trajectories).

Benchmark Overview

Overview of our DrivingGen benchmark. Video models take vision, and optional language/action as inputs to generate videos. The generated videos are then passed into our evaluation suite. Four comprehensive and novel sets of metrics for both videos and trajectories (distribution, quality, temporal consistency, and trajectory alignment) are introduced to evaluate world models.

Distribution

How far is the generative distribution from the data distribution? We evaluate distributional closeness across both videos and trajectories, capturing complementary perspectives from visual perception and robotics.

FVD quantifies the similarity between generated videos and real videos.

Quality

How good are the generated videos and trajectories? To evaluate the fidelity of generated videos and trajectories in driving scenarios, we propose a comprehensive quality suite covering three aspects: subjective image quality, objective image quality, and trajectory quality.

we adopt CLIP-IQA+, which leverages CLIP's vision-language representations to predict perceptual quality scores consistent with human subjective assessments.

Temporal Consistency

How temporally consistent is the generated world? We assess the temporal consistency of both videos and trajectories. For videos, we evaluate scene-level consistency, agent-level consistency, and explicitly emphasize abnormal agent disappearance. For trajectories, we measure the consistency of speed and acceleration over time.

Video consistency measures temporal consistency across consecutive frames of adaptively downsampled videos with comparabble optical flow magnitude.

Trajectory Alignment

The alignment of the trajectories underlying the generated videos with the conditioning (ego) trajectory is also critical, especially for trajectory-grounded video generation. To assess this, we propose two complementary metrics.

ADE measures the mean pointwise distance between the generated and input trajectories across the prediction horizon.

Abstract

we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset—curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers—with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.

Dataset

We organize our dataset into two complementary tracks, offering distinct perspectives for evaluating driving videos. Open-Domain Track is designed to evaluate models’ generalization to open-domain, diverse, unseen driving scenarios. Ego-Conditioned Track complements the open-domain track, focusing on trajectory controllability, measuring how well the trajectories derived from generated videos align with the given ego-trajectory instructions. we show the data distribution and representative examples of the open-domain track in our benchmark. Please see for the entire dataset.
Dataset distribution Dataset geographic cities Dataset gallery
▶ For the open-domain track, we construct this track using Internet-sourced data spanning multiple cities and regions worldwide, ensuring broad coverage beyond the training distribution. 1) We limit normal weather and daytime clips to below 60%. 2) We collect data from a wide range of regions worldwide. 3) We cover diverse scenarios such as dense city traffic at night, unusual weather, and complex interactions.
▶ For the ego-conditioned track, we aggregate data from five open-source driving datasets: Zod (Europe), DrivingDojo (China), COVLA (Japan), nuPlan (US), and WOMD (US) to measure how well the trajectories derived from generated videos align with the given ego-trajectory instructions.

Evaluation Results

Evaluation results of 14 generative world models on our benchmark. We provide the full table of metrics in a transparent way to evaluate the models comprehensively and the average rank serves as a quick summary but not a definitive score. Best results are highlighted in red region, second best in orange region, and third best in blue region. “*” indicates commercial closed-source models. Models fall into four categories: closed-source, open-source general video models, physical-world models, and driving-specific models.
Open-Domain Track
Models
Size Distribution Quality Temporal Consistency Avg. Rank
FVD FTD Subjective
Quality
Objective
Quality
Trajectory
Quality
Video
Consist.
Agent
Consist.
Agent
Missing
Trajectory
Consist.
Kling 2.1* - 693.4 26.73 0.5538 0.8018 0.6438 0.8945 0.7981 0.9442 0.5377 1
Gen-3 Alpha Turbo* - 801.0 93.50 0.5456 0.8378 0.6535 0.8900 0.8170 0.9495 0.4788 2
LTX-Video 13B 648.2 31.29 0.5215 0.8288 0.5562 0.8851 0.7449 0.8977 0.4517 3
Wan2.2-I2V 14B 609.0 63.86 0.5348 0.6396 0.5983 0.8883 0.7514 0.9128 0.4639 4
HunyuanVideo-I2V 13B 957.5 30.95 0.4921 0.7207 0.4613 0.8821 0.8008 0.9306 0.4157 5
SkyReels-V2-I2V 14B 876.0 52.93 0.5134 0.7432 0.4799 0.8776 0.7329 0.9078 0.4326 7
CogVideoX 5B 621.2 236.7 0.4932 0.6802 0.3856 0.8211 0.7581 0.7661 0.2949 12
Cosmos-Predict2 14B 524.1 83.20 0.4931 0.7568 0.5990 0.8597 0.5912 0.8657 0.3997 8
Cosmos-Predict1 14B 821.1 81.22 0.5083 0.7207 0.2723 0.8429 0.6789 0.8796 0.2631 13
Vista 2.5B 675.7 54.66 0.4340 0.8468 0.6030 0.8565 0.6357 0.8211 0.4040 6
VaViM 1.2B 1446.6 449.2 0.4691 0.8468 0.3118 0.9159 0.7721 0.9752 0.0914 9
UniFuture 3.0B 774.3 50.66 0.4206 0.9054 0.4507 0.8799 0.5373 0.8310 0.3858 10
GEM 2.1B 770.1 147.1 0.5168 0.8423 0.5398 0.8176 0.6099 0.7788 0.3392 11
Drivingdojo 2.3B 810.4 126.74 0.4202 0.8333 0.4511 0.8480 0.6256 0.8303 0.2739 14
Ego-Conditioned Track
Models
Size Distribution Quality Temporal Consistency Trajectory Alignment Avg. Rank
FVD FTD Subjective
Quality
Objective
Quality
Trajectory
Quality
Video
Consist.
Agent
Consist.
Agent
Missing
Trajectory
Consist.
ADE DTW
Kling 2.1* - 320.5 23.74 0.5468 0.7838 0.6860 0.8929 0.8186 0.9712 0.5430 29.97 2310 1
Gen-3 Alpha Turbo* - 555.9 24.72 0.5740 0.8604 0.6770 0.8747 0.7986 0.9466 0.4800 33.39 2749 3
Wan2.2-I2V 14B 194.4 29.56 0.5084 0.6982 0.6419 0.8821 0.7561 0.9034 0.4849 27.39 1901 2
LTX-Video 13B 378.1 61.09 0.4895 0.8604 0.5464 0.8705 0.7708 0.9020 0.4442 32.12 2505 6
HunyuanVideo-I2V 13B 532.9 21.18 0.4741 0.6847 0.5542 0.8792 0.8240 0.9415 0.4771 33.80 2794 7
CogVideoX 5B 307.1 166.6 0.4884 0.6937 0.4252 0.8167 0.7541 0.8981 0.3783 32.67 2413 10
SkyReels-V2-I2V 14B 428.2 57.02 0.4764 0.6622 0.5028 0.8661 0.7208 0.8750 0.4322 31.54 2594 11
Cosmos-Predict2 14B 260.5 56.26 0.4756 0.8198 0.6424 0.8428 0.6707 0.8986 0.4108 22.38 1490 4
Cosmos-Predict1 14B 345.2 34.96 0.4783 0.7505 0.3761 0.8229 0.7423 0.7961 0.3343 34.47 3084 13
Vista 2.5B 392.8 27.33 0.4146 0.8198 0.6047 0.8741 0.6417 0.8676 0.4366 19.70 1216 5
UniFuture 3.0B 654.6 37.17 0.4006 0.9685 0.5353 0.8759 0.5525 0.8759 0.4165 20.21 1352 8
VaViM 1.2B 1222 103.6 0.4910 0.8694 0.1936 0.9428 0.8290 0.9725 0.0984 41.92 3863 9
Drivingdojo 2.3B 586.5 35.73 0.4264 0.8198 0.4131 0.8419 0.6940 0.8439 0.2776 25.50 2142 12
GEM 2.1B 579.9 97.70 0.4484 0.8018 0.5085 0.7886 0.6180 0.7463 0.2983 25.73 1982 14

• This website is built upon WorldScore. We thank the authors of WorldScore that kindly open sourced the template of this website.