End-to-End (E2E) Autonomous Driving (2025)
End-to-End Autonomous Driving Research Report, 2025
Product Code: 1744404
Research Firm: ResearchInChina
Publication Date: May 2025
Pages: 500 pages (English)
License & Price (VAT excluded)
US $ 4,500 ₩ 6,547,000
Unprintable PDF (Single User License)
This license allows a single user to access the PDF report. Printing is not permitted, and copying & pasting of text is disabled.
US $ 6,800 ₩ 9,893,000
Printable & Editable PDF (Enterprise-wide License)
This license allows all employees of the same company to access the PDF report. Printing is permitted, and the scope of use for printed copies is the same as for the PDF.


Korean Table of Contents

The essence of end-to-end autonomous driving lies in mimicking driving behaviors through large-scale, high-quality human driving data. From a technical perspective, imitation learning-based approaches can approach human-level driving performance but struggle to transcend human cognitive limits. In addition, the scarcity of high-quality scenario data and uneven data quality in driving datasets make it extremely difficult for end-to-end solutions to achieve human-level capabilities. The high scalability barrier is a further obstacle, as these systems typically require millions of high-quality driving clips for training.

Notably, RL frameworks excel at autonomously generating reasoning chains in interactive environments, enabling large models to develop Chain-of-Thought (CoT) capabilities. This greatly improves logical reasoning efficiency and opens up the potential to go beyond human cognitive limits. By interacting with simulation environments generated by world models, end-to-end autonomous driving models gain a deeper understanding of real-world physical rules. This RL-based technical approach offers a new path for algorithm development and shows the potential to break through the limitations of traditional imitation learning.

I Transition of E2E Models towards the VLA Paradigm

E2E models directly map visual inputs to driving trajectory outputs via neural networks. However, they lack an intrinsic understanding of physical world dynamics and operate without explicit semantic comprehension or logical reasoning. They cannot interpret language commands, traffic rules, or textual information, and their limited 3D spatial perception weakens generalization in long-tail scenarios.

The Vision-Language-Action (VLA) paradigm introduces critical improvements by integrating Large Language Models (LLMs) into the architecture. This transforms the originally single-modality vision-action system into a multimodal framework combining vision, language, and action. The introduction of LLMs injects human-like common sense and logical reasoning into autonomous driving systems, shifting them from data-driven "weak AI" to cognitive intelligence-driven "generalist systems."

II Training Process of VLA Models and Application of Reinforcement Learning

In the post-training of Large Language Models (LLMs), reinforcement learning (RL) is being adopted more and more widely. For example, DeepSeek-R1, a model that drew attention this year, used RL as its core training method. By designing appropriate reward mechanisms, RL effectively activated the foundation model's reasoning capabilities. This technical advantage, first demonstrated in language models, has begun to attract attention in the autonomous driving industry, with multiple manufacturers integrating RL into their ADAS technologies.

The training of VLA models is divided into two stages: "pre-training of the base model" and "domain fine-tuning". In the pre-training stage, massive data endows the model with general cognitive abilities such as contextual understanding and logical reasoning. In the fine-tuning stage for the intelligent driving domain, basic driving rules (such as lane keeping and obstacle recognition) are first established through supervised learning, and core capabilities are then upgraded through reinforcement learning (RL). Reinforcement learning draws on successful experience in natural language processing (such as RLHF alignment with human preferences) and optimizes decision-making in driving scenarios through an "open-loop + closed-loop" mechanism: the open-loop stage uses historical takeover data to calibrate safety logic, while the closed-loop stage simulates extreme operating conditions with virtual scene generation technology (such as world models), allowing the model to actively trial-and-error and iteratively refine its strategy, overcoming the traditional end-to-end model's reliance on massive labeled data.

Through its survey and analysis of the Chinese automotive industry, this report provides information on end-to-end (E2E) autonomous driving technology roadmaps and development trends, domestic and international suppliers, and more.

Table of Contents

Chapter 1: Foundations of End-to-End Intelligent Driving Technology

Chapter 2: Technology Roadmap and Development Trends of E2E-AD

Chapter 3: End-to-End Intelligent Driving Suppliers

Chapter 4: E2E Intelligent Driving Layout of OEMs

English Table of Contents

End-to-End Autonomous Driving Research: E2E Evolution towards the VLA Paradigm via Synergy of Reinforcement Learning and World Models

The essence of end-to-end autonomous driving lies in mimicking driving behaviors through large-scale, high-quality human driving data. From a technical perspective, while imitation learning-based approaches can approach human-level driving performance, they struggle to transcend human cognitive limits. Additionally, the scarcity of high-quality scenario data and uneven data quality in driving datasets make it extremely challenging for end-to-end solutions to reach human-level capabilities. The high scalability threshold further complicates progress, as these systems typically require millions of high-quality driving clips for training.

Following the industry buzz around the DeepSeek-R1 model in early 2025, its innovative reinforcement learning (RL)-only technical path demonstrated unique advantages. This approach achieves cold startup with minimal high-quality data and employs a multi-stage RL training mechanism, effectively reducing dependency on data scale for large model training. This extension of the "scaling laws" enables continuous model expansion. Innovations in RL can also be transferred to end-to-end autonomous driving, enhancing environmental perception, path planning, and decision-making with greater precision. This lays the foundation for building larger, more capable intelligent models.

Crucially, RL frameworks excel at autonomously generating reasoning chains in interactive environments, enabling large models to develop Chain-of-Thought (CoT) capabilities. This significantly improves logical reasoning efficiency and even unlocks potential beyond human cognitive constraints. By interacting with simulation environments generated by world models, end-to-end autonomous driving models gain deeper understanding of real-world physical rules. This RL-driven technical path offers a novel approach to algorithm development, promising to break traditional imitation learning limitations.

I Transition of End-to-End Models towards the VLA Paradigm

End-to-end models directly map visual inputs to driving trajectory outputs via neural networks. However, lacking intrinsic understanding of physical world dynamics, these models operate without explicit semantic comprehension or logical reasoning. They fail to interpret verbal commands, traffic rules, or textual information. Furthermore, their limited 3D spatial perception restricts generalization in long-tail scenarios.

The Visual-Language-Action (VLA) paradigm introduces critical improvements by integrating Large Language Models (LLMs) into the architecture. This transforms the original single-modality vision-action system into a multimodal framework combining vision, language, and action. The inclusion of LLMs injects human-like common sense and logical reasoning into autonomous driving systems, transitioning from data-driven "weak AI" to cognitive intelligence-driven "generalist systems."

VLA Input: Signals received from cameras, navigation systems, maps, and other devices. These signals are processed by two encoders:

Vision Encoder: Encodes image data to extract high-level features of the road environment.

Text Encoder: Processes text information generated from human-vehicle interactions, such as voice commands or parameter settings.

VLA Output:

Trajectory Decoder: Converts model-generated information into specific trajectory signals, outlining the vehicle's driving plan for the next 10 to 30 seconds, including speed control and route details.

Text Decoder: Simultaneously generates natural language explanations for the decisions. For example, when detecting a pedestrian crossing the road, the system not only plans a deceleration-and-stop trajectory but also outputs a textual explanation like "Pedestrian crossing detected; slowing down and stopping." This ensures transparency in decision-making.
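
To make the input/output flow above concrete, the following is a minimal, hypothetical sketch of a VLA pipeline in Python. The module names (VisionEncoder, TextEncoder, trajectory/text decoders), dimensions, and the simple late-fusion scheme are assumptions for illustration only, not the report's or any vendor's actual implementation.

```python
# Illustrative sketch only: a toy VLA forward pass with the four components
# described above. All module names, sizes, and the fusion scheme are assumptions.
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    def __init__(self, d_model=256, horizon_steps=30, vocab_size=1000):
        super().__init__()
        # Vision encoder: extracts high-level road-environment features from images.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model),
        )
        # Text encoder: embeds tokenized voice commands / navigation text.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Trajectory decoder: outputs (x, y, speed) waypoints over the plan horizon.
        self.traj_decoder = nn.Linear(2 * d_model, horizon_steps * 3)
        # Text decoder: emits token logits for a natural-language explanation.
        self.text_decoder = nn.Linear(2 * d_model, vocab_size)
        self.horizon_steps = horizon_steps

    def forward(self, images, command_tokens):
        v = self.vision_encoder(images)                      # (B, d_model)
        _, h = self.text_encoder(self.text_embed(command_tokens))
        fused = torch.cat([v, h[-1]], dim=-1)                # simple late fusion
        traj = self.traj_decoder(fused).view(-1, self.horizon_steps, 3)
        explanation_logits = self.text_decoder(fused)        # one step of explanation tokens
        return traj, explanation_logits

model = ToyVLA()
traj, expl = model(torch.randn(2, 3, 128, 128), torch.randint(0, 1000, (2, 8)))
print(traj.shape, expl.shape)  # torch.Size([2, 30, 3]) torch.Size([2, 1000])
```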

Core Breakthroughs of VLA

World Model Construction: VLA extracts rich environmental information from sensor data, leverages language models to interpret human instructions, and generates explainable decision-making processes. It then translates multimodal inputs into actionable driving commands.

Chain-of-Thought (CoT) Reasoning: VLA autonomously generates reasoning chains in interactive environments, enabling logical inference capabilities that surpass human cognitive limits. With the support of large models, VLA enhances visual and spatial understanding beyond traditional end-to-end approaches.

By 2030, VLA-centric end-to-end solutions are projected to capture over 50% of the L3/L4 autonomous driving market, reshaping the value chain of traditional Tier-1 suppliers.

Li Auto's MindVLA Evolution Path

In 2025, Li Auto integrated End-to-End (E2E) and Visual Language Model (VLM) approaches into an advanced Vision-Language-Action (VLA) architecture. This shift addressed critical limitations in the previous dual-system framework:

Constraint 1: The dual-system pipeline processed inputs (cameras, LiDAR, vehicle pose, navigation) through a 3D encoder and directly output trajectories via an action decoder. However, its discriminative AI-based E2E model lacked generalization and common-sense reasoning and struggled with long-tail scenarios. While architecturally simple, it failed at complex spatial understanding and language interaction.

Constraint 2: System 2 (semantic assistant) supported System 1 (E2E) but suffered from technical shortcomings. Existing VLMs relied on 2D front-camera data, neglecting omnidirectional inputs from surround-view and rear cameras. Additionally, System 2 lacked robust 3D spatial understanding, a core requirement for trajectory planning.

Under the VLA architecture, V (Spatial Intelligence) handles 3D data processing and spatial understanding via a 3D tokenizer; L (Linguistic Intelligence) uses the MindGPT large language model to fuse spatial tokens and process semantic information; and A (Action Policy) integrates decisions through a collective action generator to produce action trajectories. The MindVLA architecture strengthens the tokenization of spatial information (3D Tokenizer), the scene understanding of the language model (MindGPT), and collective action generation (Collective Action Generator). This allows VLA to retain strong spatial-language reasoning while jointly modeling and aligning the vision, language, and action modalities in a unified space, which is expected to meet the intelligent decision-making requirements of future complex scenarios.

II Training Process of VLA Models and Application of Reinforcement Learning

In post-training of Large Language Models (LLMs), reinforcement learning (RL) has become increasingly prevalent. For instance, DeepSeek-R1, a standout model this year, leveraged RL as a core training method. By designing appropriate reward mechanisms, RL effectively activated the foundational model's reasoning capabilities. This technical advantage, initially proven in language models, has now drawn attention in the autonomous driving industry, with multiple manufacturers integrating RL into their ADAS technologies.

The training of the VLA model is divided into two stages: "pre-training of the base model" and "domain fine-tuning". In the pre-training stage, the model acquires general cognitive abilities, such as contextual understanding and logical reasoning, from massive data. In the fine-tuning stage for the intelligent driving domain, basic driving rules (such as lane keeping and obstacle recognition) are first established through supervised learning, and key upgrades are then completed with reinforcement learning (RL). Reinforcement learning draws on successful experience in natural language processing (such as RLHF alignment with human preferences) and optimizes decision-making in driving scenarios through an "open-loop + closed-loop" mechanism: the open-loop stage uses historical takeover data to calibrate safety logic, while the closed-loop stage simulates extreme working conditions through virtual scene generation technology (such as world models), allowing the model to actively trial-and-error and iterate its strategy, breaking through the traditional end-to-end model's reliance on massive amounts of labeled data.
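
As a rough illustration of how these stages chain together, here is a schematic Python sketch. Every function below is a hypothetical placeholder standing in for an entire training phase; the stage ordering follows the description above rather than any specific vendor pipeline.

```python
# Schematic sketch of the staged VLA training flow described above.
# All functions are hypothetical placeholders, shown only to make the
# stage ordering (pre-train -> SFT -> open-loop RL -> closed-loop RL) explicit.

def pretrain(model, corpus):            # general cognition: context, logical reasoning
    return model

def supervised_finetune(model, demos):  # basic rules: lane keeping, obstacle recognition
    return model

def open_loop_rl(model, takeover_logs): # calibrate safety logic from human takeover data
    return model

def closed_loop_rl(model, world_model): # trial-and-error in simulated extreme scenarios
    return model

def train_vla(base_model, corpus, demos, takeover_logs, world_model):
    model = pretrain(base_model, corpus)
    model = supervised_finetune(model, demos)
    model = open_loop_rl(model, takeover_logs)
    model = closed_loop_rl(model, world_model)
    return model
```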

Imitation Learning (IL)

Behavior cloning (BC), the core strategy in imitation learning, formulates a policy by learning from the driving trajectories of experts such as human drivers. In the field of intelligent driving, this method mainly relies on analyzing large amounts of driving data to imitate human driving behavior. Its advantage is that it is simple to implement and computationally efficient, but its shortcoming is also obvious: it struggles to handle unseen special scenarios or abnormal situations.

From the perspective of the training mechanism, behavior cloning adopts an open-loop approach and relies on driving demonstration data drawn from the training distribution. Real driving, however, is a typically closed-loop process: subtle deviations at each step can accumulate over time, compounding into errors and pushing the vehicle into unseen situations. As a result, policies trained by behavior cloning often perform poorly in unfamiliar situations, and their robustness has drawn industry attention.
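
For reference, behavior cloning reduces to supervised regression on expert trajectories. The snippet below is a minimal, generic sketch (the policy network and tensor shapes are illustrative assumptions); it also makes the open-loop nature visible, since the loss never involves states the policy itself would reach while driving.

```python
# Minimal behavior-cloning step: supervised regression onto expert waypoints.
# The policy network and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 30 * 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(obs_features, expert_waypoints):
    """obs_features: (B, 64) encoded observation; expert_waypoints: (B, 30, 2)."""
    pred = policy(obs_features).view(-1, 30, 2)
    # Open-loop objective: imitate the expert only on states from the demo
    # distribution, which is why small errors can compound once the policy drives itself.
    loss = nn.functional.mse_loss(pred, expert_waypoints)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(bc_step(torch.randn(8, 64), torch.randn(8, 30, 2)))
```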

The principle of reinforcement learning (RL) is to optimize action policies through reward functions:

The reinforcement learning model interacts continuously with simulated traffic scenes and relies on the reward mechanism to adjust and optimize its driving policy. In this way, the model can learn more reasonable decisions in complex, dynamic traffic environments. However, reinforcement learning has obvious shortcomings in practical applications: on the one hand, training efficiency is low, and extensive trial and error is required to obtain a usable model; on the other hand, it cannot be trained directly in real road environments, since real driving cannot afford frequent trial and error and the cost would be too high. Most current simulation training is based on sensor data generated by game engines, which draws on ground-truth object information rather than real sensor input, resulting in a gap between simulation results and actual scenes.

Another problem is human behavior alignment: the exploration process in reinforcement learning may cause the model's policy to deviate from human driving habits and behave incoherently. To address this, imitation learning is often integrated as a regularization term during RL training, incorporating human driving data to align policies with human behavior.
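
A generic way to express the regularization idea above is a combined loss: a reward-driven policy-gradient term plus a behavior-cloning penalty toward human data. The sketch below (REINFORCE-style, with made-up shapes, a discrete maneuver space, and a hypothetical lambda_bc weight) illustrates the idea only; it is not any production objective.

```python
# Illustrative combined objective: policy-gradient term driven by simulator reward,
# plus a behavior-cloning regularizer that keeps the policy near human trajectories.
# Shapes, networks, and the lambda_bc weight are assumptions for illustration.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 5))  # 5 discrete maneuvers
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
lambda_bc = 0.1

def rl_with_il_step(obs, actions, returns, human_obs, human_actions):
    # Policy-gradient term on rollout data collected in the simulator:
    # raise the log-probability of actions in proportion to their return.
    dist = torch.distributions.Categorical(logits=policy(obs))
    pg_loss = -(dist.log_prob(actions) * returns).mean()
    # Imitation regularizer: cross-entropy toward recorded human actions.
    bc_loss = nn.functional.cross_entropy(policy(human_obs), human_actions)
    loss = pg_loss + lambda_bc * bc_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(rl_with_il_step(torch.randn(8, 64), torch.randint(0, 5, (8,)), torch.randn(8),
                      torch.randn(8, 64), torch.randint(0, 5, (8,))))
```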

Li Auto's MindVLA Training Methodology

Stage I: The training process of Li Auto's VLA model is divided into four stages: pre-training of the VL (vision-language) base model, assisted-driving post-training, assisted-driving reinforcement learning, and driver agent construction. Among them, pre-training of the VL base model is the core link in the entire training system. In the earlier dual-system stage, Li Auto used Alibaba's Tongyi Qianwen Qwen-VL vision-language model; when developing the latest VL base model, it partially integrated DeepSeek language model capabilities, which, according to Li Xiang, effectively shortened the nine-month R&D cycle and saved hundreds of millions of yuan in development costs.

Building on the pre-trained base model, Li Auto further optimizes the technology and, through model distillation, produces a small vehicle-side model with 3.6 billion parameters to meet the deployment requirements of the in-vehicle computing platform.

Stage II & III:

The ultimate goal of the VLA model trained in the cloud is to run on the vehicle platform. Given the gap between vehicle-side and cloud computing power, the cloud model must be distilled and optimized with model compression techniques such as pruning and quantization. Li Auto's specific method is as follows: after training the VL base model with 32 billion parameters, it is first distilled into a 4 billion parameter model adapted to the vehicle's computing power. Reinforcement learning training is then carried out on this basis to ensure that the model not only meets the operating requirements of the vehicle computing platform but also retains sufficient decision-making ability.
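
Model distillation of the kind described, a large cloud teacher compressed into a small vehicle-side student, is commonly implemented with a temperature-softened KL loss between teacher and student outputs. The snippet below is a generic sketch with toy model sizes and hyperparameters; it is not Li Auto's actual procedure.

```python
# Generic knowledge-distillation step: the small "vehicle-side" student is trained
# to match the softened output distribution of the large "cloud" teacher.
# Model sizes, temperature, and the loss form are toy choices for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(64, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()
student = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
T = 2.0  # temperature: softens the teacher distribution

def distill_step(batch):
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    # KL divergence between temperature-softened distributions (scaled by T^2 by convention).
    loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(distill_step(torch.randn(16, 64)))
```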

The second and third stages of Li Auto's VLA model training, assisted-driving post-training and reinforcement learning, can be seen as fine-tuning the base model for the intelligent driving domain. The post-training stage adopts the open-loop imitation learning approach of traditional end-to-end solutions, while the reinforcement learning stage combines open-loop and closed-loop modes and constitutes the core improvement in VLA model training.

Specifically:

Open-loop reinforcement learning: Using a reinforcement learning from human feedback (RLHF) mechanism, the main goal is to align driving strategies with human driving habits and safety standards. Li Auto trains on its accumulated human takeover data so that the model clearly distinguishes between "reasonable operation" and "dangerous behavior", completing the calibration of basic driving logic.

Closed-loop reinforcement learning (RL): The model undergoes high-intensity iterative training by building a world model that generates large numbers of virtual training and simulation scenarios. This method breaks the traditional reliance on real road-condition data, greatly reduces the time and cost of actual road testing, and improves training efficiency.

Together, these two phases complete the crucial transition from the base model to a dedicated driving model by first aligning with human preferences and then performing deep optimization through virtual scenarios.

III Synergistic Applications of World Models and RL

World models are pivotal for end-to-end autonomous driving training, evaluation, and simulation. They generate realistic synthetic videos from sensor inputs and vehicle states, enabling safe, controlled virtual environments for strategy assessment and physical rule comprehension.

RL Training Mechanism:

At its core, a world model is a neural-network-based model that captures the relationships among environmental states, action choices, and reward feedback, and can directly guide the agent's behavioral decisions. In intelligent driving scenarios, this model can generate optimal action strategies from real-time environmental states. More importantly, it can build a virtual interactive environment close to real-world dynamics, providing a closed-loop training platform for reinforcement learning: the system continuously receives reward feedback in the simulated environment and keeps optimizing its policy.

Through this mechanism, two core capabilities of the end-to-end model are expected to improve significantly: perception, i.e., the accuracy of recognizing and understanding environmental elements such as vehicles, pedestrians, and obstacles; and prediction, i.e., the accuracy of anticipating the behavioral intentions of other traffic participants. This full-chain optimization from perception to decision-making is the core value that world models bring to intelligent driving.
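
As a generic illustration of this closed-loop mechanism (not any vendor's system), the sketch below rolls a policy out inside a learned world model that predicts the next state and a reward, then optimizes the policy on the imagined return. Every component, including the networks, shapes, and horizon, is a toy stand-in.

```python
# Toy closed-loop training loop "inside" a learned world model: the world model
# predicts the next state and a reward, and the policy is optimized on imagined rollouts.
# All networks, shapes, and the horizon are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 32, 2, 10
world_model = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                            nn.Linear(128, state_dim + 1))   # predicts next state + reward
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def imagined_rollout(initial_state):
    state, total_reward = initial_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        pred = world_model(torch.cat([state, action], dim=-1))
        state, reward = pred[..., :state_dim], pred[..., state_dim]
        total_reward = total_reward + reward.mean()
    return total_reward

def closed_loop_step(initial_state):
    # Maximize imagined return by backpropagating through the differentiable world model.
    loss = -imagined_rollout(initial_state)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(closed_loop_step(torch.randn(8, state_dim)))
```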

Huawei's recently released "Qiankun Smart Driving ADS 4" also applies world model technology. In its "World Engine + World Behavior Model (WEWA)" technical architecture, the cloud-side "World Engine" is responsible for generating various extremely rare driving scenarios and converting them into "training questions" for the intelligent driving system, acting like the "question-setting examiner" of a simulation exam. The vehicle-side "World Behavior Model" provides full-modal perception and multi-MoE expert decision-making capabilities, acting as a "hands-on instructor" that lets the intelligent driving system accumulate experience in handling complex scenarios within the simulation environment and advance from theory to practice.

The cloud world base model recently released by Xpeng takes a large language model as its core architecture, is trained on massive amounts of high-quality multi-modal driving data, and has capabilities in visual semantic understanding, logical chain reasoning, and driving action generation. The team is currently focused on developing a super-large-scale world base model with 72 billion parameters. The cloud model builds a whole-process technology chain covering base model pre-training, reinforcement learning post-training, model distillation, vehicle-side model pre-training, and in-vehicle deployment. The whole system adopts a technical route combining reinforcement learning and model distillation, which can efficiently produce compact, highly intelligent models for vehicle-side deployment.

Li Auto's World Model Application:

In the field of intelligent driving, reinforcement learning (RL) faces training bias caused by insufficient environmental authenticity. MindVLA relies on Li Auto's self-developed unified cloud world model, which integrates reconstruction and generation technologies: the reconstruction model restores 3D scenes, while the generative model completes novel views and predicts unseen perspectives. By combining these two technical paths, MindVLA constructs a simulation environment that is close to the real world and obeys physical laws, providing an effective solution to the training bias problem.

The world model covers various traffic participants and environmental elements to build a virtual yet realistic traffic world. It uses a self-supervised learning framework to perform dynamic 3D scene reconstruction from multi-view RGB images, generating scene representations that contain multi-scale geometric features and semantic information. The scene is modeled as a 3D Gaussian point cloud, with each Gaussian point carrying parameters such as position, color, opacity, and a covariance matrix, enabling efficient rendering of lighting, shadows, and spatial structure in complex traffic environments.
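
To make the scene representation concrete, each 3D Gaussian can be thought of as a small record bundling the parameters listed above. The dataclass below is a minimal illustrative container with assumed field names and conventions, not a rendering implementation.

```python
# Minimal container for one 3D Gaussian primitive as described above:
# position, color, opacity, and a 3x3 covariance controlling its anisotropic spread.
# Field names and conventions are illustrative assumptions only.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    position: np.ndarray    # (3,) center of the Gaussian in world coordinates
    color: np.ndarray       # (3,) RGB color
    opacity: float          # transparency used during alpha blending
    covariance: np.ndarray  # (3, 3) covariance matrix defining shape and orientation

    def density_at(self, x: np.ndarray) -> float:
        """Unnormalized Gaussian falloff at point x, the quantity used when splatting."""
        d = x - self.position
        return float(self.opacity * np.exp(-0.5 * d @ np.linalg.inv(self.covariance) @ d))

g = Gaussian3D(position=np.zeros(3), color=np.array([0.8, 0.8, 0.8]),
               opacity=0.9, covariance=np.eye(3) * 0.25)
print(g.density_at(np.array([0.1, 0.0, 0.0])))
```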

Relying on the strong simulation capabilities of the world model, MindVLA can carry out millions of kilometers of driving simulation in a cloud-based virtual 3D environment, replacing part of real-vehicle road testing and verifying real-world problems accurately and at low cost. This significantly improves efficiency and effectively addresses many of the challenges posed by the model's black-box nature. Through massive simulation testing and optimization in the world model, the VLA can continuously improve its own decision-making and behavior, truly "learning from mistakes" and ensuring safety and reliability in actual driving.

Table of Contents

1 Foundation of End-to-end Intelligent Driving Technology

2 Technology Roadmap and Development Trends of E2E-AD

3 End-to-end Intelligent Driving Suppliers

4 End-to-end Intelligent Driving Layout of OEMs

Global Information, Inc. 02-2025-2992 kr-info@giikorea.co.kr
ⓒ Copyright Global Information, Inc. All rights reserved.