VLA Large Model Applications in Automotive and Robotics Research Report, 2025
Product Code: 1777128
Publisher: ResearchInChina
Publication Date: July 2025
Pages: 300 (English)
License & Price (VAT excluded)
Korean Table of Contents
In July 2023, Google DeepMind released the RT-2 model, which adopts the VLA architecture. By integrating large language models with multimodal data training, it gives robots the ability to perform complex tasks. Its task accuracy roughly doubled compared with the first-generation model (from 32% to 62%), and it achieved breakthrough zero-shot learning in scenarios such as garbage sorting.
The concept of VLA quickly drew the attention of automakers and was rapidly applied to automotive intelligent driving. If "end-to-end" was the hottest term in the intelligent driving field in 2024, "VLA" looks set to take that place in 2025. Companies such as XPeng Motors, Li Auto, and DeepRoute.ai have each released VLA solutions.
When XPeng Motors unveiled the G7 model in July, it took the lead in announcing mass production of in-vehicle VLA. Li Auto plans to equip the i8 model with VLA, expected to be revealed at the press conference on July 29. Companies such as Geely, DeepRoute.ai, and iMotion are also developing VLA.
Li Auto and XPeng Motors have proposed different solutions on whether distillation or reinforcement learning should come first when applying VLA models to vehicles.
At XPeng Motors' G7 pre-sale conference, He Xiaopeng used the brain and cerebellum as metaphors to explain the functions of traditional end-to-end and VLA. He said the traditional end-to-end solution plays the role of the cerebellum, "making the car able to drive," while VLA incorporates a large language model and plays the role of the brain, "making the car drive well."
XPeng Motors and Li Auto are taking subtly different routes in applying VLA. Li Auto first distills the cloud-based large model and then performs reinforcement learning on the distilled end-side model. XPeng Motors first performs reinforcement learning on the cloud-based large model and then distills it to the vehicle end.
In May 2025, Li Xiang said in AI Talk that Li Auto's cloud-based base model has 32 billion parameters; a 3.2 billion parameter model is distilled to the vehicle end, undergoes post-training and reinforcement learning on driving scenario data, and in the fourth stage the final Driver Agent is deployed on both the device and the cloud.
XPeng Motors has likewise divided its factory for VLA model training and deployment into four workshops: the first handles pre-training and post-training of the base model; the second handles model distillation; the third continues pre-training the distilled model; and the fourth deploys XVLA to the vehicle end. Dr. Liu Xianming, head of the XPeng World Base Model, said that XPeng Motors has trained XPeng World Base Models at multiple parameter scales in the cloud, including 1 billion, 3 billion, 7 billion, and 72 billion.
Which solution is better suited to the intelligent driving environment will be determined by the concrete performance of each manufacturer's VLA solution after it is applied in vehicles.
Recently, research teams from McGill University, Tsinghua University, Xiaomi Corporation, and the University of Wisconsin-Madison jointly released a comprehensive survey of VLA models in autonomous driving, "A Survey on Vision-Language-Action Models for Autonomous Driving". The article divides the development of VLA into four stages: Pre-VLA (VLM as explainer), Modular VLA, End-to-End VLA, and Augmented VLA, clearly laying out the characteristics of VLA at each stage and its step-by-step development.
This report examines VLA large models in the automotive and robotics fields, summarizing their technical origins, development stages, application cases, and core characteristics. It also organizes 8 representative VLA implementation solutions and representative VLA large models in intelligent driving and robotics, and identifies 4 major trends in VLA development.
Table of Contents
Related Definitions
Chapter 1 Overview of VLA Large Models
Basic Definition of VLA (Vision-Language-Action Model)
Origin and Evolution of VLA Technology
Classification of VLA Large Model Methods
Four Stages of VLA Model Development in Autonomous Driving
VLA Solution Application (1)
VLA Solution Application (2)
VLA Solution Application (3)
VLA Solution Application (4)
Case 1: Enhancement of VLA Generalization
Case 2: VLA Computational Overhead
Core Characteristics of VLA
Challenges in VLA Technology Development
Chapter 2 VLA Technical Architecture, Solutions and Trends
Analysis of VLA Core Technical Architecture (1)
Analysis of VLA Core Technical Architecture (2)
Analysis of VLA Core Technical Architecture (3)
Analysis of VLA Core Technical Architecture (4)
Analysis of VLA Core Technical Architecture (5)
Analysis of VLA Core Technical Architecture (6)
Analysis of VLA Core Technical Architecture (7)
Core of VLA Decision-Making: Chain-of-Thought (CoT) Technology
Overview of VLA Large Model Implementation Solutions
VLA Implementation Solution (1): Based on the Classic Transformer Structure
VLA Implementation Solution (2): Based on Pre-trained LLM/VLM
VLA Implementation Solution (3): Based on Diffusion Models
VLA Implementation Solution (4): LLM + Diffusion Models
VLA Implementation Solution (5): Video Generation + Inverse Kinematics
VLA Implementation Solution (6): Explicit End-to-End VLA
VLA Implementation Solution (7): Implicit End-to-End VLA
VLA Implementation Solution (8): Hierarchical End-to-End VLA
Summary of Intelligent Driving VLA Models
Summary of Embodied AI VLA Models
Case 1
Case 2
Case 3
Case 4
VLA Development Trend (1)
VLA Development Trend (2)
VLA Development Trend (3)
VLA Development Trend (4)
Chapter 3 Application of VLA Large Models in the Automotive Field
Li Auto
XPeng Motors
Chery Automobile
Geely
Xiaomi Auto
DeepRoute.ai
Baidu Apollo
Horizon Robotics
SenseTime
NVIDIA
iMotion
Chapter 4 Progress of Large Models in the Robotics Field
General Basic Models for Robots
Robot Multimodal Large Models
Robot Data Generalization Models
Robot Large Model Datasets
Robot VLM Models
Robot VLN Models
Robot VLA Models
Robot World Models
Chapter 5 Application Cases of VLA in the Robotics Field
AgiBot
Galbot
Robot Era
Estun
Unitree
UBTECH
Tesla Optimus
Figure AI
Apptronik
Agility Robotics
XPeng IRON
Xiaomi CyberOne
GAC GoMate
Mornine
Leju Robotics
LimX Dynamics
AI2 Robotics
X Square Robot
English Table of Contents
ResearchInChina releases "VLA Large Model Applications in Automotive and Robotics Research Report, 2025"
The report summarizes and analyzes the technical origin, development stages, application cases and core characteristics of VLA large models.
It sorts out 8 typical VLA implementation solutions, as well as typical VLA large models in the fields of intelligent driving and robotics, and summarizes 4 major trends in VLA development.
It analyzes the VLA application solutions in the field of intelligent driving of companies such as Li Auto, XPeng Motors, Chery Automobile, Geely Automobile, Xiaomi Auto, DeepRoute.ai, Baidu, Horizon Robotics, SenseTime, NVIDIA, and iMotion.
It sorts out more than 40 large model frameworks or solutions such as robot general basic models, multimodal large models, data generalization models, VLM models, VLN models, VLA models and robot world models.
It analyzes the large models and VLA large model application solutions of companies such as AgiBot, Galbot, Robot Era, Estun, Unitree, UBTECH, Tesla Optimus, Figure AI, Apptronik, Agility Robotics, XPeng IRON, Xiaomi CyberOne, GAC GoMate, Chery Mornine, Leju Robotics, LimX Dynamics, AI2 Robotics, and X Square Robot.
Vision-Language-Action (VLA) model is an end-to-end artificial intelligence model that integrates three modalities: Vision, Language, and Action. Through a unified multimodal learning framework, it integrates perception, reasoning and control, and directly generates executable physical world actions (such as robot joint movement, vehicle steering control) based on visual input (such as images, videos) and language instructions (such as task descriptions).
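The VLA pattern described above, mapping visual input plus a language instruction directly to an executable action, can be illustrated with a minimal sketch. All class and method names here (ToyVLAModel, encode_vision, encode_language) are illustrative assumptions for this report, not any vendor's actual API; the "encoders" are trivial stand-ins for real neural networks.

```python
# Minimal illustrative sketch of the VLA (Vision-Language-Action) pattern:
# one model takes an image and a language instruction and emits an action.
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    """An executable physical-world action, e.g. steering/throttle for a vehicle."""
    steering: float  # normalized to [-1, 1]
    throttle: float  # normalized to [0, 1]


class ToyVLAModel:
    def encode_vision(self, image: List[List[float]]) -> List[float]:
        # Stand-in for a vision encoder: mean brightness per image row.
        return [sum(row) / len(row) for row in image]

    def encode_language(self, instruction: str) -> List[float]:
        # Stand-in for a text encoder: crude keyword features.
        words = instruction.lower().split()
        return [1.0 if "left" in words else 0.0,
                1.0 if "slow" in words else 0.0]

    def act(self, image: List[List[float]], instruction: str) -> Action:
        v = self.encode_vision(image)
        lang = self.encode_language(instruction)
        # A real VLA fuses the modalities inside a transformer; here the toy
        # features are combined directly into one action.
        steering = -1.0 if lang[0] else 0.0
        throttle = 0.2 if lang[1] else min(1.0, sum(v) / len(v))
        return Action(steering=steering, throttle=throttle)


model = ToyVLAModel()
a = model.act([[0.5, 0.5], [0.4, 0.6]], "turn left and go slow")
print(a)  # Action(steering=-1.0, throttle=0.2)
```

The key point the sketch captures is the end-to-end signature: perception and language go in, a physically executable action comes out, with no hand-written planner in between.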
In July 2023, Google DeepMind launched the RT-2 model, which adopts the VLA architecture. By integrating large language models with multimodal data training, it endows robots with the ability to perform complex tasks. Its task accuracy has nearly doubled compared with the first-generation model (from 32% to 62%), and it has achieved breakthrough zero-shot learning in scenarios such as garbage classification.
The concept of VLA was quickly noticed by automobile companies and rapidly applied to the field of automotive intelligent driving. If "end-to-end" was the hottest term in the intelligent driving field in 2024, then "VLA" will be the one in 2025. Companies such as XPeng Motors, Li Auto, and DeepRoute.ai have released their respective VLA solutions.
When XPeng Motors released the G7 model in July, it took the lead in announcing the mass production of VLA in vehicles. Li Auto plans to equip the i8 model with VLA, which is expected to be revealed at the press conference on July 29. Enterprises such as Geely Automobile, DeepRoute.ai and iMotion are also developing VLA.
Li Auto and XPeng Motors have proposed different solutions on whether VLA models should undergo distillation first or reinforcement learning first when applied in vehicles
At the pre-sale conference of XPeng Motors' G7, He Xiaopeng used the brain and cerebellum as metaphors to explain the functions of the traditional end-to-end and VLA. He said that traditional end-to-end solution plays the role of cerebellum, "making the car able to drive", while VLA introduces a large language model, playing the role of brain, "making the car drive well".
XPeng Motors and Li Auto have taken slightly different routes in VLA application: Li Auto first distills the cloud-based base large model, and then performs reinforcement learning on the distilled end-side model; XPeng Motors first performs reinforcement learning on the cloud-based base large model, and then distills it to the vehicle end.
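The two training orders described above differ only in where reinforcement learning sits relative to distillation. The following sketch makes the contrast explicit; the function names and the dictionary-based "model" are placeholders invented for illustration, not either company's actual pipeline.

```python
# Hedged sketch contrasting the two VLA training orders described in the text.
# pretrain_cloud_model / distill / reinforcement_learn are illustrative
# placeholders; the parameter counts follow the figures quoted for Li Auto.

def pretrain_cloud_model() -> dict:
    return {"params": 32_000_000_000, "stages": ["pretrain"]}

def distill(model: dict, target_params: int) -> dict:
    return {"params": target_params, "stages": model["stages"] + ["distill"]}

def reinforcement_learn(model: dict) -> dict:
    return {"params": model["params"], "stages": model["stages"] + ["RL"]}

# Route A (as described for Li Auto): distill the cloud model first,
# then run reinforcement learning on the end-side model.
route_a = reinforcement_learn(distill(pretrain_cloud_model(), 3_200_000_000))

# Route B (as described for XPeng): run reinforcement learning on the
# cloud model first, then distill it to the vehicle end.
route_b = distill(reinforcement_learn(pretrain_cloud_model()), 3_200_000_000)

print(route_a["stages"])  # ['pretrain', 'distill', 'RL']
print(route_b["stages"])  # ['pretrain', 'RL', 'distill']
```

Both routes end with a small end-side model; the trade-off is whether reinforcement learning is done cheaply on the small model (Route A) or at full scale in the cloud before compression (Route B).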
In May 2025, Li Xiang mentioned in AI Talk that Li Auto's cloud-based base model has 32 billion parameters; a 3.2 billion parameter model is distilled to the vehicle end and then undergoes post-training and reinforcement learning on driving scenario data, and in the fourth stage the final driver Agent will be deployed on both the device and the cloud.
XPeng Motors has also divided its factory for training and deploying VLA models into four workshops: the first workshop is responsible for pre-training and post-training of the base model; the second workshop is responsible for model distillation; the third workshop continues pre-training the distilled model; the fourth workshop deploys XVLA to the vehicle end. Dr. Liu Xianming, head of XPeng's world base model, said that XPeng Motors has trained "XPeng World Base Models" at multiple parameter scales in the cloud, including 1 billion, 3 billion, 7 billion, and 72 billion.
Which solution is more suitable for the intelligent driving environment remains to be seen based on the specific performance of different manufacturers' VLA solutions after being applied in vehicles.
Recently, research teams from McGill University, Tsinghua University, Xiaomi Corporation, and the University of Wisconsin-Madison jointly released a comprehensive review article on VLA models in the field of autonomous driving, "A Survey on Vision-Language-Action Models for Autonomous Driving". The article divides the development of VLA into four stages: Pre-VLA (VLM as explainer), Modular VLA, End-to-end VLA and Augmented VLA, clearly showing the characteristics of VLA in different stages and the gradual development process of VLA.
There are over 100 robot VLA models, with teams constantly exploring different paths
Compared with VLA large models applied in automobiles, which have tens of billions of parameters and rely on nearly 1,000 TOPS of computing power, dedicated AI computing chips are still optional in the robotics field, and training datasets mostly contain between 1 million and 3 million samples. There are also controversies over technical routes, such as the mixed use of real data and simulated synthetic data. One reason is that hundreds of millions of cars are on the road, while the number of actually deployed robots is very small; another important reason is that robot VLA models focus on exploring the microcosmic world. Compared with the grand automotive world model, robot application scenarios involve richer multimodal perception, more complex execution actions, and more microscopic sensor data.
There are more than 100 VLA models and related data sets in the robotics field, and new papers are constantly emerging, with various teams exploring in different paths.
Exploration 1: VTLA model integrating visual and tactile perception
In May 2025, research teams from the Institute of Automation of the Chinese Academy of Sciences, Samsung Beijing Research Institute, Beijing Academy of Artificial Intelligence (BAAI), and the University of Wisconsin-Madison jointly released a paper on VTLA for insertion manipulation tasks. The research shows that integrating visual and tactile perception is crucial when robots perform contact-intensive tasks with high precision requirements. By integrating visual, tactile and language inputs, combined with a temporal enhancement module and a preference learning strategy, VTLA outperforms traditional imitation learning methods and single-modal models on contact-intensive insertion tasks.
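The multimodal-fusion idea summarized above can be sketched as follows. This is a toy illustration only: the class name, feature shapes, and the averaging window standing in for the paper's temporal enhancement module are all assumptions, and a real model would fuse modalities with learned cross-attention rather than concatenation.

```python
# Illustrative sketch of vision + tactile + language fusion with a temporal
# window over recent tactile readings (stand-in for a temporal enhancement
# module). Names and shapes are assumptions for illustration.
from collections import deque
from typing import List


class ToyVTLAFusion:
    def __init__(self, window: int = 3):
        # Keep only the most recent `window` tactile readings.
        self.tactile_history: deque = deque(maxlen=window)

    def step(self, vision_feat: List[float], tactile_feat: List[float],
             lang_feat: List[float]) -> List[float]:
        self.tactile_history.append(tactile_feat)
        # Average tactile features over the window (temporal smoothing).
        n = len(self.tactile_history)
        tactile_avg = [sum(t[i] for t in self.tactile_history) / n
                       for i in range(len(tactile_feat))]
        # Fuse by concatenation; a real model would use cross-attention.
        return vision_feat + tactile_avg + lang_feat


fusion = ToyVTLAFusion(window=2)
fusion.step([0.1], [1.0, 0.0], [0.5])
fused = fusion.step([0.2], [0.0, 1.0], [0.5])
print(fused)  # [0.2, 0.5, 0.5, 0.5]
```

The point of the temporal window is that a single tactile frame is noisy during contact; pooling recent readings gives the policy a more stable contact signal, which is the intuition behind the temporal enhancement described in the text.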
Exploration 2: VLA model supporting multi-robot collaborative operation
In February 2025, Figure AI released Helix, a general Embodied AI model. Helix can run collaboratively on humanoid robots, enabling two robots to cooperate on a shared, long-horizon manipulation task. In the video shown at the press conference, Figure AI's robots demonstrated smooth collaboration while placing fruits: the robot on the left pulled the fruit basin over, the robot on the right put the fruits in, and the robot on the left then returned the basin to its original position.
Figure AI emphasized that this only scratches "the surface of possibilities", and the company is eager to see what happens when Helix is scaled up 1,000 times. Figure AI noted that Helix runs entirely on embedded low-power GPUs and can be commercially deployed immediately.
Exploration 3: Offline end-side VLA model in the robotics field
In June 2025, Google released Gemini Robotics On-Device, a VLA multimodal large model that can run locally offline on embodied robots. The model can simultaneously process visual input, natural language instructions, and action output. It can maintain stable operation even in an environment without a network.
It is particularly worth noting that the model has strong adaptability and versatility. Google pointed out that Gemini Robotics On-Device is the first robot VLA model that opens the fine-tuning function to developers, enabling developers to conduct personalized training on the model according to their specific needs and application scenarios.
VLA robots have been applied in a large number of automobile factories
When the macro world model of automobiles is integrated with the micro world model of robots, the real era of Embodied AI will come.
As Embodied AI enters the VLA stage of development, automobile enterprises have natural first-mover advantages. Tesla Optimus, XPeng IRON, and Xiaomi CyberOne draw heavily on their makers' rich experience in intelligent driving, sensor technology, and machine vision, integrating the technical accumulation of the intelligent driving field. The XPeng IRON robot is equipped with XPeng Motors' AI Hawkeye vision system, end-to-end large model, Tianji AIOS, and Turing AI chip.
At the same time, automobile factories are currently the main application scenarios for robots. Tesla Optimus robots are currently mainly used in Tesla's battery workshops. Apptronik cooperates with Mercedes-Benz, and Apollo robots enter Mercedes-Benz factories to participate in car manufacturing, with tasks including handling, assembly and other physical work. At the model level, Apptronik has established a strategic cooperation with Google DeepMind, and Apollo has integrated Google's Gemini Robotics VLA large model.
On July 18, UBTECH released the hot-swappable autonomous battery replacement system for the humanoid robot Walker S2, which enables Walker S2 to achieve 3-minute autonomous battery replacement without manual intervention.
According to public reports, many car companies including Tesla, BMW, Mercedes-Benz, BYD, Geely Zeekr, Dongfeng Liuzhou Motor, Audi FAW, FAW Hongqi, SAIC-GM, NIO, XPeng, Xiaomi, and BAIC Off-Road Vehicle have deployed humanoid robots in their automobile factories. Humanoid robots such as Figure AI, Apptronik, UBTECH, AI2 Robotics, and Leju are widely used in various links such as automobile and parts production and assembly, logistics and transportation, equipment inspection, and factory operation and maintenance. In the near future, AI robots will be the main "labor force" in "unmanned factories".
Table of Contents
Related Definitions
Chapter 1 Overview of VLA Large Models
1.1 Basic Definition of VLA (Vision-Language-Action Model)
1.2 Origin and Evolution of VLA Technology
1.3 Classification of VLA Large Model Methods
1.4 Four Stages of VLA Model Development in Autonomous Driving
1.5 VLA Solution Application (1)
1.5 VLA Solution Application (2)
1.5 VLA Solution Application (3)
1.5 VLA Solution Application (4)
1.6 Case 1: Enhancement of VLA Generalization
1.6 Case 2: VLA Computational Overhead
1.7 Core Characteristics of VLA
1.8 Challenges in VLA Technology Development
Chapter 2 VLA Technical Architecture, Solutions and Trends
2.1 Analysis of VLA Core Technical Architecture (1)
2.1 Analysis of VLA Core Technical Architecture (2)
2.1 Analysis of VLA Core Technical Architecture (3)
2.1 Analysis of VLA Core Technical Architecture (4)
2.1 Analysis of VLA Core Technical Architecture (5)
2.1 Analysis of VLA Core Technical Architecture (6)
2.1 Analysis of VLA Core Technical Architecture (7)