Cockpit Agent Engineering Research: Breakthrough from Digital AI to Physical AI
Cockpit Agent Engineering Research Report, 2025 starts with the status quo of cockpit agents, summarizes the technical roadmaps of the R&D and engineering stages and the characteristics of agents from leading OEMs, and predicts future trends and priorities for cockpit agent applications.
Action: Last Mile Mission
Since foundation models were first installed in vehicles in 2023, cockpit AI assistants have assumed different tasks at different stages. In 2025, cockpit AI assistants focus on action, which means they "help users get things done" instead of merely "giving suggestions", marking an important step in the transformation from "assistants" to "agents".
One typical scenario for cockpit AI assistants in 2025 is ordering food at restaurants:
In 2024, when a user wanted to order coffee, a cockpit AI assistant could only find nearby coffee shops on the map for the user to manually select and navigate to; ordering and payment were handled entirely by the user, with no help from the AI assistant.
By 2025, when a user orders coffee, the cockpit AI assistant can confirm his/her intention and automatically complete a series of operations such as ordering and payment, freeing the user from the details and improving the user experience.
The entire process involves technologies related to long-term memory, tool calling, and multi-agent collaboration.
1. Case 1: Tool Calling
In early 2024, OpenAI's Function Calling was the mainstream technology used by cockpit agents when calling tools, enabling direct interaction between a single model and a single tool.
The Model Context Protocol (MCP), introduced by Anthropic in November 2024, builds on Function Calling to address the issue of "multi-component collaboration", expanding Function Calling's application scenarios and improving its efficiency.
In April 2025, Google proposed the A2A (Agent2Agent) protocol to further standardize the communication and collaboration between different agents.
For example, the agent application solution of Lixiang Tongxue in 2025 includes an MCP/A2A technical framework (another framework is CUA):
MCP/A2A: The IVI agent acts as the leader of the multi-agent system (MAS), assigning tasks to third-party agents, which then complete their respective workflows.
CUA (Cockpit Using Agent): The operating system calls a multimodal foundation model to understand, decompose, and plan instructions/tasks, generate the final action, and then call applets and apps to complete them. For example, in the payment scenario, after a series of understanding and planning steps, Lixiang Tongxue calls an API to connect to Alipay's automotive assistant and uses the relevant applet in Alipay's ecosystem to complete the payment.
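To make the division of labor concrete, below is a minimal Python sketch of a leader agent decomposing an instruction and dispatching sub-tasks to third-party agents. The agent names, the static plan, and the handler functions are hypothetical illustrations, not Lixiang Tongxue's actual implementation or the MCP/A2A wire protocols.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical third-party agents registered with the IVI leader agent.
# In a real MCP/A2A setup these would be remote services speaking the
# respective protocols; here they are plain callables for illustration.
@dataclass
class LeaderAgent:
    workers: Dict[str, Callable[[dict], dict]] = field(default_factory=dict)

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        self.workers[name] = handler

    def plan(self, instruction: str) -> List[dict]:
        # A real system would use a multimodal foundation model to
        # understand and decompose the instruction; this static plan
        # just mirrors the coffee-ordering example in the text.
        return [
            {"agent": "coffee_shop", "task": {"item": "latte", "size": "M"}},
            {"agent": "payment", "task": {"channel": "alipay_applet"}},
        ]

    def execute(self, instruction: str) -> List[dict]:
        results = []
        for step in self.plan(instruction):
            handler = self.workers[step["agent"]]
            results.append(handler(step["task"]))
        return results


def coffee_shop_agent(task: dict) -> dict:
    # Pretend to place the order and return an order id.
    return {"status": "ordered", "order_id": "A-1001", **task}


def payment_agent(task: dict) -> dict:
    # Pretend to open the payment applet and confirm payment.
    return {"status": "paid", "channel": task["channel"]}


if __name__ == "__main__":
    leader = LeaderAgent()
    leader.register("coffee_shop", coffee_shop_agent)
    leader.register("payment", payment_agent)
    for result in leader.execute("Order my usual latte and pay for it"):
        print(result)
```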
During training, the Lixiang Tongxue team uses MCP-managed tool services when optimizing the reward module in the agent reinforcement learning phase, for example using an MCP Hub to provide a catalog of callable tool resources for training tasks and business requests.
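As a rough illustration of what such a tool catalog might look like, here is a small sketch in which tools are registered with tags and looked up per training task or business request. The `ToolCatalog` and `ToolSpec` classes are hypothetical stand-ins for an MCP Hub, not the team's actual service.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-in for an MCP Hub: a catalog that maps training
# tasks or business requests to the tool resources an agent may call.
@dataclass
class ToolSpec:
    name: str
    description: str
    tags: List[str]

class ToolCatalog:
    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def lookup(self, request_tag: str) -> List[ToolSpec]:
        # Return the callable tools relevant to a training task or
        # business request, identified here by a simple tag match.
        return [t for t in self._tools.values() if request_tag in t.tags]

if __name__ == "__main__":
    catalog = ToolCatalog()
    catalog.register(ToolSpec("poi_search", "Find nearby points of interest", ["navigation", "ordering"]))
    catalog.register(ToolSpec("alipay_applet", "Complete an in-car payment", ["ordering", "parking"]))
    for tool in catalog.lookup("ordering"):
        print(tool.name, "-", tool.description)
```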
In the next phase, Lixiang Tongxue plans to strengthen its multimodal capabilities and implement COA (Chain of Action), in which the same model continuously reasons about how to call external tools to solve problems and take action, further improving the synergy between the tool calling, reasoning, and action modules.
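The chain-of-action idea can be pictured as a loop in which the same model alternately decides which tool to call, acts, and feeds the observation back into its next decision. The sketch below is a minimal, hypothetical version with a scripted "model" step; it is not COA as Lixiang Tongxue plans to implement it.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical tools the model may call while reasoning.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search_poi": lambda q: f"Found 3 coffee shops matching '{q}'",
    "place_order": lambda q: f"Order placed: {q}",
    "pay": lambda q: f"Payment completed for {q}",
}

def model_step(goal: str, history: list) -> Tuple[Optional[str], str]:
    """Stand-in for the model's reasoning step: decide the next tool
    call (or None when the goal is done) based on what has happened."""
    script = [("search_poi", "latte"), ("place_order", "1 latte"), ("pay", "order A-1001")]
    if len(history) < len(script):
        return script[len(history)]
    return None, ""

def chain_of_action(goal: str) -> list:
    history = []
    while True:
        tool, arg = model_step(goal, history)
        if tool is None:                     # the model decides the goal is complete
            break
        observation = TOOLS[tool](arg)       # act, then feed the result back
        history.append((tool, arg, observation))
    return history

if __name__ == "__main__":
    for step in chain_of_action("Order a latte and pay"):
        print(step)
```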
2. Case 2: GUI Agent
A GUI agent (graphical user interface agent) is a specialized LLM agent that processes user commands or requests in natural language, understands the current state of the GUI through screenshots or UI element trees, and performs actions that simulate human-computer interaction, allowing it to operate across various software interfaces.
A GUI agent typically includes modules such as the operating environment, prompt engineering, model inference, action, and memory.
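A minimal sketch of that loop might look like the following, with the perception, inference, action, and memory steps stubbed out; the class and method names are assumptions for illustration rather than any vendor's API.

```python
from dataclasses import dataclass, field
from typing import List

# A minimal sketch of the GUI agent loop described above, assuming a
# perception step that reads the UI state, a model-inference step that
# picks an action, an action step that simulates the interaction, and a
# memory of past steps. All components here are stubs for illustration.

@dataclass
class UIState:
    elements: List[str]            # e.g. labels recovered from a screenshot or UI tree

@dataclass
class Action:
    kind: str                      # "tap", "type", "scroll", ...
    target: str

@dataclass
class GUIAgent:
    memory: List[Action] = field(default_factory=list)

    def perceive(self) -> UIState:
        # In practice: take a screenshot or query the UI element tree.
        return UIState(elements=["Latte ¥28", "Add to cart", "Pay"])

    def infer(self, command: str, state: UIState) -> Action:
        # In practice: prompt a multimodal model with the command,
        # the UI state, and the interaction history.
        target = next((e for e in state.elements if "Pay" in e), state.elements[0])
        return Action(kind="tap", target=target)

    def act(self, action: Action) -> None:
        # In practice: inject a touch event into the operating environment.
        print(f"Simulating {action.kind} on '{action.target}'")
        self.memory.append(action)

if __name__ == "__main__":
    agent = GUIAgent()
    state = agent.perceive()
    agent.act(agent.infer("Pay for my coffee", state))
```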
GUI agent technology is still far from fully mature, but some OEMs, including Li Auto, Geely, and Xiaomi, have already started to deploy it.
In the aforementioned ordering scenario, Lixiang Tongxue leverages GUI agent technology when selecting a meal package, so that it can operate the on-screen components automatically without user intervention. The Lixiang Tongxue team has pointed out that the operation accuracy of the GUI agent also affects the final action of the CUA framework (because the payment process requires scanning screenshots, which involves the GUI agent); if the accuracy is too low, it may be difficult to guarantee a stable experience for complex tasks such as registering for parking and paying parking fees.
For example, Xiaomi has launched a GUI agent framework, "BTL-UI", which uses the Group Relative Policy Optimization (GRPO) algorithm within a Markov decision process (MDP). At each time step, the agent receives the current screen state, user commands, and historical interaction records, and outputs a structured BTL response, converting the multimodal input into a comprehensive output that includes visual attention zones, reasoning processes, and command execution.
Its implementation methods and core technologies include:
Bionic interaction framework: Based on the BTL-UI model, it simulates human visual attention allocation (blinking), logical reasoning (thinking), and precise execution (action), supporting complex multi-step tasks (such as cross-application calls and multimodal interactions).
Automated data generation: It automatically analyzes screenshots, identifies the interface elements most relevant to user commands, and generates high-quality attention annotations for these zones.
BTL reward mechanism: It evaluates each intermediate cognitive stage, checking whether the AI correctly identifies the relevant interface elements, performs sound logical reasoning, and generates accurate operation instructions.
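Based on this description, a BTL-style structured response could be represented as attention zones, a reasoning trace, and an action, scored stage by stage. The field names, weights, and checks below are assumptions for illustration, not Xiaomi's published implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A hypothetical representation of a BTL-style structured response and a
# stage-wise reward check, loosely following the description above.
@dataclass
class BTLResponse:
    attention_zones: List[Tuple[int, int, int, int]]  # "blink": candidate UI regions (x, y, w, h)
    reasoning: str                                     # "think": the reasoning trace
    action: str                                        # final stage: the command to execute

def staged_reward(resp: BTLResponse,
                  gold_zone: Tuple[int, int, int, int],
                  gold_action: str) -> float:
    """Score each cognitive stage separately and combine the scores.
    The weights and the simple containment/equality checks are arbitrary
    illustrative choices."""
    zone_ok = gold_zone in resp.attention_zones          # did it attend to the right element?
    reasoning_ok = len(resp.reasoning.split()) >= 5      # crude proxy for a non-trivial rationale
    action_ok = resp.action == gold_action               # did it emit the correct instruction?
    return 0.3 * zone_ok + 0.2 * reasoning_ok + 0.5 * action_ok

if __name__ == "__main__":
    resp = BTLResponse(
        attention_zones=[(120, 640, 200, 60)],
        reasoning="The 'Pay' button matches the user's request to settle the parking fee.",
        action="tap(120, 640)",
    )
    print(staged_reward(resp, gold_zone=(120, 640, 200, 60), gold_action="tap(120, 640)"))
```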
OEMs are currently transitioning from L2 reasoners to L3 agents, with L3 further divided into four stages.
According to OpenAI's definition of AGI, Chinese OEMs are currently in the process of transitioning from L2 reasoners to L3 agents. Each stage poses different problems to solve, with corresponding characteristics:
At present, most OEMs' cockpit AI assistants have delivered "professional services" to a certain extent. The next goal is to achieve "emotional resonance" and overcome the hurdle of "proactive prediction".
For "emotional resonance", NIO offers "Nomi" as a leading player.
In 2025, most AI assistants' emotional chats are implemented primarily through tone changes simulated by TTS technology, terminology from the knowledge base (such as colloquial interjections), and preset emotional scenario workflows. Compared to other cockpit agents, Nomi has two unique advantages:
1. Physical shell: Nomi can materialize more than 200 dynamic expressions through its physical shell "Nomi Mate" (upgraded to version 3.0 as of November 2025), delivering emotional value in the real world. For example, when interacting with people via voice, Nomi simulates the head movements that occur in face-to-face conversation and turns toward the source of a sound with an arc-shaped head-turning trajectory, much as a person would.
2. Emotional settings:
In terms of architecture, a dedicated "emotion engine" module is set up. Through three sub-modules, namely "contextual intelligence", "personalized intelligence" and "emotional expression", it uses voice, vision and multimodal perception technologies to achieve contextual arbitration, derive a series of understandings of the current situation, and realize natural human-like reactions in emotional scenarios.
In terms of settings, Nomi can have a personality. Based on these settings, it can perform search associations through a GPT-like streaming prediction model, exhibiting unique situational responses and providing a personalized experience for each user (such as simulating multiple MBTI personalities, in contrast to Lixiang Tongxue, which is set as ENFJ).
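One way to picture the contextual arbitration step is as a function that fuses perception signals into a single emotional context which then conditions the response. The signal names, thresholds, and persona handling below are hypothetical, not NIO's actual emotion engine.

```python
from dataclasses import dataclass

# Hypothetical perception signals feeding an emotion-engine-style
# arbitration step; the rules are illustrative only.
@dataclass
class PerceptionSignals:
    voice_pitch: float      # normalized 0..1, from the speech front end
    facial_valence: float   # -1 (negative) .. 1 (positive), from the cabin camera
    driving_stress: float   # 0..1, e.g. derived from traffic and maneuvers

def arbitrate_context(sig: PerceptionSignals) -> str:
    """Derive a single emotional context label from multimodal signals."""
    if sig.driving_stress > 0.7:
        return "calm_reassuring"        # keep responses short and soothing
    if sig.facial_valence < -0.3 or sig.voice_pitch > 0.8:
        return "empathetic"             # acknowledge frustration first
    return "cheerful"

def respond(user_utterance: str, context: str, persona: str = "ENFP") -> str:
    # A real system would condition a generative model on the context
    # and the configured persona; here we just prefix the reply.
    return f"[{persona}/{context}] I heard: {user_utterance}"

if __name__ == "__main__":
    signals = PerceptionSignals(voice_pitch=0.9, facial_valence=-0.5, driving_stress=0.4)
    print(respond("Traffic is terrible today", arbitrate_context(signals)))
```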
After achieving "proactive prediction," cockpit agents make a breakthrough from digital AI to physical AI.
From L3.5 onwards, generalization becomes one of the factors limiting agents' ability to flexibly handle multi-scenario tasks. To improve generalization across scenarios, agents should not only learn policies (what actions to take in a given state) but also learn dynamic environment models (how the world will change after an action is performed), so that they can make predictions while interacting directly with the environment.
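The distinction between a policy and a dynamics model can be made concrete with a toy sketch: the policy maps a state to an action, while the environment model predicts the next state and reward, letting the agent evaluate a plan before or while acting. The states and transitions below are invented for illustration.

```python
from typing import Dict, Tuple

# Toy illustration of the policy / environment-model distinction.
# States and transitions are made up for the example.
State = str
Action = str

def policy(state: State) -> Action:
    """What to do in a given state (learned from experience)."""
    table: Dict[State, Action] = {
        "at_restaurant_menu": "select_meal",
        "meal_selected": "open_payment",
        "payment_open": "confirm_payment",
    }
    return table.get(state, "ask_user")

def dynamics_model(state: State, action: Action) -> Tuple[State, float]:
    """How the world changes after an action, with a predicted reward.
    A learned model would be trained from interaction data."""
    transitions = {
        ("at_restaurant_menu", "select_meal"): ("meal_selected", 0.1),
        ("meal_selected", "open_payment"): ("payment_open", 0.1),
        ("payment_open", "confirm_payment"): ("order_complete", 1.0),
    }
    return transitions.get((state, action), (state, -0.1))

if __name__ == "__main__":
    state = "at_restaurant_menu"
    total = 0.0
    while state != "order_complete":
        action = policy(state)
        state, reward = dynamics_model(state, action)   # predict while acting
        total += reward
        print(state, reward)
    print("predicted return:", round(total, 2))
```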
To avoid limitations caused by the shortage of high-quality data, one solution is to learn in a real physical environment to achieve a breakthrough from digital AI to physical AI.
For example, the Lixiang Tongxue team has found that, after the base model is trained on massive Internet data, additional data improves the model's capabilities less and less; in other words, the marginal benefit of the scaling law in model pre-training declines.
Therefore, the Lixiang Tongxue team has changed the training method for the next stage: it will focus on the interaction between the model and the physical world. Through reinforcement learning, the model will judge the correctness of its thinking process and accumulate experience and data through interaction with the environment.
Fei-Fei Li's team from World Labs has proposed "augmented interactive agents," which feature multimodal capabilities with "cross-reality-agnostic" integration and incorporate an emergent mechanism.
In training intelligent agents, Fei-Fei Li's team has introduced an "in-context prompt" or "implicit reward function" to capture key features of expert behavior. The agents can then be trained for task execution on physical-world behavior data collected from expert demonstrations in the form of "state-action pairs".
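Training on demonstrations stored as state-action pairs is, in its simplest form, imitation of the expert's recorded behavior. The sketch below uses a nearest-neighbor lookup as a stand-in for a learned model; the data and the method are illustrative assumptions, not the team's approach.

```python
from typing import List, Tuple

# Minimal behavior-cloning-style sketch: expert demonstrations are stored
# as (state, action) pairs and the agent imitates the action whose recorded
# state is closest to the current one. A real system would fit a model
# instead of using nearest neighbor; the data here is made up.
StateVec = Tuple[float, ...]

demonstrations: List[Tuple[StateVec, str]] = [
    ((0.9, 0.1), "slow_down"),       # e.g. obstacle close, low speed margin
    ((0.2, 0.8), "keep_going"),
    ((0.5, 0.5), "announce_turn"),
]

def imitate(state: StateVec) -> str:
    def dist(a: StateVec, b: StateVec) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best_state, best_action = min(demonstrations, key=lambda d: dist(d[0], state))
    return best_action

if __name__ == "__main__":
    print(imitate((0.85, 0.2)))      # expected to imitate "slow_down"
```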
In 2025, most OEMs chose a multi-agent approach to build their cockpit AI systems. Multi-agent collaboration is also one of the ways to improve the generalization of agents. Through "domain specialization + scenario linkage + group learning", the generalization limitations of existing agents can be broken through from multiple dimensions.
For example, GAC's "Beibi" agent can recognize intent in complex scenarios through multi-agent collaboration built on foundation-model intent recognition, tackling problems of vertical agents such as the lack of a unified interaction entry and inefficient collaboration. It eliminates the need for users to operate multiple agents separately (such as adjusting navigation and air conditioning individually), thus improving collaboration efficiency. Its principles include:
Build the core intelligent agent: Fine-tune the pre-trained language model using a pre-set dataset related to automotive scenarios (such as vehicle control, navigation, and other instruction records) to obtain an intent recognition model. Then, build an "intent understanding intelligent agent" based on this model, while adding a caching service to improve response speed.
Parse user intent: Receive user commands (such as voice or touch commands), obtain the intent recognition result from the intent understanding agent (including 1-3 intents and their confidence scores, e.g., "Find a gas station" with confidence 0.85, "Adjust temperature" with confidence 0.9), and cache the commands and results.
Call collaborative agents: Make collaborative decisions based on the current scenario (such as driving status, weather), call on target agents related to the intent (such as navigation and vehicle control agents) to work together, and receive the action results of each agent.
Arbitrate, feed back and execute: Arbitrate based on historical confidence scores (the agents' past success rates) and the current action result; fall back to the intent recognition model when there are no historical scores; finally, feed the result back to the actuation system (such as the IVI or voice broadcast) to complete the operation.
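The intent parsing, dispatch, and arbitration steps described above might fit together roughly as in the following sketch; the intents, confidence values, scoring, and agent names are hypothetical and do not reflect GAC's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical end-to-end sketch of the pipeline described above:
# parse intents with confidence scores, dispatch to collaborating agents,
# then arbitrate using historical success rates before actuation.
@dataclass
class Orchestrator:
    agents: Dict[str, Callable[[str], str]]
    history_scores: Dict[str, float] = field(default_factory=dict)  # past success rate per agent
    cache: Dict[str, List[Tuple[str, float]]] = field(default_factory=dict)

    def parse_intents(self, command: str) -> List[Tuple[str, float]]:
        # Stand-in for the fine-tuned intent recognition model (1-3 intents).
        intents = [("navigation", 0.85), ("vehicle_control", 0.9)]
        self.cache[command] = intents
        return intents

    def dispatch(self, intents: List[Tuple[str, float]]) -> Dict[str, str]:
        # Call the target agents related to each recognized intent.
        return {name: self.agents[name](name) for name, _ in intents if name in self.agents}

    def arbitrate(self, intents: List[Tuple[str, float]], results: Dict[str, str]) -> str:
        # Prefer the agent with the best historical success rate; fall back
        # to the intent model's confidence when no history exists.
        def score(item: Tuple[str, float]) -> float:
            name, confidence = item
            return self.history_scores.get(name, confidence)
        winner, _ = max(intents, key=score)
        return results[winner]

if __name__ == "__main__":
    orch = Orchestrator(
        agents={
            "navigation": lambda _: "Route to nearest gas station set",
            "vehicle_control": lambda _: "Cabin temperature set to 22°C",
        },
        history_scores={"vehicle_control": 0.95},
    )
    command = "Find a gas station and cool the cabin down"
    intents = orch.parse_intents(command)
    results = orch.dispatch(intents)
    print(orch.arbitrate(intents, results))   # feed back to the IVI / voice broadcast
```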