Data closed loop research: as intelligent driving evolves from data-driven to cognition-driven, what changes are needed for the data closed loop?
As Software 2.0 and end-to-end technology are introduced into autonomous driving, the intelligent driving development model has evolved from rule-based sub-task modules to the data-driven AI 2.0 stage, and is gradually moving toward artificial general intelligence (AGI), i.e., AI 3.0.
At Auto China 2024, SenseAuto previewed DriveAGI, its next-generation autonomous driving technology, which builds on large multimodal models to improve and upgrade end-to-end intelligent driving solutions. DriveAGI evolves the autonomous driving foundation model from data-driven to cognition-driven: it goes beyond the concept of a driver, understands the world more deeply, and offers stronger reasoning, decision-making, and interaction capabilities. Among current autonomous driving solutions, it is the one closest to human thinking patterns, best at understanding human intentions, and most capable of coping with difficult driving scenarios.
The data closed loop has been indispensable to autonomous driving R&D since AI 1.0, but the requirements for each link of the loop vary greatly across the different stages of AI application in autonomous driving.
What changes will full-stack, model-based development of intelligent driving systems bring to the data closed loop?
From the perspective of data flow, intelligent driving data is currently collected in many ways: dedicated collection vehicles, data collection and backhaul by production vehicles, roadside data collection and fusion, low-altitude drone traffic data collection, and simulated synthetic data. The goal is to achieve maximum coverage, the most generalized scenarios, and the most complete data types, ultimately satisfying the three elements of data: quantity, completeness, and accuracy. Among these methods, data collection by production vehicles is the mainstream.
OEMs keep accumulating massive amounts of intelligent driving data with production vehicles and extracting effective, high-quality data to train AI algorithms. For example, Li Auto has scored the driving behaviors of more than 800,000 car owners, about 3% of whom score above 90 points and can be called "experienced drivers." The skilled driving data of these car owners is the fuel for training end-to-end models; Li Auto expected its end-to-end model to have learned from more than 5 million kilometers of driving by the end of 2024.
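To make the idea concrete, below is a minimal, hypothetical Python sketch of how fleet data might be filtered by driver score before training; the record fields and the 90-point threshold simply mirror the Li Auto example above and do not represent its actual pipeline.

```python
# Hedged sketch: selecting "experienced driver" clips from fleet data for
# end-to-end training. The record layout (driver_score, clip_uri, ...) is
# hypothetical; only the 90-point threshold follows the example above.
from dataclasses import dataclass
from typing import List

@dataclass
class DrivingClip:
    owner_id: str
    driver_score: float     # 0-100 driving-behavior score
    mileage_km: float
    clip_uri: str           # pointer to the raw sensor/video segment

def select_training_clips(clips: List[DrivingClip],
                          min_score: float = 90.0) -> List[DrivingClip]:
    """Keep only clips from highly scored drivers as end-to-end training fuel."""
    return [c for c in clips if c.driver_score >= min_score]

if __name__ == "__main__":
    fleet = [
        DrivingClip("owner_001", 93.5, 12.4, "s3://fleet/clip_001"),
        DrivingClip("owner_002", 71.0, 8.9,  "s3://fleet/clip_002"),
    ]
    selected = select_training_clips(fleet)
    print(f"{len(selected)} of {len(fleet)} clips kept, "
          f"{sum(c.mileage_km for c in selected):.1f} km of training data")
```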
So, with sufficient data, how can effective scene data be fully extracted and higher-quality training data be mined? The following examples illustrate:
In terms of data compression, the data collected by vehicles mostly comes from the environmental perception data of vehicle systems and various sensors. Before being used for analysis or model training, the data must be strictly preprocessed and cleaned to ensure its quality and consistency. Vehicle data may come from different sensors and devices, each with its own specific data format, and high-definition intelligent driving scene data stored in RAW format (i.e., raw camera data that has not been processed by the ISP algorithm) is expected to become a trend for high-quality scene data. Vcarsystem's "camera-based RAW data compression and collection solution" not only improves the efficiency of data collection, but also maximizes the integrity of the raw data, providing a reliable foundation for subsequent data processing and analysis. Compared with traditional replay of post-ISP compressed data, RAW compressed data replay avoids the information loss of ISP processing and restores the raw image data more accurately, improving the accuracy of algorithm training and the performance of the intelligent driving system.
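The toy sketch below illustrates why RAW-domain replay can be bit-exact while post-ISP replay is not; the simplistic tone-mapping "ISP" here is purely illustrative and is not Vcarsystem's actual solution.

```python
# Hedged sketch contrasting lossless RAW-domain compression with replaying
# post-ISP data. The toy "ISP" (simple tone map + 8-bit quantization,
# demosaicing skipped) only illustrates where information is lost.
import zlib
import numpy as np

rng = np.random.default_rng(0)
raw = rng.integers(0, 4096, size=(1080, 1920), dtype=np.uint16)  # 12-bit Bayer frame

# RAW path: lossless compression -> bit-exact reconstruction on replay.
packed = zlib.compress(raw.tobytes(), level=6)
restored = np.frombuffer(zlib.decompress(packed), dtype=np.uint16).reshape(raw.shape)
assert np.array_equal(raw, restored)          # no information lost

# Post-ISP path: tone mapping + 8-bit quantization discards sensor information.
isp_out = np.clip((raw.astype(np.float32) / 4095.0) ** (1 / 2.2) * 255, 0, 255).astype(np.uint8)
recovered = ((isp_out.astype(np.float32) / 255.0) ** 2.2 * 4095).astype(np.uint16)
print("RAW replay max error:", np.abs(raw.astype(int) - restored.astype(int)).max())   # 0
print("ISP replay max error:", np.abs(raw.astype(int) - recovered.astype(int)).max())  # nonzero
```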
As for data mining, cases based on offline 3D point cloud foundation models deserve attention. For example, based on an offline point cloud foundation model, QCraft can mine high-quality 3D data and continuously improve object recognition capabilities. QCraft has also built an innovative text-to-image multimodal model. Given only a natural language description, the model can automatically retrieve the corresponding scene images without supervision, mining many long-tail scenes that are hard to find through ordinary data queries and rarely encountered in daily driving, thereby improving the efficiency of long-tail scene mining. For example, when text descriptions such as "a large truck traveling in the rain at night" or "a person lying at the roadside" are entered, the system automatically returns the corresponding scenes for targeted analysis and training.
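QCraft's multimodal model is proprietary, but a similar text-driven retrieval step can be sketched with an open-source CLIP model from Hugging Face Transformers, as below; the model choice and ranking scheme are assumptions for illustration only.

```python
# Hedged sketch of text-driven long-tail scene mining: embed a text query and
# rank fleet images by similarity. An open-source CLIP model stands in for the
# proprietary multimodal model described above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mine_scenes(query: str, image_paths: list, top_k: int = 5):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_text[0]               # similarity of the query to each image
    ranked = scores.argsort(descending=True)[:top_k]
    return [(image_paths[i], scores[i].item()) for i in ranked]

# e.g. mine_scenes("a large truck traveling in the rain at night", fleet_image_paths)
```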
As foundation models find broad application and deep learning technology advances, the demand for data labeling is growing explosively. The performance of foundation models depends heavily on the quality of the input data, so the requirements for the accuracy, consistency, and reliability of data labeling keep rising. To meet this demand, many data labeling companies have begun to develop automatic labeling functions to further improve labeling efficiency. Examples include:
Based on the automation capabilities of foundation models, DataBaker Technology has launched 4D-BEV, a new labeling tool that supports processing point clouds with hundreds of millions of points. It helps quickly and accurately perceive and understand the vehicle's surroundings, and combines static and dynamic perception tasks for multi-perspective, multi-sequence labeling of objects such as vehicles, pedestrians and road signs, providing more accurate information on object location, speed, posture and behavior. It can also provide the interaction information of different objects in the scene, helping the autonomous driving system better understand road traffic conditions and make more accurate decisions and control. To improve labeling efficiency and accuracy, DataBaker Technology adds machine vision algorithms to 4D-BEV to automatically complete complex labeling work, enabling high-quality recognition of lane lines, curbs, stop lines, etc.
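As an illustration of what such 4D labels can carry, here is a hypothetical annotation schema (not DataBaker's actual format) in which each object track records pose, size, velocity and behavior over time:

```python
# Hedged sketch of a possible 4D-BEV annotation record: an object track
# annotated across time with pose, velocity and behavior. The schema is
# hypothetical and only illustrates the idea.
from dataclasses import dataclass, field
from typing import List

@dataclass
class BoxState:
    timestamp: float                  # seconds
    center_xyz: tuple                 # (x, y, z) in the ego/BEV frame, metres
    size_lwh: tuple                   # (length, width, height), metres
    yaw: float                        # heading, radians
    velocity_xy: tuple                # (vx, vy), m/s

@dataclass
class ObjectTrack:
    track_id: str
    category: str                     # "vehicle", "pedestrian", "road_sign", ...
    states: List[BoxState] = field(default_factory=list)
    behavior: str = "unknown"         # e.g. "cut-in", "crossing", "stationary"

@dataclass
class Scene4D:
    scene_id: str
    lane_lines: list                  # static map elements (polylines)
    tracks: List[ObjectTrack] = field(default_factory=list)
```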
MindFlow's SEED data labeling platform supports all types of 2D, 3D, and 4D labeling in autonomous driving and other scenarios, including 2D/3D fusion, 3D point cloud segmentation, point cloud sequential frame overlay, BEV, 4D point cloud lane lines and 4D point cloud segmentation, and covers all labeling sub-scenarios of autonomous driving. Its AI labeling models incorporate AI intelligent segmentation based on the SAM segmentation model, static road adaptive segmentation, dynamic obstacle AI preprocessing, and AI interactive labeling, improving the average efficiency of data labeling in typical autonomous driving scenarios by more than 4-5 times, and by more than 10-20 times in some scenarios. In addition, MindFlow's data labeling foundation model is based on weakly supervised and semi-supervised learning, using a small amount of manually labeled data and a large amount of unlabeled data for efficient detection, segmentation, and recognition of scene objects.
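For a sense of how SAM-based pre-labeling works in practice, the sketch below uses Meta's open-source segment_anything package to generate draft masks for human review; the checkpoint path and thresholds are placeholders, and this is not MindFlow's proprietary pipeline.

```python
# Hedged sketch of SAM-based pre-labeling: generate candidate masks
# automatically, keep the confident ones as draft labels for human correction.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(sam, pred_iou_thresh=0.88)

def prelabel(image_path: str):
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)          # list of candidate segments
    # Keep confident masks as draft labels; annotators then correct/classify them.
    return [m for m in masks if m["predicted_iou"] > 0.9]
```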
Additionally, on July 27, 2024, NIO officially announced NWM (NIO World Model), China's first intelligent driving world model. As a multivariate autoregressive generative model, it can fully understand information, generate new scenes, and predict what may happen in the future. Notably, as a generative model, NWM can use a 3-second driving video as a prompt to generate a 120-second video. Because it is trained through a self-supervised process, NWM requires no data labeling and is therefore more efficient.
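NIO has not published NWM's architecture, but the autoregressive rollout idea (condition on a roughly 3-second prompt, then repeatedly predict the next latent frame until about 120 seconds are generated) can be sketched conceptually as follows, with a tiny GRU standing in for the real model:

```python
# Purely conceptual sketch of autoregressive world-model rollout; the toy GRU,
# latent size and frame rate are invented for illustration.
import torch
import torch.nn as nn

LATENT_DIM, FPS = 64, 10

class ToyWorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(LATENT_DIM, 256, batch_first=True)
        self.head = nn.Linear(256, LATENT_DIM)

    def forward(self, latents):                     # (B, T, LATENT_DIM)
        h, _ = self.rnn(latents)
        return self.head(h[:, -1])                  # predict the next latent frame

model = ToyWorldModel().eval()
prompt = torch.randn(1, 3 * FPS, LATENT_DIM)        # latents of a 3-second prompt clip
context, generated = prompt, [prompt]
with torch.no_grad():
    for _ in range(117 * FPS):                      # extend to ~120 s in total
        next_latent = model(context).unsqueeze(1)
        generated.append(next_latent)
        context = torch.cat([context, next_latent], dim=1)[:, -3 * FPS:]  # sliding window
video_latents = torch.cat(generated, dim=1)         # (1, 1200, LATENT_DIM) -> decode to video
print(video_latents.shape)
```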
High-level intelligent driving needs to be tested in various complex and diverse scenarios, which requires not only high-precision sensor perception and restoration capabilities, but also powerful 3D scene reconstruction and scene-coverage generalization capabilities.
PilotD Automotive's full physical-level sensor model can simulate detailed physical phenomena, such as multipath reflection, refraction and interference of electromagnetic waves, as well as dynamic sensor characteristics such as detection loss rate, object resolution, measurement inaccuracy and "ghost" phenomena, so as to achieve the fidelity required of the sensor model. The full physical-level sensor model based on PilotD Automotive's PlenRay physical ray technology currently boasts a simulation restoration rate of over 95%.
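The kinds of dynamic sensor effects listed above can be illustrated with a toy degradation model that applies detection loss, range noise and spurious "ghost" returns to ideal detections; this only illustrates the effects and is not PilotD's PlenRay ray-tracing model.

```python
# Hedged toy model of detection loss rate, measurement inaccuracy and "ghost"
# returns applied to ideal detections; all rates and noise levels are invented.
import numpy as np

rng = np.random.default_rng(42)

def degrade_detections(points_xyz: np.ndarray,
                       loss_rate: float = 0.05,
                       range_sigma_m: float = 0.03,
                       ghost_rate: float = 0.01) -> np.ndarray:
    # 1) detection loss: randomly drop a fraction of true returns
    keep = rng.random(len(points_xyz)) > loss_rate
    pts = points_xyz[keep]
    # 2) measurement inaccuracy: Gaussian noise along the radial (range) direction
    ranges = np.linalg.norm(pts, axis=1, keepdims=True)
    pts = pts * (1 + rng.normal(0, range_sigma_m, size=(len(pts), 1)) / np.maximum(ranges, 1e-6))
    # 3) ghosts: inject a few spurious returns, here mirrored as a crude multipath stand-in
    n_ghosts = rng.binomial(len(pts), ghost_rate)
    ghosts = pts[rng.choice(len(pts), n_ghosts)] * np.array([1.0, -1.0, 1.0])
    return np.vstack([pts, ghosts])

ideal = rng.uniform(-50, 50, size=(5000, 3))
noisy = degrade_detections(ideal)
print(ideal.shape, "->", noisy.shape)
```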
dSPACE's AURELION (high-precision simulation of 3D scenes and physical sensors) is a flexible sensor simulation and visualization software solution. Based on physically based rendering in a game engine, it simulates pixel-level raw camera sensor data. AURELION's radar module uses ray tracing to simulate the signal-level raw data of beam-type sensors. Taking into account the influence of specific materials on lidar, the output point cloud contains reflectivity values close to real-world values, and for each ray it provides realistic motion distortion effects and configurable time offsets.
RisenLighten's Qianxing Simulation Platform adds rich, realistic pedestrian models and supports customization of micro pedestrian trajectories and batch generation of pedestrians. It also provides high-fidelity pedestrian behavior style models covering scenarios such as human-vehicle interaction, crossing, and diagonal crossing at intersections. It models three types of drivers (conservative, conventional and aggressive) and refines their parameters via probability distributions, so as to diversify and randomize the driving behaviors of vehicles in the environment.
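A minimal sketch of such style-parameterized background traffic is shown below; the parameter ranges, mixing weights and distributions are invented for illustration and are not RisenLighten's actual models.

```python
# Hedged sketch of sampling driver behavior styles and per-driver parameters
# from probability distributions, in the spirit of the three types above.
import numpy as np

rng = np.random.default_rng(7)

# style -> (mean [desired speed factor, time headway s, max accel m/s^2], mixing weight)
STYLES = {
    "conservative": ([0.90, 2.5, 1.5], 0.30),
    "conventional": ([1.00, 1.8, 2.0], 0.45),
    "aggressive":   ([1.15, 1.0, 2.8], 0.25),
}

def sample_driver():
    names = list(STYLES)
    weights = np.array([STYLES[n][1] for n in names])
    style = rng.choice(names, p=weights / weights.sum())
    mean = np.array(STYLES[style][0])
    # per-driver randomization around the style mean diversifies behavior
    params = rng.normal(mean, 0.1 * np.abs(mean))
    return style, dict(zip(["speed_factor", "time_headway_s", "max_accel"], params))

for _ in range(3):
    print(sample_driver())
```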
As a generative simulation model, NIO's NSim can compare each trajectory deduced by NWM with the corresponding simulation result. Previously, a trajectory could only be compared with the single trajectory that actually occurred in the real world; adding NSim enables joint verification across tens of millions of virtual worlds, providing more data for NWM training and making the output driving trajectories and experience safer, more reasonable, and more efficient.
In the field of autonomous driving, end-to-end solutions have an even more urgent need for high-fidelity scenes: because the end-to-end system must cope with various complex scenarios, large volumes of video labeled with driving behaviors need to be fed into training. In 3D scene reconstruction, 3D Gaussian Splatting (3DGS) is being adopted and applied in the automotive industry at an accelerating pace, because 3DGS performs well in rendering speed, image quality, positioning accuracy, etc., making up for the shortcomings of NeRF. Meanwhile, scenes reconstructed with 3DGS can replicate the corner cases found in real intelligent driving; through dynamic scene generalization, they improve the ability of the end-to-end intelligent driving system to cope with corner cases. Examples include:
51Sim innovatively integrates 3DGS into traditional graphics rendering engines through AI algorithms, making breakthroughs in realism. 51Sim's fusion solution offers high-quality, real-time rendering. The resulting high-fidelity simulation scenes not only improve training quality for the autonomous driving system, but also significantly improve the authenticity of simulation, making it almost indistinguishable to the naked eye, greatly increasing confidence in simulation results, and making up for the shortfalls of 3DGS in detail and generalization capability.
In addition, Li Auto also uses 3DGS for simulation scene reconstruction. Li Auto's intelligent driving solution consists of three systems: end-to-end (fast system) + VLM (slow system) + world model. The world model combines two technology paths, reconstruction and generation: it uses 3DGS to reconstruct real data and a generative model to offer new views. In scene reconstruction, dynamic and static elements are separated, the static environment is reconstructed, and the dynamic objects are reconstructed with new views generated for them. After re-rendering, the scene forms a 3D physical world in which dynamic assets can be edited and adjusted arbitrarily for partial generalization of the scene. The generative model has greater generalization ability and allows weather, lighting, traffic flow and other conditions to be customized to generate new scenes that conform to real-world laws, which are used to evaluate the adaptability of the autonomous driving system under various conditions.
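A simplified sketch of the dynamic/static separation step is shown below: lidar points falling inside the labeled 3D boxes of moving objects are split out so that the static background and the dynamic assets can be reconstructed (and later edited) separately. Axis-aligned boxes are assumed for brevity; this is not Li Auto's actual implementation.

```python
# Hedged sketch of dynamic/static separation before scene reconstruction.
import numpy as np

def split_static_dynamic(points: np.ndarray, dynamic_boxes: list):
    """points: (N, 3); dynamic_boxes: list of (min_xyz, max_xyz) tuples."""
    dynamic_mask = np.zeros(len(points), dtype=bool)
    for box_min, box_max in dynamic_boxes:
        inside = np.all((points >= np.asarray(box_min)) &
                        (points <= np.asarray(box_max)), axis=1)
        dynamic_mask |= inside
    return points[~dynamic_mask], points[dynamic_mask]   # static, dynamic

# static_pts feed the background reconstruction; dynamic_pts become editable assets.
```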
In short, the scene constructed by combining reconstruction and generation creates a better virtual environment for learning and testing the capabilities of the autonomous driving system, enabling the system to have efficient closed-loop iteration capabilities and ensuring the safety and reliability of the system.
The data closed loop is divided into the perception layer and the planning and control layer, each with an independent closed-loop process. In both areas, data closed-loop technology providers are working to improve their service capabilities. For example:
In terms of perception, during project development the autonomous driving system is released in regular versions that integrate and package perception, planning and control, communication, middleware and other contents. Some intelligent driving solution providers such as Nullmax release the perception part separately first, then test it with automatic tools and testers, output specific reports, and evaluate fixes at an early stage. If there are problems with the perception version, there is still time to modify and retest it. This largely prevents upstream perception problems from affecting the entire system, makes problem location and system improvement easier, and greatly improves the efficiency of system release and project development.
In terms of planning and control, take QCraft as an example: its self-developed "joint spatio-temporal planning algorithm" considers both space and time when planning the trajectory, solving the driving path and speed simultaneously in three dimensions, rather than solving the path first and then solving the speed along that path to form the trajectory. Upgrading from "lateral-longitudinal separation" to "lateral-longitudinal coupling" means that both the path and the speed profile are treated as variables in a single optimization problem, yielding the optimal combination of the two.
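Schematically (and not as QCraft's exact formulation), the difference can be written as follows, assuming standard trajectory optimization notation with amsmath:

```latex
% Decoupled ("lateral-longitudinal separation"): solve a path, then a speed profile on it:
%   \min_{\sigma} J_{\mathrm{path}}(\sigma), \qquad \min_{v} J_{\mathrm{speed}}(v \mid \sigma^{*})
% Joint spatio-temporal planning: path and speed are one trajectory p(t) = (x(t), y(t)):
\min_{p(\cdot)} \int_{0}^{T} \left( w_1 \lVert \ddot{p}(t) \rVert^{2}
      + w_2 \lVert \dddot{p}(t) \rVert^{2} \right) dt
\quad \text{s.t.}\quad p(t) \in \mathcal{F}_{\mathrm{free}},\;
\lVert \dot{p}(t) \rVert \le v_{\max},\;
\lVert \ddot{p}(t) \rVert \le a_{\max}.
```

Because the path shape and the speed profile are coupled decision variables in one problem, the optimizer can trade them off against each other instead of fixing one before the other.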
Data closed-loop technology providers generally offer either complete data closed-loop solutions or standalone data closed-loop products (i.e., modular tool services such as annotation platforms, replay tools, and simulation tools) to OEMs and Tier 1 suppliers. OEMs with strong data governance capabilities often outsource the tool modules they are not good at and integrate them into their own data processing platforms, while OEMs with weaker data governance capabilities tend to consider tightly coupled data closed-loop products or customized services. For example, FUGA, Freetech's new-generation tightly coupled data closed-loop platform, has accumulated more than 8 million kilometers of real mass-production data and closed-loop algorithm iteration experience across more than 100 production models, achieving a more-than-100-fold improvement in algorithm iteration efficiency and managing over 3,000 sets of high-value scene data fragments per month. At present, FUGA has been deployed in production vehicle projects of multiple leading OEMs, supporting daily analysis of test data issues as well as weekly data cleaning and statistical report analysis.