The forum on “Spatial Intelligence and World Model for the Autonomous Driving and Robotics” explores breakthroughs in, and industrial applications of, spatial intelligence and world models for autonomous driving and robotics. In recent years, the rapid development of AI technology has fueled surging demand in both fields. Spatial intelligence enhances AI’s ability to perceive and interact with 3D environments, while world models enable AI systems to predict and reason autonomously by simulating the dynamics of real-world environments. Integrating the two holds promise for giving embodied intelligent systems, such as autonomous vehicles and robots, the capacity for autonomous decision-making and adaptation.
The core objectives of this forum are: (1) exploring theoretical breakthroughs in world models and spatial intelligence, and developing more efficient 4D spatio-temporal modeling methods; (2) discussing the latest applications in autonomous driving and robotics, including digital twins, scene generation, and agent optimization; (3) promoting technical standardization and industrialization, and fostering collaboration among industry, academia, and research institutions. The forum brings together leading experts from around the world, with the aim of accelerating the transition of AI from perception to cognition, reducing the cost of deploying autonomous driving technology, enhancing the adaptability of robots in open environments, and advancing spatial intelligence and world models toward practical applications.
Schedule
Nov. 2nd 9:30-11:30
Organizers
Jingyi Yu
ShanghaiTech University, Professor
Biography:
Professor Jingyi Yu is an OSA Fellow, IEEE Fellow, ACM Distinguished Scientist, and Director of the Ministry of Education Key Laboratory of Intelligent Perception and Human-Computer Collaboration. He received a dual bachelor’s degree from the California Institute of Technology (Caltech) in 2000 and a Ph.D. from the Massachusetts Institute of Technology (MIT) in 2005. He is currently the Deputy Provost, Professor, and Dean of the School of Information Science and Technology at ShanghaiTech University. Professor Yu works in computer vision, computational imaging, computer graphics, and bioinformatics, and has received the CAREER Award from the U.S. National Science Foundation. In the field of intelligent light field research, he holds over ten international PCT patents, which have been widely applied in smart cities, digital humans, and human-computer interaction scenarios. He also serves as an editorial board member for top-tier journals such as IEEE TPAMI and IEEE TIP, and as program chair for multiple international artificial intelligence conferences (ICCP 2016, ICPR 2020, WACV 2021, CVPR 2021, ICCV 2025). He is a member of the World Economic Forum (WEF) “Global Agenda Council” and serves as the Curator for the Metaverse direction.
Xin Jin
Eastern Institute of Technology, Ningbo, Assistant Professor
Biography:
Xin Jin is an assistant professor and doctoral supervisor at Eastern Institute of Technology, Ningbo, and a member of the Zhejiang Province Young Elite Talent Program. His research covers intelligent media and computer vision. He has published over 40 papers, which have been cited over 5,000 times on Google Scholar. Many of his research findings have been applied in products from companies such as Microsoft, Alibaba, and Geely Automobile, generating significant economic value. He has received the President's Special Award from the Chinese Academy of Sciences and the IEEE Circuits and Systems Society's Second Visual Signal Processing and Communications Rising Star Award, and has been recognized as one of the top 2% of scientists globally by Stanford University. He serves as a committee member of the IEEE VSPC, the Multimedia Special Interest Group of the CSIG, the Embodied Intelligence Special Interest Group of the CAAI, and the Executive Committee of VALSE. He has organized tutorials and workshops on spatial representation learning, embodied intelligence, and generative technologies at top conferences such as CVPR, ICCV, NeurIPS, ACM MM, and ECCV.
Presenters
Li Zhang
Fudan University, Professor
Biography:
Li Zhang is a Professor and Doctoral Supervisor at the School of Data Science, Fudan University, and a Full-Time Mentor at the Shanghai Innovation Institute, with support from the National Youth Talent Program. He earned his Ph.D. from the Department of Electronic Engineering and Computer Science at Queen Mary University of London, and previously served as a Postdoctoral Researcher at the Department of Engineering Science, University of Oxford, and as a Research Scientist at the Cambridge Samsung AI Center. His honors include the Shanghai Overseas High-Level Talent Program, the Shanghai Science and Technology Youth 35 Leading Talent Program (35U35), Elsevier China Highly Cited Scholar, and the World Artificial Intelligence Conference Young Excellent Paper Award. He has published over 90 papers in international artificial intelligence journals and conferences such as IEEE TPAMI, IJCV, and NeurIPS, with over 20,000 citations in total. He serves as an area chair for the international artificial intelligence conferences NeurIPS 2023, NeurIPS 2024, NeurIPS 2025, CVPR 2023, CVPR 2024, and CVPR 2025, and as an associate editor for the journal Pattern Recognition.
Speech Title: Trustworthy World Engine
Abstract: This talk systematically introduces the research group's work on 3D/4D reconstruction and generation. It focuses on three topics: achieving precise mesh extraction, real-time rendering, and dynamic 3D scene representation; optimizing gradient propagation paths in generative models to improve the quality of high-resolution 3D object generation; and leveraging video generation priors to optimize 3D models under free trajectories, thereby achieving accurate material and lighting estimation based on physically based rendering (PBR). Building on this work, a new trustworthy simulation engine is constructed, featuring: multi-modal realistic scene rendering; support for closed-loop evaluation to accommodate free-form trajectory behavior; provision of highly diverse dynamic scenes for comprehensive evaluation; support for multi-agent collaboration to account for interactive dynamics; and high computational efficiency to ensure cost-effectiveness and scalability. The talk will also discuss the application of 3D understanding in cutting-edge fields such as embodied intelligence.
Ye Shi
ShanghaiTech University, Researcher
Biography:
Dr. Ye Shi is currently an Assistant Professor, Researcher, and PhD Supervisor at the School of Information Science and Technology, ShanghaiTech University, and the Director of the YesAI Trustworthy and General Intelligence Laboratory. In recent years, he has published over 70 papers in top conferences and journals (NeurIPS, ICML, ICLR, CVPR, ICCV, TNNLS, TSG, etc.). His research focuses on controllable, robust, and secure artificial intelligence theories, algorithms, and applications, with a systematic exploration of the theoretical foundations of controllable diffusion models and their applications in embodied intelligence. Dr. Ye Shi serves as Area Chair for NeurIPS 2025 and organizes the Human-Computer Interaction and Collaboration Workshop at ICCV 2025. He has been selected for the Shanghai Overseas Leading Talent Program and the Shanghai Yangfan Program, and has secured a National Natural Science Foundation grant. He has received the National Outstanding Overseas Student Award, the Outstanding Paper Award at the Generative Theory Workshop at ICLR 2025, and the Best Paper Award at IEEE ICCSCE 2016.
Speech Title: Reconstructing Embodied Intelligence Theory and Algorithm Systems Based on Diffusion Models
Abstract: This talk summarizes the team’s research on diffusion-model-driven embodied intelligence. On the theory side, two contributions are presented: DSG establishes a loss-guided error lower bound theory through spherical Gaussian constraints, achieving manifold-constrained acceleration at zero training cost; the UniDB framework unifies diffusion bridge methods through stochastic optimal control, showing that traditional methods arise as special cases. At the algorithm level, two reinforcement learning engines are developed: QVPO unifies exploration and exploitation in an off-policy reinforcement learning framework, and GenPO establishes the first diffusion-driven on-policy reinforcement learning paradigm. In terms of validation, cross-domain generalization is achieved: AffordDP combines visual models with point cloud registration to achieve cross-category generalization, and DreamPolicy achieves zero-shot generalization on complex terrains for humanoid robots through terrain-aware diffusion. Together, these results form a complete technical chain from generative modeling to decision control, providing theoretically grounded, practical, and efficient solutions for embodied intelligence.
Jingya Wang
ShanghaiTech University, Researcher
Biography:
Dr. Jingya Wang is currently a Researcher, Assistant Professor, and PhD Supervisor at the School of Information Science and Technology, ShanghaiTech University. Her research interests focus on human-centered 3D interaction and embodied intelligence. She has published over 50 papers in top-tier conferences and journals in computer vision, including over 40 papers in CCF-A class journals. She has served as Area Chair for conferences such as CVPR, NeurIPS, ICML, ICCV, ECCV, and ACM MM. She has been selected for the Shanghai Overseas Leading Talent Program and the Shanghai Yangfan Program, and has led projects funded by bodies such as the National Natural Science Foundation of China. She received the 2018 CVPR Doctoral Consortium Award, and her first-authored paper was selected as one of the “Best of CVPR Papers” by Computer Vision News Magazine in 2018. In 2023, she was included in Baidu's AI Chinese Women Young Scholars List. She was nominated for the Best Paper Award at the 2024 ACM Design Automation Conference and for the Best Paper Award at the 2024 ACM Multimedia Conference.
Speech Title: Interaction and Embodied Intelligence for Spatial Intelligence
Abstract: One of the main challenges in embodied intelligence research is the scarcity of data and the high cost of collecting real-world data. How to effectively use vast amounts of online data and prior knowledge of human interactions in the real world to guide robot learning has become a key issue. In this talk, we will examine how to extract high-precision 3D motion interaction information in open environments to enhance the robustness of full-body motion control in humanoid robots. We will also share our research findings on 3D interaction extraction, 3D interaction reasoning, 3D interaction simulation, and their applications in intelligent decision-making. Finally, we will explore how learning from vast amounts of human interaction prior knowledge can improve the generalization of embodied intelligence in open worlds, and thus its performance in new environments and with new objects.
Li Jiang
The Chinese University of Hong Kong, Shenzhen, Assistant Professor
Biography:
Li Jiang is an Assistant Professor in the School of Data Science and a Presidential Young Scholar at The Chinese University of Hong Kong, Shenzhen. She received her Ph.D. from The Chinese University of Hong Kong in 2021 and subsequently worked as a postdoctoral researcher at the Max Planck Institute. Her research focuses on computer vision and artificial intelligence, specifically in areas such as 3D scene understanding, autonomous driving, spatial intelligence, world models, representation learning, and multimodal learning. Her work has been published in top-tier conferences and journals including CVPR, ICCV, ECCV, NeurIPS, TPAMI, and IJCV, with several papers selected for oral presentations and highlights. Her Google Scholar citation count exceeds 12,000. Her research on motion prediction for autonomous driving achieved first place in the CVPR Waymo Open Dataset Motion Prediction Challenge for three consecutive years (2022-2024). She was named to the 2024 World's Top 2% Scientists list by Stanford University and Elsevier for annual impact, as well as the "Top 50 Global Female AI Talent List" jointly released by UNIDO-ITPO (Beijing) and DONGBI DATA. She has also received funding from national-level young talent programs.
Speech Title: World Models Based on Omni Scene Modeling for Autonomous Driving
Abstract: As autonomous driving systems increasingly demand generalization, comprehension, and prediction capabilities, building a world model with universal world knowledge has become critically important. This presentation will introduce our recently proposed self-supervised general world model, DriveX, designed to construct generalized and predictive world knowledge representations for autonomous driving systems. The core of DriveX is the Omni Scene Modeling (OSM) module, which unifies geometric structures, semantic information, and visual details into a Bird's-Eye View (BEV) latent space through self-supervision, forming a unified and spatially aware world representation. To enhance modeling quality and transferability, DriveX employs a decoupled learning strategy that effectively separates representation learning from future state modeling, improving dynamic scene understanding. Additionally, the proposed future spatial attention mechanism dynamically aggregates the world model’s future predictions and adapts them to various downstream tasks, such as planning and prediction, in a unified paradigm. In the future, we plan to further explore the role of world models in autonomous driving data generation and closed-loop simulation, advancing their potential in building controllable, safe, and efficient intelligent driving systems.