Foundation Models for Embodied Intelligence


With continuous breakthroughs in multimodal understanding and generation, foundation models are driving artificial intelligence from cognition in virtual spaces toward perception and action in the physical world. Embodied intelligence, as a key direction in this transformation, integrates multimodal sensory capabilities such as vision, audition, and touch, enabling agents to understand environments, interact with humans, and accomplish complex tasks. It is increasingly recognized as a critical pathway toward artificial general intelligence. However, current foundation models for embodied intelligence still face significant challenges: limited generalization across entities, tasks, and environments; a pronounced gap between simulation training and real-world deployment; and underdeveloped systems for safety and interpretability. Addressing these issues requires deep, cross-disciplinary collaboration and innovation.

This forum, themed “Foundation Models for Embodied Intelligence,” brings together leading experts and scholars from robotics, computer vision, natural language processing, reinforcement learning, and related fields. Discussions will focus on building core capabilities, overcoming technical bottlenecks, and exploring future trends in embodied intelligence. The forum aims to refine key research directions, foster interdisciplinary collaboration, and promote the establishment of technical standards and evaluation frameworks, ultimately accelerating the transition of embodied intelligence foundation models from research to real-world applications.

Schedule

Nov. 2nd 13:00-15:00

Organizer

Wengang Zhou

University of Science and Technology of China, Professor

Biography:
Dr. Wengang Zhou is a Professor and Ph.D. advisor at the School of Information Science and Technology, University of Science and Technology of China (USTC), and a recipient of the National Excellent Young Scientists Fund. His research interests include multimedia information retrieval, computer vision, and game theory. He has published over 100 papers in leading IEEE journals (such as TPAMI, TIP, and TMM) and CCF-A international conferences (CVPR, ACM MM, AAAI), with more than 20,000 citations on Google Scholar. He serves as an Associate Editor and Lead Guest Editor of IEEE Transactions on Multimedia. He has received the Wu Wenjun AI Science and Technology Progress First Prize, the Young Scientist Award from the China Society of Image and Graphics, the CAS Excellent Doctoral Dissertation Award, and the CAS Excellent Mentor Award. He has also been selected as an Outstanding Member of the Youth Innovation Promotion Association of CAS and for the “Young Elite Scientists Sponsorship Program” of CAST.

Jiajun Deng

National University of Singapore, Research Fellow

Biography:
Dr. Jiajun Deng is a Postdoctoral Research Fellow at the National University of Singapore. He studied at the University of Science and Technology of China from 2012 to 2021, receiving his B.Eng. (2016) and Ph.D. (2021) degrees. His research focuses on multimodal understanding and spatial intelligence, with long-term work in 3D scene object detection, 3D reconstruction, vision–language understanding, and multi-sensor fusion. He has published over 50 high-quality papers in venues such as IEEE TPAMI, NeurIPS, CVPR, ICCV, and ECCV. He was recognized in the 2024 Stanford University & Elsevier list of the “World’s Top 2% Scientists.” He serves as a Guest Editor for IEEE Transactions on Multimedia and Area Chair for ACM Multimedia.

Presenters

Weishi Zheng

Sun Yat-sen University, Professor

Biography:
Dr. Weishi Zheng is a Professor at the School of Computer Science, Sun Yat-sen University, a Distinguished Professor under the Changjiang Scholars Program of the Ministry of Education, IAPR Fellow, and a Newton Advanced Fellow of the Royal Society (UK). He currently serves as the Associate Dean of the School of Computer Science at Sun Yat-sen University and Director of the Key Laboratory of Machine Intelligence and Advanced Computing (Ministry of Education). His research focuses on machine learning and AI applications. He has published more than 200 papers in top-tier journals and conferences (CCF-A / SCI Q1 / Nature sub-journals), including over 30 in IEEE T-PAMI, IJCV, SIGGRAPH, and Nature Communications. He serves as an editorial board member of IEEE T-PAMI and other leading journals. He has led five national-level projects, including NSFC Major Research Plan Key Projects and NSFC Joint Fund Key Projects, and the Guangdong Provincial Outstanding Youth Team project. His awards include the First Prize of Natural Science from CSIG, the First Prize of Guangdong Natural Science Award, and the Second Prize of National Teaching Achievement Award.

Speech Title: Embodied Perception and Learning for Dexterous Robotic Hand Manipulation

Abstract: Achieving human-like dexterous grasping is a key goal of embodied intelligence. In this domain, we present three contributions. First, we propose the DGTR framework, which formulates grasp generation as a set prediction problem and introduces a two-stage progressive training strategy, enabling stable and diverse grasp prediction. Second, to enable language-driven dexterous hand grasping, we propose the DexGYSGrasp framework, which allows robots to learn language-conditioned grasp distributions and generate high-quality grasps aligned with human intent. Most recently, we introduce AffordDexGrasp, which leverages object affordances for zero-shot generalization, enabling robots to perform language-guided dexterous grasps on unseen object categories.

Shanghang Zhang

Peking University, Assistant Professor

Biography:
Dr. Shanghang Zhang is a Researcher and Ph.D. advisor at the School of Computer Science, Peking University, a Boya Young Scholar, and a ZhiYuan Scholar. She received her Ph.D. from Carnegie Mellon University in 2018 and conducted postdoctoral research at UC Berkeley. Her research focuses on open-world generalizable machine learning theory and systems. She has published over 120 papers in top AI conferences and journals, with 17,000+ Google Scholar citations. She received the AAAI 2021 Best Paper Award. She is the author of the Springer Nature book Deep Reinforcement Learning, with nearly 300,000 downloads worldwide, and was named among the “Annual High-Impact Research Highlights” by Chinese authors. She has been recognized as an EECS Rising Star, among the Global Chinese Women AI Scholars, a member of CAST’s “Youth 100” program, and an AI100 Young Pioneer. She won first place in the International Multimodal Brain Response Prediction Competition and the ICCV Continual Generalization Challenge. She has organized workshops at NeurIPS and ICML and served multiple times as a Senior PC member for AAAI.

Speech Title: Open-World Multimodal Foundation Models for Embodied Intelligence

Abstract: Foundation models and embodied intelligence have both made remarkable progress in recent years. However, embodied agents in real-world open environments face significant challenges in generalization across entities, scenes, and tasks. This talk will present a series of studies on embodied multimodal foundation models, with a particular focus on large-scale foundation models for embodied intelligence, including embodied brain-scale models and end-to-end large models. The talk will also discuss efforts in building large-scale datasets for embodied intelligence.

Jiangmiao Pang

Shanghai AI Laboratory, Research Scientist

Biography:
Dr. Jiangmiao Pang is a Young Scientist at Shanghai AI Laboratory and Head of the Embodied Intelligence Center. His research focuses on robot learning, multimodal learning, and embodied intelligence, with the goal of building a unified and generalizable embodied AGI system. He has published more than 60 papers in top-tier venues including TPAMI, IJCV, CVPR, and CoRL, with over 14,000 Google Scholar citations. His open-source projects have accumulated more than 45,000 GitHub stars and are widely used in both academia and industry. He has received honors including ECCV 2024 Best Paper Nomination, RSS 2025 Best System Paper Nomination, and Most Influential Paper Awards at CVPR 2023 and ECCV 2024.

Speech Title: Intern Robotics: Bookworm Embodied Full-Stack Engine and Key Technologies

Abstract: Embodied intelligence has recently attracted great attention and made rapid progress, but challenges remain, including unclear task definitions, insufficient and low-diversity data, poor generalization, and difficulties in evaluation. This talk introduces the Intern Robotics embodied full-stack engine and its key technologies developed at Shanghai AI Laboratory. The system integrates a simulation engine, a data engine, and a training-evaluation engine, providing foundation models and dedicated toolchains for embodied intelligence. It supports training embodied foundation models that generalize across entities, tasks, and scenes via seamless virtual-real integration, empowering heterogeneous robots to serve diverse industries.

Xin Jin

Eastern Institute of Technology, Ningbo, Assistant Professor

Biography:
Dr. Xin Jin is an Assistant Professor and doctoral supervisor at Eastern Institute of Technology, Ningbo, and a member of the Zhejiang Province Young Elite Talent Program. His research covers the cutting-edge fields of intelligent media and computer vision. He has published over 40 high-quality papers, which have been cited over 5,000 times on Google Scholar. Many of his research findings have been applied in products from companies such as Microsoft, Alibaba, and Geely Automobile, generating significant economic value. He has received the President's Special Award from the Chinese Academy of Sciences and the IEEE Circuits and Systems Society's Second Visual Signal Processing and Communications Rising Star Award, and is recognized as one of the top 2% of scientists globally by Stanford University. He serves as a committee member of the IEEE VSPC, the Multimedia Special Interest Group of the CSIG, the Embodied Intelligence Special Interest Group of the CAAI, and the Executive Committee of VALSE. He has organized tutorials and workshops related to spatial representation learning, embodied intelligence, and generative technologies at top conferences such as CVPR, ICCV, NeurIPS, ACM MM, and ECCV.

Speech Title: Exploration and Application of Spatial Intelligence Technology in Autonomous Driving and Robotics

Abstract: Spatial intelligence technology is a new-generation artificial intelligence paradigm that uses 3D visual information for environmental understanding, reasoning, generation, and interaction. Its core lies in endowing machine systems with a unified ability to perceive accurately, make decisions efficiently, and act autonomously in a dynamic three-dimensional world. As a key link between artificial intelligence and the physical world, it is reshaping the technical paradigms and business models of autonomous driving and robotics.

The 13th International Conference on Image and Graphics (ICIG 2025)
