Driven by the rapid development of artificial intelligence, “human-centered” visual generation and understanding has become an important direction for the development of intelligent vision systems. With breakthroughs in large-scale language and visual models, human-centered foundation models are emerging, aiming to unify diverse human-centered tasks within a general framework, transcend the limitations of traditional task-specific approaches, and drive progress in cutting-edge applications such as digital humans and human-like avatars. This paradigm shift not only introduces new demands for fundamental theories and core technologies but also poses profound challenges for interdisciplinary collaboration and ethical governance. The ability to generate and interpret visual content demonstrates great potential in digital human modeling, virtual interaction, and intelligent creation, while also confronting practical issues such as data diversity, individual differences, and privacy and security. How to design model architectures with greater generalization and adaptability, explore cognitive mechanisms in multimodal fusion, and achieve efficient and trustworthy visual generation and understanding has become a central focus for both academia and industry. With the theme of “Human-Centered Visual Generation and Understanding”, this forum invites four distinguished experts and scholars from academia to engage in in-depth discussions on fundamental theories, key technologies, application scenarios, and ethical challenges. Through this intellectual exchange, we aim to inspire new research directions, foster cross-domain collaboration, and inject sustained innovation into the development of intelligent vision systems in the era of human-computer integration.
Schedule
Nov. 2nd 13:30-15:30
Organizers
Min Cao
Soochow University, Associate Professor
Biography:
Min Cao is an Associate Professor at Soochow University. She has been recognized as an Outstanding Young Backbone Teacher under the Jiangsu “Qinglan Project” and awarded the “Double Innovation Ph.D.” talent title of Jiangsu Province. She currently serves as Secretary-General of the Professional Committee of the Jiangsu Artificial Intelligence Society, Deputy Secretary-General of the Computer Vision Technical Committee of the Shanghai Computer Society, and Executive Committee Member of CCF Suzhou. Her research interests focus on vision-language multimodal learning. She has published over 30 papers in leading conferences and journals in this field, and holds more than 10 authorized patents. She has presided over projects funded by the National Natural Science Foundation of China (General Program and Youth Program), the Jiangsu Provincial Higher Education Basic Research Program, and several national and provincial key laboratory open projects. Dr. Cao has received multiple academic honors and awards, including the Second Prize of Excellent Academic Paper in Natural Science of Suzhou (2022–2023) and the Second Prize of Suzhou Artificial Intelligence Natural Science Award (2024).
Lei Zhu
Tongji University, Research Professor
Biography:
Lei Zhu is a Research Professor and Ph.D. supervisor at Tongji University, a nationally recognized young talent, and Assistant Dean of the School of Computer Science and Technology. He also serves as the Director of the Center for Large Models and Algorithms at the Institute of Intelligent Engineering, Tongji University. His research focuses on efficient multimodal large models and spatial intelligence. He has published over 100 papers in top-tier journals and conferences, including Proceedings of the IEEE and IEEE TPAMI, and has authored two English monographs; his publications have accumulated more than 11,000 Google Scholar citations, with an H-index of 57. His work has received two Best Paper Nomination Awards at CCF-A conferences, and one paper was selected among China’s Top 100 Most Influential International Academic Papers. Prof. Zhu serves as an Associate Editor for ACM TOMM and IEEE TBD, as well as SPC/AC for multiple CCF-A conferences. He is also the Deputy Secretary-General of the Young Professionals Committee of CSIG. He has led or participated in more than 10 national and provincial research projects, including grants from the National Natural Science Foundation of China. He has received several prestigious awards, including the Second Prize of the Shandong Natural Science Award and the Second Prize of the Wu Wenjun Artificial Intelligence Natural Science Award. He has been listed among the Stanford Top 2% Global Scientists and the ScholarGPS Top 0.05% Worldwide Scientists, and has been nominated as one of the AI 2000 Most Influential Scholars in Multimedia. Under his supervision, his students have received Outstanding Ph.D. Dissertation Awards at the provincial level and Excellent Master’s Thesis Awards from leading academic societies.
Presenters
Guiguang Ding
Tsinghua University, Professor
Biography:
Guiguang Ding is a tenured Professor at Tsinghua University and Party Secretary of the Beijing National Research Center for Information Science and Technology. He is a recipient of the National Science Fund for Distinguished Young Scholars (Category A, extended funding). His research addresses the practical demands of computer vision technologies in areas such as national public security, online content management, autonomous driving, robotics, and intelligent manufacturing. His work focuses on visual perception and understanding, visual model architecture design, and model compression and optimization. He has also led large-scale applications in multimedia content processing and industrial defect detection, developing efficient visual perception computing systems and platforms. Professor Ding has presided over dozens of major projects, including National Key R&D Programs and National Natural Science Foundation key projects. He has published more than 100 academic papers, with over 26,000 citations on Google Scholar. His research outcomes have been successfully applied in leading enterprises such as Kuaishou, OPPO, JD.com, and Lingyun Optics. He has received numerous prestigious awards, including the Second Prize of the State Scientific and Technological Progress Award, the First Prize of the Technical Invention Award of the Chinese Institute of Electronics, and the First Prize of the Wu Wenjun Artificial Intelligence Science and Technology Progress Award.
Speech Title: Research on Inference Optimization Techniques for Multimodal Large Models
Abstract: With the rapid development of large-scale model technologies, model complexity and parameter counts are growing dramatically, posing great challenges for efficient deployment. The efficient design and compression optimization of backbone networks have therefore become key directions in artificial intelligence research. Deploying deep learning models effectively under limited hardware resources is fundamental to enabling large-scale AI applications, and is crucial for bringing large models to edge scenarios such as smartphones and autonomous driving. This talk introduces structural re-parameterization methods for the design of deep learning backbone networks. Guided by this methodology, a new vision backbone network, RepViT, and a real-time object detection model, YOLOv10, both suitable for edge deployment, are presented. In addition, compression and optimization methods for the prefill and decoding stages of multimodal large models will be discussed.
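To make the re-parameterization idea concrete, the minimal sketch below (an illustration only, not the RepViT or YOLOv10 implementation) shows how a training-time block with parallel 3x3, 1x1, and identity branches can be folded into a single, mathematically equivalent 3x3 convolution for inference. BatchNorm fusion and the actual block designs are omitted, and all names are placeholders.

```python
# Minimal sketch of structural re-parameterization: a multi-branch training-time
# block is merged into one 3x3 conv for inference. BN fusion is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RepBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        # Training-time form: sum of 3x3, 1x1, and identity branches.
        return self.conv3(x) + self.conv1(x) + x

    def fuse(self) -> nn.Conv2d:
        """Merge the three branches into a single equivalent 3x3 conv."""
        c = self.conv3.out_channels
        fused = nn.Conv2d(c, c, 3, padding=1)
        # Zero-pad the 1x1 kernel to 3x3 so it can be added to the 3x3 kernel.
        w1 = F.pad(self.conv1.weight.data, [1, 1, 1, 1])
        # Express the identity branch as a 3x3 kernel with 1 at the center.
        w_id = torch.zeros_like(self.conv3.weight.data)
        for i in range(c):
            w_id[i, i, 1, 1] = 1.0
        fused.weight.data = self.conv3.weight.data + w1 + w_id
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused


if __name__ == "__main__":
    block, x = RepBlock(8).eval(), torch.randn(1, 8, 32, 32)
    with torch.no_grad():
        assert torch.allclose(block(x), block.fuse()(x), atol=1e-5)
```

The fused convolution produces the same outputs as the multi-branch block while reducing memory traffic and kernel launches at inference time, which is the core benefit exploited by RepVGG-style backbones.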
Jianhuang Lai
Sun Yat-sen University, Professor
Biography:
Jianhuang Lai is a Second-Level Professor and Ph.D. supervisor at the School of Computer Science, Sun Yat-sen University. He serves as Vice President and Fellow of the China Society of Image and Graphics, and has been President of the Guangdong Society of Image and Graphics (4th and 5th terms). He is a Distinguished Member of the China Computer Federation (CCF), former Vice Chair of the CCF Computer Vision Technical Committee (1st and 2nd terms), Vice President of the Guangdong Society for Artificial Intelligence and Robotics (1st term), and Chair of the AI Committee of the Guangdong Security Association. He received his B.Sc. (1986) and M.Sc. (1989) degrees from Sun Yat-sen University, where he subsequently joined the faculty, and obtained his Ph.D. there in 1999. His research focuses on computer vision, pattern recognition, and machine learning. He has led numerous projects funded by the National Natural Science Foundation of China, Guangdong Key Joint Projects, and the Ministry of Science and Technology. Prof. Lai has received multiple awards, including the First Prize of Natural Science from the Guangdong Provincial Science and Technology Award (2018, rank 1), the Second Prize of Technological Progress from Guangdong Province (2016, rank 3), and the Ding Ying Award (2019); he is also a recipient of the State Council Special Government Allowance. He has published approximately 200 academic papers in top conferences such as ICCV, CVPR, and ICDM, and in leading journals including IEEE TPAMI, IEEE TIP, IEEE TNN, IEEE TKDE, and Pattern Recognition.
Speech Title: Recent Advances in Person Re-Identification: From Individual Pedestrians to Small Groups, and from Ground-Based to Integrated Ground-Air Perspectives
Abstract: Person re-identification (ReID) aims to correctly associate images of the same pedestrian captured by non-overlapping camera views, which has significant research value and practical applications in public security surveillance and intelligent safety systems. In recent years, this task has evolved toward larger crowd scales and more dynamic scenes. The research paradigm has gradually expanded from individual pedestrian identification to small-group recognition, and from traditional ground-based fixed surveillance to an integrated ground-air collaborative perception framework. This transformation has attracted widespread attention and driven rapid development in the field. This talk will present the scientific challenges and research progress in person ReID, including the efforts of our lab in individual, small-group, and ground-air integrated pedestrian ReID. Key contributions include:
- Drone-based pedestrian ReID methods leveraging self-attention models for rotational representation, salient local alignment, and keypoint decoupling;
- Robust group feature extraction across views using salient keypoints, Siamese networks, uncertainty modeling, 3D group layout reconstruction, and collaborative strategies;
- Group-level metrics based on pedestrian-group distance and nearest permutation distance;
- Cross-domain ReID techniques leveraging pedestrian-group associations;
- Construction of large-scale datasets, including the virtual reality group dataset City1M, the cross-modal group image dataset CMGroup, and the group video dataset VVIG.
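As background for the retrieval formulation underlying ReID (and not a description of the specific methods above), the sketch below illustrates the standard pipeline: a query embedding is matched against a gallery of embeddings from other cameras by cosine similarity. The feature dimension and the random features are placeholders for the output of a trained encoder.

```python
# Generic illustration of ReID as retrieval: rank gallery features by cosine
# similarity to a query feature. The embeddings here are random placeholders;
# in practice they come from a trained (individual- or group-level) encoder.
import numpy as np


def rank_gallery(query: np.ndarray, gallery: np.ndarray, top_k: int = 5):
    """Return indices of the top_k gallery entries most similar to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                      # cosine similarity per gallery image
    return np.argsort(-scores)[:top_k], scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gallery_feats = rng.normal(size=(1000, 256))   # e.g. 1000 gallery images
    query_feat = rng.normal(size=256)              # one query image
    top_idx, _ = rank_gallery(query_feat, gallery_feats)
    print("Best matching gallery indices:", top_idx)
```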
Shiguang Shan
Institute of Computing Technology, Chinese Academy of Sciences, Research Professor
Biography:
Shiguang Shan is a Research Professor and Ph.D. supervisor at the Institute of Computing Technology, Chinese Academy of Sciences (ICT), and a member of the Institute’s Administrative Committee. He is an IEEE Fellow and currently serves as Director of the Intelligent Information Processing Laboratory and Deputy Director of the National Key Laboratory of Intelligent Algorithms and Security. He has been recognized as a Leading Talent in the “Ten Thousand Talents Program” for Scientific Innovation, a recipient of the National Science Fund for Distinguished Young Scholars, and a key expert of the National “Hundred, Thousand, Ten Thousand Talents Program.” He also holds the Special Government Allowance from the State Council and received the inaugural Tencent Science Exploration Award. Additionally, he received the CCF Young Scientist Award. Prof. Shan’s research focuses on artificial intelligence, particularly in image recognition, object detection, AI security, and affective computing. His papers have been cited over 44,000 times according to Google Scholar. His research outcomes have been widely applied and have earned multiple awards, including the Second Prize of the National Science and Technology Progress Award (2005), the Second Prize of the National Natural Science Award (2015), and the First Prize of the CSIG Natural Science Award (2022, 2024). He has served as Vice Chair of the Young Professionals Committee of the China Computer Federation (CCF), and Vice Chair of the Pattern Recognition and Affective Computing Technical Committees of the Chinese Association for Artificial Intelligence (CAAI).
Speech Title: Advances and Trends in Vision-Based Unobtrusive Psychological Measurement
Abstract: Accurate measurement of human emotions and psychological states has significant theoretical and practical value in fields such as healthcare and safety. Current mainstream psychometric methods rely primarily on text-based questionnaires, which are subjective and prone to deception, limiting their suitability for broader theoretical development and practical applications. In response, intelligent unobtrusive psychological measurement techniques leveraging visual, auditory, and other remote sensing modalities combined with artificial intelligence have emerged. This talk will present recent advances in this field, with a particular focus on Prof. Shan’s research team’s work in vision-based emotion and health perception. Key topics include facial expression recognition, facial action unit detection, gaze and eye interaction, and heart rate estimation. Preliminary applications in assessing children with autism spectrum disorder will also be discussed, highlighting the potential of these techniques for real-world deployment.
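As an illustration of how visual signals can carry physiological information, the sketch below implements a classical remote photoplethysmography (rPPG) baseline: average the green channel of a facial region over time and take the dominant frequency within a plausible heart-rate band. This is a textbook baseline under simplifying assumptions (stable face crop, steady lighting), not the method presented in the talk.

```python
# Illustrative rPPG baseline (not the speaker's specific method): estimate heart
# rate from a facial video by averaging the green channel over a face region and
# finding the dominant frequency within a plausible heart-rate band.
import numpy as np


def estimate_heart_rate(frames: np.ndarray, fps: float) -> float:
    """frames: (T, H, W, 3) face crops; returns estimated beats per minute."""
    signal = frames[..., 1].mean(axis=(1, 2)).astype(np.float64)  # mean green per frame
    signal -= signal.mean()                                       # remove DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)                        # ~42-240 BPM
    peak_freq = freqs[band][np.argmax(spectrum[band])]
    return peak_freq * 60.0


if __name__ == "__main__":
    # Synthetic 10-second clip at 30 fps with a 1.2 Hz (72 BPM) intensity ripple.
    fps, t = 30.0, np.arange(300) / 30.0
    ripple = 2.0 * np.sin(2 * np.pi * 1.2 * t)
    frames = (128 + ripple)[:, None, None, None] * np.ones((1, 8, 8, 3))
    print(f"Estimated heart rate: {estimate_heart_rate(frames, fps):.1f} BPM")
```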
Jianjun Qian
Nanjing University of Science and Technology, Professor
Biography:
Jianjun Qian is a Professor and Ph.D. supervisor at Nanjing University of Science and Technology, and serves as Vice Chair of the Pattern Recognition Technical Committee of the Jiangsu Artificial Intelligence Society. His research focuses on pattern recognition and visual computing, as well as human-centered embodied perception. He has published over 100 papers in leading journals and conferences, including IEEE TPAMI, IEEE TIP, IEEE TNNLS, IJCV, Pattern Recognition, CVPR, AAAI, and ACM MM. Prof. Qian has led three projects funded by the National Natural Science Foundation of China and a key project under the Jiangsu Basic Research Program, and has participated in multiple projects such as the JW Science & Technology Committee Basic Enhancement Program. He has received two First Prizes of the Jiangsu Science and Technology Award (as second and fourth contributor). He has also been selected for the National Young Talent Program, the “Xiangjiang Scholar Program,” and recognized as an Outstanding Young Backbone Teacher under Jiangsu’s “Qinglan Project.”
Speech Title: Face Perception: From the Outside In
Abstract: Face perception plays a crucial role in human-centered embodied intelligent systems, where actively sensing physiological signals and emotional states provides important support for tasks such as health monitoring and affective interaction. These capabilities have broad applications in smart homes, digital healthcare, and intelligent service systems. This talk will present research on extracting internal information from external facial cues, including physiological signal perception (heart rate, respiration rate, and body temperature) and multimodal emotion analysis. The methods and findings highlight the potential of face-based perception for enabling effective embodied interaction in real-world applications.
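For the multimodal emotion analysis component, one common and simple design is late fusion of per-modality predictions. The sketch below shows a generic confidence-weighted late-fusion step over hypothetical face-based and physiology-based classifiers; the class set, weights, and probability vectors are illustrative placeholders, not the speaker's architecture.

```python
# Generic late-fusion illustration for multimodal emotion analysis: combine
# per-modality class probabilities with confidence weights. The modality
# predictors themselves are hypothetical and replaced by fixed arrays here.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]


def late_fusion(probs_by_modality: dict, weights: dict) -> np.ndarray:
    """Weighted average of per-modality probability vectors, renormalized."""
    fused = sum(weights[m] * p for m, p in probs_by_modality.items())
    return fused / fused.sum()


if __name__ == "__main__":
    preds = {
        "face":       np.array([0.10, 0.70, 0.10, 0.10]),  # from a face model
        "physiology": np.array([0.25, 0.40, 0.20, 0.15]),  # from rPPG-style cues
    }
    fused = late_fusion(preds, weights={"face": 0.7, "physiology": 0.3})
    print("Fused probabilities:", dict(zip(EMOTIONS, fused.round(3))))
```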