I am an Algorithm Researcher at Li Auto, where I joined through the "Li Auto +" Top Talent Program within the Base Model Team. I received my Master's degree from the Academy for Engineering and Technology, Fudan University in 2025, supervised by Prof. Lihua Zhang (National Thousand Talents Program) at the Cognition and Intelligent Technology Laboratory (CIT Lab). Prior to this, I gained valuable research experience as an Algorithm Intern at Tencent Youtu Lab (May 2025 – Oct 2025), mentored by Ke Li (Principal Researcher). I holds a B.Eng. in Robotics Engineering from Hohai University, where I conducted research at the Jiangsu Province Special Robot Laboratory (2018-2022) under the guidance of Prof. Kang Xia and Prof. Tingting Wang.

I have authored 9 papers as the first or co-first author, published in international conferences or available on arXiv, and published 15+ papers as co-author at top international conferences such as NeurIPS, AAAI, CVPR, and so on.

My current research interests mainly focus on Multimodal LLM, LLM, and Agents. If you are interested in my research or would like to have an academic discussion, welcome to contact me via email!

🎖 Honors and Scholarships

BYD Master Student Scholarship (Top 1%), 2024

Excellent Student (Top 5%), 2024

Huatai Securities Technology Scholarship (Top 1%), 2024

CICAI Finalist of Best Student Paper Award, 2023

Outstanding Academic Scholarship, 2023,2024

Outstanding Graduate of Fudan University, 2025 (Top 5%)

Outstanding Master's Thesis of Fudan University, 2025 (Top 5%)

🔈 Academic Service

Reviewer: ICML, ICLR, NeurIPS, AAAI, ACM MM, ICME, MICCAI, JBI, CIBM.

📝 Selected Publications

Equal contribution $^\star$ Corresponding author $^\dagger$

Large Language/Visual Models

ICASSP 2025

MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

Yue Jiang $^\star$ , Jiawei Chen $^\star$ , ..., Lihua Zhang $^\dagger$

We introduce MedThink, a novel medical construction method that effectively mitigates hallucinations in LVLMs within the medical domain.

NeurIPS 2024

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Dingkang Yang, ..., Shuaibin Wang $^\star$ , Jiawei Chen $^\star$ , ..., Peng Zhai $^\dagger$ , Lihua Zhang $^\dagger$

This paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Upon well-designed PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline.

ACML 2024

Large Vision-Language Models as Emotion Recognizers in Context Awareness

Yuxuan Lei, Dingkang Yang $^\dagger$ , Zhaoyu Chen, Jiawei Chen ..., Lihua Zhang $^\dagger$

We systematically explore the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task from three paradigms: 1) We fine-tune LVLMs on CAER datasets, which is the most common way to transfer large models to downstream tasks. 2) We design a training-free framework to exploit the In-Context Learning (ICL) capabilities of LVLMs. 3) To leverage the rich knowledge base of LVLMs, we incorporate Chain-of-Thought (CoT) into our framework to enhance the reasoning ability and provide interpretable results.

Submitted to ICLR 2025

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jiawei Chen $^\star$ , Dingkang Yang $^\star$ , Tong Wu, ..., Lihua Zhang $^\dagger$

We introduce the first benchmark dedicated to hallucination detection in the medical domain, Med-HallMark, and provide baselines for various LVLMs. We propose the first hallucination detection model, MediHallDetector, and demonstrate its superiority through extensive experiments. We present a new hallucination evaluation metric, MediHall Score, and show its effectiveness relative to traditional metrics through qualitative and quantitative analysis.

ACM MM 2024

Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models

Jiawei Chen, Dingkang Yang, Yue Jiang, ..., Lihua Zhang $^\dagger$

We are the first to centre on finetuning a small subset of the Med-VLP's inherent parameters to adapt to downstream tasks. We conduct a comprehensive series of experiments finetuning foundational components of Med-VLMs, including systematic comparisons with existing PEFT methods centred on tuning extrinsic components.

Submitted to ICLR 2025

PEG: Towards Robust Text Retrieval with Progressive Learning

Tong Wu, Yulei Qin, Jiawei Chen, ..., Lihua Zhang $^\dagger$

We propose the Progressively learned textual EmbeddinG (PEG) for robust text retrieval. Specifically, we increase the number of negative samples per training batch to 80,000, with each query paired with at least five hard negatives via offline mining. Concurrently, we incorporate a progressive learning mechanism to enable the model to dynamically modulate its attention to the samples throughout training. Extensive experiments on C-MTEB and DuReader demonstrate that PEG surpasses state-of-the-art embedding models in retrieving true positives, highlighting its significant potential for applications in LLMs.

MICCAI 2024

Can LLMs' Tuning Methods Work in Medical Multimodal Domain?

Jiawei Chen, Yue Jiang, ..., Lihua Zhang $^\dagger$

We delve into the fine-tuning methods of LLMs and conduct extensive experiments to investigate the impact of fine-tuning methods for large models on existing multimodal models in the medical domain from the training data level and the model structure level.

ICANN 2024

MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

Jiawei Chen, Dingkang Yang, Yue Jiang ..., Lihua Zhang $^\dagger$

We propose an efficient MultI-task Self-Supervised-learning-based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.

AIGC

AAAI 2025

BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation

Xiaolu Hou $^\star$ , Mingcheng Li $^\star$ , Dingkang Yang, Jiawei Chen ..., Lihua Zhang $^\dagger$

We propose BloomScene, a lightweight structured 3D Gaussian Splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead.

ACML 2024

SceneWeaver: Text-Driven Scene Generation with Geometry-aware Gaussian Splatting

Xiaolu Hou, Mingcheng Li, Jiawei Chen..., Lihua Zhang $^\dagger$

We propose a two-stage geometry-aware progressive scene generation framework, SceneWeaver, which creates diverse and high-quality 3D scenes from text or image inputs. In the first stage, we introduce a multi-level depth refinement mechanism that incrementally inpaints and updates 3D point clouds based on 2D pixels to construct high-quality initial point clouds of the scene. In the second stage, 3D Gaussian points are initialized based on the point cloud and continuously optimized...