I’m currently a Master’s student (since fall 2024) at the School of Computer Science, Fudan University, and a member of the FudanNLP Lab, advised by Prof. Xuanjing Huang (黄萱菁). Previously, I received my bachelor’s degree from Fudan University, advised by Associate Prof. Tao Gui.

🔥 News

  • 2025.7:  🎉🎉 Our paper on Reward Model Pre-training, POLAR, is now available!
  • 2024.5:  🎉🎉 One paper on math reasoning & RL was accepted by ICML-2024!
  • 2024.3:  🎉🎉 One paper on in-context learning was accepted by NAACL-2024-Findings!
  • 2023.12:  🎉🎉 One paper on evaluation was accepted by AAAI-2024!

💻 Internships

📝 Publications

Reward Modeling

Arxiv

POLAR: Policy-Discriminative Pre-training for Generalizable Reward Models

Shihan Dou*, Shichun Liu*, Yuming Yang*, Yicheng Zou*, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen
*Equal contributions. †Corresponding authors. ‡Work done during an internship at Shanghai AI Laboratory.

  • Say goodbye to reward models with poor generalization! POLAR (Policy Discriminative Learning) is a groundbreaking pre-training paradigm that trains reward models to distinguish policy distributions, eliminating heavy reliance on human preference data!
  • Highly scalable and tailored for Reinforcement Fine-tuning (RFT)! POLAR assigns rewards based on ground truths, seamlessly integrating into the RFT framework and significantly reducing reward hacking across general tasks!
ACL 2025

Lost in the Context: Insufficient and Distracted Attention to Contexts in Preference Modeling

Shihan Dou*, Jiayi Chen*, Chenhao Huang*, Feng Chen, Wei Chengzhi, Huiyuan Zheng, Shichun Liu, Yan Liu, Chenxiao Liu, Chao Xin, Lin Yan, Zongzhang Zhang, Tao Gui, Qi Zhang, Xuanjing Huang

  • The reward model (RM) in RLHF often overlooks crucial context, leading to poor preference alignment. We find that the RM allocates insufficient attention to the context and ignores relevant segments.
  • To address this, we propose AttnRM, a novel optimization framework that directs the RM’s focus to important contextual information. Experimental results show that AttnRM significantly enhances preference modeling, generalizability, and alignment with human preferences.

Reasoning

NAACL 2024 (Findings)

Self-Demos: Eliciting Out-of-Demonstration Generalizability in Large Language Models

Wei He, Shichun Liu, Jun Zhao, Yiwen Ding, Yi Lu, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang

  • Goal: develop a method that can enhance the generalizability of LLMs when encountering OOD queries, allowing them to better adapt to novel tasks.
  • Through extensive experiments on the tool-using scenario (OOD-Toolset) and mathematical problem-solving tasks (GSM8K and MATH datasets), SELF-DEMOS demonstrated superior performance in handling OOD queries compared to existing state-of-the-art methods.
ICML 2024

Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning

Zhiheng Xi*, Wenxiang Chen*, Boyang Hong*, Senjie Jin*, Rui Zheng, Wei He, Yiwen Ding, Shichun Liu, Xin Guo, Junzhe Wang, Honglin Guo, Wei Shen, Xiaoran Fan, Yuhao Zhou, Shihan Dou, Xiao Wang, Xinbo Zhang, Peng Sun, Tao Gui, Qi Zhang, Xuanjing Huang

  • We propose R3, a novel method that achieves the benefits of process supervision using only outcome supervision. R3 learns reasoning via a reverse curriculum, progressively moving from easy to hard tasks and enabling precise, step-level feedback.
  • Our method surpasses RL baselines on eight reasoning tasks by 4.1 points on average, and with CodeLlama-7B, it performs comparably to much larger models without extra data.

Evaluation

Arxiv

EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Shihan Dou, Ming Zhang, Chenhao Huang, Jiayi Chen, Feng Chen, Shichun Liu, Yan Liu, Chenxiao Liu, Cheng Zhong, Zongzhang Zhang, Tao Gui, Chao Xin, Wei Chengzhi, Lin Yan, Qi Zhang, Yonghui Wu, Xuanjing Huang

  • We introduce EvaLearn, a benchmark to evaluate the learning capability of LLMs through sequential problem-solving, where models learn from prior experience.
  • It features 648 problems in 182 sequences and five metrics, revealing that static ability doesn’t always correlate with learning capability, thus offering a new dimension for model evaluation.
AAAI 2024

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Yue Zhang*, Ming Zhang*, Haipeng Yuan, Shichun Liu, Yongyao Shi, Tao Gui, Qi Zhang, Xuanjing Huang

  • Addresses the crucial “how to evaluate” question for LLMs, analyzing various criteria, scoring methods, and ranking systems.
  • Introduces the LLMEval dataset, based on evaluations of 20 LLMs with over 240,000 manual annotations, and offers 10 key insights for future evaluation.
Arxiv

A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models

Junjie Ye*, Xuanting Chen*, Nuo Xu, Can Zu, Zekai Shao, Shichun Liu, Yuhan Cui, Zeyang Zhou, Chao Gong, Yang Shen, Jie Zhou, Siming Chen, Tao Gui, Qi Zhang, Xuanjing Huang

  • We analyze the capability evolution of six GPT-3 and GPT-3.5 models on 21 NLU datasets.
  • Our findings reveal that model capabilities do not uniformly improve with evolution, as strategies like RLHF can sometimes compromise performance on specific tasks while enhancing others.

Others

  • Multi-Programming Language Sandbox for LLMs
    Shihan Dou*, Jiazheng Zhang*, Jianxiang Zang*, Yunbo Tao, Weikang Zhou, Haoxiang Jia, Shichun Liu, Yuming Yang, Shenxi Wu, Zhiheng Xi, Muling Wu, Rui Zheng, Changze Lv, Limao Xiong, Shaoqing Zhang, Lin Zhang, Wenyu Zhan, Rongxiang Weng, Jingang Wang, Xunliang Cai, Yueming Wu, Ming Wen, Yixin Cao, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

  • TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities
    Ming Zhang*, Caishuang Huang*, Yilong Wu*, Shichun Liu, Huiyuan Zheng, Yurui Dong, Yujiong Shen, Shihan Dou, Jun Zhao, Junjie Ye, Qi Zhang, Tao Gui, Xuanjing Huang

🎖 Honors and Awards

  • 2024.06, the Top Students Award in Computer Science, in recognition of exceptional academic performance in the National Top Talent Undergraduate Training Program.
  • 2022.09, the First Prize (top 0.6% of 49,242 teams) in the Contemporary Undergraduate Mathematical Contest in Modeling (CUMCM).
  • 2021.12, the Second Prize of the Scholarship for Outstanding Students at Fudan University in the 2020-2021 academic year.
  • 2021.12, the Second Prize (Non-Physics A) in the 38th National Physics Competition for College Students.
  • 2021.12, 2022.12, the Second Prize (Non-Math) in the 13th and 14th National Mathematics Competition for College Students.

📖 Education

  • 2020.9 - 2024.6, B.E. in Computer Science and Technology, Fudan University.

💡 Services

  • 2025.1, Reviewer for the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).
