Qinyuan Cheng

Hi! I am a fourth-year PhD student at Fudan University, advised by Prof. Xipeng Qiu. I received my bachelor's degree from Sun Yat-sen University, advised by Prof. Hanjiang Lai. I am currently interning at Shanghai AI Laboratory as an LLM researcher (2023.5-Now). Previously, I worked as a software development engineer intern at ByteDance (2021.3-2021.8).

My research interests center on post-training of large language models, covering alignment (truthfulness and safety), reinforcement learning for LLMs (reasoning), multimodal large models (vision and speech), interpretability (mechanistic interpretability), and LLM-based agent systems (RAG and task-oriented dialogue systems). Reach out to me via email: chengqy2019@foxmail.com.

CV  /  Google Scholar  /  Github  /  X (twitter)  /  OpenMOSS

profile photo
My Timeline
  • [Feb. 2025] New Paper: We revisited the test-time scaling capabilities of current open-source o1-like models and proposed a method combining parallel sampling and sequential revisions for better test-time scaling.
  • [Jan. 2025] We released SpeechGPT-2.0-preview, a GPT-4o-level, real-time spoken dialogue system.
  • [Dec. 2024] We released a survey paper on o1's roadmap, which has attracted widespread attention in the global AI community. It is joint work with Zhiyuan Zeng, Zhangyue Yin and Bo Wang.
  • [Sep. 2024] Two papers accepted to COLING 2025!
  • [Sep. 2024] Four papers accepted to EMNLP 2024!
  • [Jun. 2024] We released the SIUO benchmark. We are the first to reveal that individually safe single-modal content may lead to unsafe responses when cross-modal inputs are involved.
  • [Jun. 2024] We proposed Inference-Time Decontamination to reuse leaked benchmarks for LLM evaluations.
  • [Jun. 2024] We proposed Unified Active Retrieval (UAR) for RAG, a unified lightweight framework to address different active retrieval scenarios in RAG systems. This work is cited by Google DeepMind.
  • [May. 2024] One paper accepted to ACL 2024 (Findings) and one paper accepted to COLING 2024!
  • [May. 2024] One paper accepted to ICML 2024! We systematically studied truthfulness alignment of LLMs (including SFT, DPO, BoN, PPO, and HIR) and proposed a framework to align LLMs with their knowledge boundaries.
  • [Mar. 2024] Excited to announce OpenMOSS! It's an open platform to share research from the MOSS team.
  • [Jan. 2024] I led a team to develop an LLM-based agent system in collaboration with HONOR, which has been integrated into Magic OS 8.0 to help users create videos from their photo albums.
  • [Dec. 2023] I gave a talk on Evaluating Hallucinations in Chinese LLMs at NICE.
  • [Oct. 2023] We released HalluQA, a high-quality adversarial hallucination benchmark for Chinese LLMs. This work is cited by OpenAI and Google DeepMind.
  • [Aug. 2023] I organized an LLM post-training competition on improving Chinese LLMs' intelligence, truthfulness, and safety (total prize of 1,000,000 RMB).
  • [May. 2023] I joined Shanghai AI Laboratory and participated in the post-training of InternLM1.
  • [May. 2023] We proposed CLAIF and CLHAIF, the first work to use AI feedback to provide scalable training signals for contrastive learning and to combine AI feedback with human feedback for further improvement.
  • [Feb. 2023] We are excited to release MOSS, a conversational language model. I was mainly responsible for data synthesis in the post-training stage, including multi-turn dialogue, safety alignment, honesty alignment, etc.
  • [Nov. 2022] One paper accepted to AAAI 2023!
  • [Oct. 2022] My first paper was accepted to EMNLP 2022 (Findings)! We construct a user simulator for task-oriented agents and use SFT + RL to fine-tune T5 models. Based on these, we propose an interactive evaluation framework for TOD agents and use reward shaping to improve the quality of the generated responses.
  • [Sep. 2022] I gave a talk on diffusion models at FNLP. The document can be viewed here.
  • [Sep. 2021] I joined the Fudan NLP Lab at Fudan University as a master's student.
Highlighted Papers

A full list of papers can be found on Google Scholar

(*: Equal contribution)

Can AI Assistants Know What They Don't Know?
Qinyuan Cheng*, Tianxiang Sun*, Xiangyang Liu, Wenwei Zhang, Zhangyue Yin, Shimin Li, Linyang Li, Zhengfu He, Kai Chen, Xipeng Qiu
ICML 2024  
pdf / github / blog on OpenMOSS

We ask the question "Can AI assistants know what they don't know and express them through natural language?" To answer this question, we construct a model-specific "I don't know" (Idk) dataset for an assistant, which contains its known and unknown questions, based on existing open-domain question answering datasets. Then we design a series of truthfulness alignment methods to align the assistant with its corresponding Idk dataset and observe whether it can refuse to answer its unknown questions after alignment.

Evaluating Hallucinations in Chinese Large Language Models
Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, Xipeng Qiu
Preprint 2023, cited by OpenAI and Google DeepMind  
pdf / github / blog

We introduce HalluQA, a benchmark with 450 adversarial questions to assess hallucinations in Chinese large language models, covering diverse domains and cultural contexts. It targets imitative falsehoods and factual errors, built using GLM-130B and ChatGPT, and evaluated automatically with GPT-4. Testing 24 models like ERNIE-Bot and Qwen, 18 scored below 50% non-hallucination rates, showing its difficulty. We analyze hallucination types, causes, and prioritization for different models.

Improving Contrastive Learning of Sentence Embeddings from AI Feedback
Qinyuan Cheng, Xiaogui Yang, Tianxiang Sun, Linyang Li, Xipeng Qiu
ACL 2023 Findings  
pdf / github / blog

We propose Contrastive Learning of sentence embeddings from AI Feedback (CLAIF) to enhance contrastive learning in NLP. Unlike typical methods struggling with sample pair quality due to language's discrete nature, CLAIF uses AI feedback from large language models to create fine-grained similarity scores for sample pairs. Combining this with human feedback, it improves supervised contrastive learning. Experiments show CLAIF outperforms other methods on semantic textual similarity and transfer learning tasks.

Unified Active Retrieval for Retrieval Augmented Generation
Qinyuan Cheng*, Xiaonan Li*, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, Xipeng Qiu
EMNLP 2024 Findings, cited by Google DeepMind  
pdf / github / blog

We introduce Unified Active Retrieval (UAR) to improve Retrieval-Augmented Generation (RAG) by addressing inefficiencies in existing active retrieval methods. UAR uses four orthogonal criteria, each cast as a simple classification task, to decide retrieval timing more accurately with minimal extra cost. Supported by UAR-Criteria, it handles diverse instructions effectively. Experiments show UAR outperforms existing methods in retrieval-timing accuracy and downstream task performance.

Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator
Qinyuan Cheng*, Linyang Li*, Guofeng Quan, Feng Gao, Xiaofeng Mou, Xipeng Qiu
EMNLP 2022 Findings  
pdf / github

We propose an interactive evaluation framework for Task-Oriented Dialogue (TOD) systems to address the policy mismatch in current methods, where user utterances don’t align with varied system responses. Our approach uses a goal-oriented user simulator built on pre-trained models to generate dialogues, introducing sentence- and session-level scores for fluency and coherence. Experiments show RL-based TOD systems trained with our simulator achieve 98% inform and success rates on MultiWOZ, with the new scores enhancing response quality assessment.

Cross-Modality Safety Alignment
Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang
NAACL 2025 Findings  
pdf / Project Page

We present the Safe Inputs but Unsafe Output (SIUO) challenge to assess cross-modality safety in Artificial General Intelligence (AGI), where individually safe modalities can combine to produce unsafe outputs. Unlike prior studies on single-modality risks, SIUO targets complex interactions across 9 safety domains, including self-harm and privacy violations. Our benchmark tests reveal significant vulnerabilities in models like GPT-4V and LLaVA, highlighting their limitations in handling real-world scenarios safely.

Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu
arXiv, 2402.12201  
pdf / blog on OpenMOSS

Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted more monosemantic features: How do we recognize circuits connecting the enormous amount of dictionary features? We propose a circuit discovery framework alternative to activation patching.

Projects & Resources

SpeechGPT-2.0-preview
Mainly responsible for text data curation (including pre-training and post-training).

We introduce SpeechGPT 2.0-preview, our first human-like real-time interaction system, trained on millions of hours of Chinese speech data. This end-to-end model offers low-latency, natural responses with human-like expressions, supporting interruptions and multi-style emotional control. It excels in role-playing, vocal talents like poetry and dialects, and integrates text capabilities for tool use and searches. Currently, it supports only Chinese, with no English training yet.

Survey Paper: Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective
First co-author.

We analyze OpenAI's o1, an AI milestone excelling in reasoning tasks, driven by reinforcement learning. Unlike alternatives like knowledge distillation, limited by teacher models, our roadmap focuses on four key components: policy initialization for human-like reasoning, reward design for effective guidance, search for high-quality solutions, and learning to enhance performance with more data and parameters. These elements highlight how learning and search power o1's success, influencing LLM development.

MOSS: A Conversational Language Model
Mainly responsible for data synthesis in the post-training stage, including multi-turn dialogue, safety alignment, and honesty alignment.

MOSS is a conversational language model like ChatGPT. It is capable of following users' instructions to perform various natural language tasks, including question answering, text generation, text summarization, code generation, etc. MOSS is also able to challenge incorrect premises and reject inappropriate requests. Here is a brief introduction to MOSS.

Service

Reviewer / Program Committee Member

  • ACL (2023, 2024)
  • EMNLP (2022, 2023, 2024)
  • COLING (2025)
  • NAACL (2025)
  • ICML (2025)
  • ICLR (2025)
  • NeurIPS (2024, 2025)
  • AISTATS (2025)

Design and source code from Jon Barron's website