Dongwei Jiang

Research Interest

I am currently working on reinforcement learning and agents, particularly the intersection of these two areas — using RL to train agentic models. Tool integration and the ability to interact with the environment have fundamentally changed what AI systems can accomplish, and RL has emerged as the dominant approach for enhancing these capabilities.

I’m also broadly interested in reasoning, where I’ve worked on:

Building a general-purpose verifier through rationale extraction from unlabelled data to provide process supervision during reasoning [1] (mentioned in Lilian Weng's blog)

Investigating the effectiveness of CoT prompting across 100+ papers and 20 datasets and discovering that CoT mainly benefits math/symbolic reasoning tasks [2] (discussion with Jason Wei)

Theorem proving and logical reasoning, using the Lean theorem prover to support the reasoning process [3]

Decompositional entailment: formulating a consistent and theoretically grounded approach to annotating decompositional entailment datasets [4]

Another area I’m interested in is the self-improvement capability of LLMs (and LLM agents). If we begin with the “end” (superintelligence/AGI) in mind, relying on human input won’t get us there. We need to teach models to interact with the environment and self-improve. Within this area, I’ve worked on:

Understanding what prevents LLMs from self-improving effectively [5]

Probing the limits of self-improvement even with high-quality feedback [6]

More About Me

Prior to my current role, I spent six years in industry working on speech processing, where I pioneered self-supervised learning approaches for speech (like masked predictive coding [7] and Speech SimCLR [8]) and was among the first to deploy end-to-end ASR systems at production scale. Following the release of ChatGPT, I became deeply interested in foundation models and their potential, which motivated me to return to academia and complete my master’s degree at JHU. There, I worked with Professor Daniel Khashabi and Benjamin Van Durme, and also collaborated with Professor Shay Cohen from Edinburgh and Greg Durrett from NYU on various research projects. Currently, I’m working as an Applied Scientist at Amazon, where I continue to pursue research in foundation models and related areas.

In my free time, I sometimes play Civ 6 or Hearthstone. I also rotate between tennis, badminton, swimming, and bouldering every day—well, more like every three or four days, but who’s counting? Research shows racquet sports can reduce mortality risk by 47% and swimming by 28%—so between all these activities, I’m either achieving immortality or just really bad at math :)

I’ve noticed there’s something puzzle-like about all these activities—whether it’s planning civilizations, crafting the perfect deck, or figuring out a tricky climbing route—which probably explains why I enjoy them alongside my research work.

Selected Publications

Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback, NeurIPS

Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, and Daniel Khashabi

CoRR, 2025

Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and reach correct solutions. In this paper, we systematically investigate LLMs’ ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 with extended thinking. Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We analyze FEEDBACK FRICTION and find that models’ confidence on specific questions, measured by semantic entropy, predicts feedback resistance: high-confidence predictions remain resistant to external correction. We hope that highlighting this issue in LLMs will help future research in self-improvement.

RATIONALYST: Pre-training Process-Supervision for Improving Reasoning, ACL

Dongwei Jiang, Guoxuan Wang, Yining Lu, Andrew Wang, Jingyu Zhang, Chuyu Liu, Benjamin Van Durme, and Daniel Khashabi

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2025

The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from a web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and similarly sized models fine-tuned on matching training sets.

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, ICLR

Zayne Rea Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett

In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra “thinking” really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model’s response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT’s gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.

LeanReasoner: Boosting Complex Logical Reasoning with Lean, NAACL

Dongwei Jiang, Marcio Fonseca, and Shay B. Cohen

In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024

Large language models (LLMs) often struggle with complex logical reasoning due to logical inconsistencies and the inherent difficulty of such reasoning. We use Lean, a theorem proving framework, to address these challenges. By formalizing logical reasoning problems into theorems within Lean, we can solve them by proving or disproving the corresponding theorems. This method reduces the risk of logical inconsistencies with the help of Lean’s symbolic solver. It also enhances our ability to treat complex reasoning tasks by using Lean’s extensive library of theorem proofs. Our method achieves state-of-the-art performance on the FOLIO dataset and achieves performance near this level on ProofWriter. Notably, these results were accomplished by fine-tuning on fewer than 100 in-domain samples for each dataset.

Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic, EMNLP

Nathaniel Weir, Kate Sanders, Orion Weller, Shreya Sharma, Dongwei Jiang, Zhengping Jiang, Bhavana Dalvi Mishra, Oyvind Tafjord, Peter Jansen, Peter Clark, and Benjamin Van Durme

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Recent language models enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what _valid decompositional entailment_ is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic entailment engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment and evaluate its impact on LLM-based textual inference. We find that our new dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency than prior decompositional entailment datasets, suggesting that RDTE is a significant step forward in the long-standing problem of forming a clear protocol for discerning entailment. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality, illustrating the practical benefit of this advance for textual inference.

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses, AAAI

Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, and Daniel Khashabi

In Proceedings of the AAAI Conference on Artificial Intelligence, Nov 2025

Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives than at generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than at generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.

A Further Study of Unsupervised Pretraining for Transformer Based Speech Recognition, ICASSP

Dongwei Jiang, Wubo Li, Ruixiong Zhang, Miao Cao, Ne Luo, Yang Han, Wei Zou, Kun Han, and Xiangang Li

In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021

Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked Predictive Coding achieved significant improvements on various speech recognition datasets with BERT-like Masked Reconstruction loss and Transformer backbone. However, many aspects of MPC have not been fully investigated. In this paper, we conduct a further study on MPC and focus on three important aspects: the effect of pre-training data speaking style, its extension to streaming models, and how to better transfer learned knowledge from the pre-training stage to downstream tasks. Experiments revealed that pre-training data with matching speaking style is more useful on downstream recognition tasks. A unified training objective with APC and MPC provided 8.46% relative error reduction on a streaming model trained on HKUST. Also, the combination of target data adaptation and layer-wise discriminative training helped the knowledge transfer of MPC, which achieved 3.99% relative error reduction on AISHELL over a strong baseline.

Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-Supervised Speech Representation Learning, InterSpeech

Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, and Xiangang Li

In 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021

Self-supervised visual pretraining has shown significant progress recently. Among those methods, SimCLR greatly advanced the state of the art in self-supervised and semi-supervised learning on ImageNet. The input feature representations for speech and visual tasks are both continuous, so it is natural to consider applying a similar objective to speech representation learning. In this paper, we propose Speech SimCLR, a new self-supervised objective for speech representation learning. During training, Speech SimCLR applies augmentation on raw speech and its spectrogram. Its objective is the combination of a contrastive loss that maximizes agreement between differently augmented samples in the latent space and a reconstruction loss of the input representation. The proposed method achieved competitive results on speech emotion recognition and speech recognition.

Improving Transformer-based Speech Recognition Using Unsupervised Pre-training, arxiv

Dongwei Jiang, Xiaoning Lei, Wubo Li, Ne Luo, Yuxuan Hu, Wei Zou, and Xiangang Li

CoRR, Nov 2019

Speech recognition technologies are gaining enormous popularity in various industrial applications. However, building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, an unsupervised pre-training method called Masked Predictive Coding is proposed, which can be applied for unsupervised pre-training with Transformer-based models. Experiments on HKUST show that using the same training data, we can achieve CER 23.3%, exceeding the best end-to-end model by over 0.2% absolute CER. With more pre-training data, we can further reduce the CER to 21.0%, or an 11.8% relative CER reduction over baseline.

Dongwei Jiang

LLM agents, reasoning and self-improvement. Previously focused on speech

Applied Scientist at Amazon. Previously master's student at JHU
