Dongwei Jiang

Research Interest

I am broadly interested in reasoning. In the realm of reasoning, I’ve worked on complicated problems like:

Theorem proving and Logical reasoning that uses theorem prover Lean to help with the reasoning process [1]
Decompositional entailment that formulates a consistent and theoretically grounded approach to annotating decompositional entailment dataset [2]

However, one puzzling limitation in LLM reasoning is that while these models can solve “superhuman” problems in specific domains, they often fail at simple tasks. This observation led me to question whether LLMs’ problem-solving abilities truly demonstrate superior reasoning capabilities or simply reflect domain-specific overspecialization. As a result, my focus has now turned to the more general, system-2-like reasoning. My work in this area includes:

Building general-purpose verifier through rationale extraction from unlabelled data to provide process supervision during reasoning [3]
Investigating the effectiveness of CoT prompting across 100+ papers and 20 datasets and discovering CoT benefits mainly math/symbolic reasoning tasks [4]

I’m also interested in the self-improvement capability of LLMs. If we begin with the “end” (superintelligence/AGI) in mind, relying on human input won’t get us there. We need to teach models to interact with the environment and self-improve. Specifically, I’ve worked on:

Understanding the reason that prevents LLM from effective self-improvement [5]
Probing the limits of self-improvement even with high-quality feedback [6]

I believe these two research directions are deeply interconnected and can synergistically enhance each other. Strong reasoning capabilities are essential for effective self-improvement, as models need to logically analyze and discriminate between good and bad generations to provide meaningful feedback. Conversely, self-improvement mechanisms are crucial for advancing reasoning capabilities, as complex logical problems often require multiple attempts and refinements to reach the correct solution. This bidirectional relationship suggests that advancing either area could create positive feedback loops that benefit both capabilities.

In addition, my research has frequently drawn inspiration from cognitive science concepts, including cognitive load, system 2 reasoning, and zone of proximal development. This connection seems natural, given that LLMs are fundamentally trained to emulate human cognitive patterns. I would love to explore this intersection more deeply in future research.

More About Me

In my past life, I spent six years working in the industry on speech processing and speech foundation models. Recently, my focus has shifted to LLMs. To that end, I’m currently studying at JHU as a master’s student, working with Professor Daniel Khashabi and Benjamin Van Durme. I’ve also worked with Professor Shay Cohen from Edinburgh and Greg Durrett from UT Austin.

My industrial career has been marked by a series of devastating external events that fundamentally disrupted the companies where I worked. At DiDi, I was developing speech processing systems when the company was hit by severe regulatory action from the Chinese government, forcing its delisting from the New York Stock Exchange and creating massive operational uncertainty. At YuanFuDao, my work was abruptly affected when the Double Reduction Policy essentially crippled the core business model of educational technology companies throughout China. Later at Shopee, I was advancing speech technologies when the combination of global economic downturn and US-China tensions triggered an 80% stock price collapse and extensive layoffs throughout the company. These successive corporate disruptions necessitated my transitions between roles, as each company faced existential challenges that made continuing my technical work there untenable.

In my free time, I sometimes play Civ 6 or Hearthstone. I also run and go bouldering every other day - well, more like every three or four days, but who’s counting? I’ve noticed there’s something puzzle-like about all these activities—whether it’s planning civilizations, crafting the perfect deck, or figuring out a tricky climbing route—which probably explains why I enjoy them alongside my research work.

Selected Publications

arxiv

Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback

Dongwei Jiang , Alvin Zhang , Andrew Wang , Nicholas Andrews , and Daniel Khashabi

arXiv preprint, 2025

Abstract PDF

Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs’ ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.
ACL

RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Dongwei Jiang , Guoxuan Wang , Yining Lu , Andrew Wang , Jingyu Zhang , Chuyu Liu , Benjamin Van Durme , and Daniel Khashabi

arXiv preprint, 2024

Abstract PDF

The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and similarly sized models fine-tuned on matching training sets.
ICLR

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Zayne Sprague , Fangcong Yin , Juan Diego Rodriguez , Dongwei Jiang , Manya Wadhwa , Prasann Singhal , Xinyu Zhao , Xi Ye , Kyle Mahowald , and Greg Durrett

arXiv preprint, 2024

Abstract PDF

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra “thinking” really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model’s response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT’s gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
NAACL

LeanReasoner: Boosting Complex Logical Reasoning with Lean

Dongwei Jiang , Marcio Fonseca , and Shay B. Cohen

In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, 2024

Abstract PDF

Large language models (LLMs) often struggle with complex logical reasoning due to logical inconsistencies and the inherent difficulty of such reasoning. We use Lean, a theorem proving framework, to address these challenges. By formalizing logical reasoning problems into theorems within Lean, we can solve them by proving or disproving the corresponding theorems. This method reduces the risk of logical inconsistencies with the help of Lean’s symbolic solver. It also enhances our ability to treat complex reasoning tasks by using Lean’s extensive library of theorem proofs. Our method achieves state-of-the-art performance on the FOLIO dataset and achieves performance near this level on ProofWriter. Notably, these results were accomplished by fine-tuning on fewer than 100 in-domain samples for each dataset.
EMNLP

Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic

Nathaniel Weir , Kate Sanders , Orion Weller , Shreya Sharma , Dongwei Jiang , Zhengping Jiang , Bhavana Dalvi Mishra , Oyvind Tafjord , Peter Jansen , Peter Clark , and Benjamin Van Durme

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Abstract PDF

Recent language models enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what _valid decompositional entailment_ is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic entailment engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment and evaluate its impact on LLM-based textual inference. We find that our new dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency than prior decompositional entailment datasets, suggesting that RDTE is a significant step forward in the long-standing problem of forming a clear protocol for discerning entailment. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality, illustrating the practical benefit of this advance for textual inference.
AAAI

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Dongwei Jiang , Jingyu Zhang , Orion Weller , Nathaniel Weir , Benjamin Van Durme , and Daniel Khashabi

Nov 2024

Abstract PDF

Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives, than generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
ICASSP

A Further Study of Unsupervised Pretraining for Transformer Based Speech Recognition

Dongwei Jiang , Wubo Li , Ruixiong Zhang , Miao Cao , Ne Luo , Yang Han , Wei Zou , Kun Han , and Xiangang Li

In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021, Nov 2021

Abstract PDF

Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked Predictive Coding achieved significant improvements on various speech recognition datasets with BERT-like Masked Reconstruction loss and Transformer backbone. However, many aspects of MPC have not been fully investigated. In this paper, we conduct a further study on MPC and focus on three important aspects: the effect of pre-training data speaking style, its extension on streaming model, and how to better transfer learned knowledge from pre-training stage to downstream tasks. Experiments reveled that pre-training data with matching speaking style is more useful on downstream recognition tasks. A unified training objective with APC and MPC provided 8.46% relative error reduction on streaming model trained on HKUST. Also, the combination of target data adaption and layer-wise discriminative training helped the knowledge transfer of MPC, which achieved 3.99% relative error reduction on AISHELL over a strong baseline.
InterSpeech

Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-Supervised Speech Representation Learning

Dongwei Jiang , Wubo Li , Miao Cao , Wei Zou , and Xiangang Li

In 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021, Nov 2021

Abstract PDF

Self-supervised visual pretraining has shown significant progress recently. Among those methods, SimCLR greatly advanced the state of the art in self-supervised and semi-supervised learning on ImageNet. The input feature representations for speech and visual tasks are both continuous, so it is natural to consider applying similar objective on speech representation learning. In this paper, we propose Speech SimCLR, a new self-supervised objective for speech representation learning. During training, Speech SimCLR applies augmentation on raw speech and its spectrogram. Its objective is the combination of contrastive loss that maximizes agreement between differently augmented samples in the latent space and reconstruction loss of input representation. The proposed method achieved competitive results on speech emotion recognition and speech recognition.
arxiv

Improving Transformer-based Speech Recognition Using Unsupervised Pre-training

Dongwei Jiang , Xiaoning Lei , Wubo Li , Ne Luo , Yuxuan Hu , Wei Zou , and Xiangang Li

CoRR, Nov 2019

Abstract PDF

Speech recognition technologies are gaining enormous popularity in various industrial applications. However, building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, an unsupervised pre-training method called Masked Predictive Coding is proposed, which can be applied for unsupervised pre-training with Transformer based model. Experiments on HKUST show that using the same training data, we can achieve CER 23.3%, exceeding the best end-to-end model by over 0.2% absolute CER. With more pre-training data, we can further reduce the CER to 21.0%, or a 11.8% relative CER reduction over baseline.

Dongwei Jiang

Speech researcher in the past, NLP researcher now

Second year Master's student

CLSP, Johns Hopkins University

Research Interest

More About Me

Selected Publications