Post-Training Large Language Models as Software Engineering Agents
Advisor
Chen, Wenhu
Publisher
University of Waterloo
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code understanding and generation, yet a significant gap remains between static code generation and interactive software engineering. This thesis investigates the post-training of LLMs as software engineering agents, focusing on three interconnected challenges: infrastructure, data, and training methodology.
First, we contribute to VerlTool, a unified framework for agentic reinforcement learning with tool integration (ARLT). The author's contributions center on the training orchestration layer: the stateful environment protocol, the environment server architecture, and the SWE agent post-training pipeline. These components make tool-augmented RL training practical and accessible for researchers.
Second, we address the critical bottleneck of training data and evaluation infrastructure.
SWE-Next provides a scalable, Ray-native pipeline for synthesizing verifiable software
engineering tasks from open-source repositories (ongoing work with intermediate results
reported). For SWE-QA-Pro, a representative benchmark for code question answering,
the author contributes the data sourcing and synthesis pipeline.
Third, we investigate the post-training design space for software engineering agents, spanning supervised fine-tuning (SFT), rejection fine-tuning (RFT), RL from AI feedback (RLAIF), and RL with verifiable rewards (RLVR). Through three complementary case studies, covering code question answering (SFT + RLAIF), web-based information retrieval (SFT + RFT), and repository-level bug fixing (RLVR), we demonstrate that the optimal training recipe depends on task characteristics such as reward verifiability, exploration complexity, and data availability. Our experiments show that task-specific post-training of smaller open-weight models can be competitive with larger proprietary models, and that matching the training method to the task structure is more important than uniformly applying all stages.