Post-Training Large Language Models as Software Engineering Agents
| dc.contributor.author | LYU, Zhiheng | |
| dc.date.accessioned | 2026-04-28T12:48:31Z | |
| dc.date.available | 2026-04-28T12:48:31Z | |
| dc.date.issued | 2026-04-28 | |
| dc.date.submitted | 2026-04-13 | |
| dc.description.abstract | Large language models (LLMs) have demonstrated remarkable capabilities in code understanding and generation, yet a significant gap remains between static code generation and interactive software engineering. This thesis investigates the post-training of LLMs as software engineering agents, focusing on three interconnected challenges: infrastructure, data, and training methodology. First, we contribute to VerlTool, a unified framework for agentic reinforcement learning with tool integration (ARLT). The author's contributions center on the training orchestration layer (the stateful environment protocol, environment server architecture, and SWE agent post-training pipeline), which makes tool-augmented RL training practical and accessible for researchers. Second, we address the critical bottleneck of training data and evaluation infrastructure. SWE-Next provides a scalable, Ray-native pipeline for synthesizing verifiable software engineering tasks from open-source repositories (ongoing work with intermediate results reported). For SWE-QA-Pro, a representative benchmark for code question answering, the author contributes the data sourcing and synthesis pipeline. Third, we investigate the post-training design space for software engineering agents, spanning supervised fine-tuning (SFT), rejection fine-tuning (RFT), RL from AI feedback (RLAIF), and RL with verifiable rewards (RLVR). Through three complementary case studies (code question answering with SFT + RLAIF, web-based information retrieval with SFT + RFT, and repository-level bug fixing with RLVR), we demonstrate that the optimal training recipe depends on task characteristics such as reward verifiability, exploration complexity, and data availability. Our experiments show that task-specific post-training of smaller open-weight models can be competitive with larger proprietary models, and that matching the training method to the task structure is more important than uniformly applying all stages. | |
| dc.identifier.uri | https://hdl.handle.net/10012/23070 | |
| dc.language.iso | en | |
| dc.pending | false | |
| dc.publisher | University of Waterloo | en |
| dc.title | Post-Training Large Language Models as Software Engineering Agents | |
| dc.type | Master Thesis | |
| uws-etd.degree | Master of Mathematics | |
| uws-etd.degree.department | David R. Cheriton School of Computer Science | |
| uws-etd.degree.discipline | Computer Science | |
| uws-etd.degree.grantor | University of Waterloo | en |
| uws-etd.embargo.terms | 0 | |
| uws.contributor.advisor | Chen, Wenhu | |
| uws.contributor.affiliation1 | Faculty of Mathematics | |
| uws.peerReviewStatus | Unreviewed | en |
| uws.published.city | Waterloo | en |
| uws.published.country | Canada | en |
| uws.published.province | Ontario | en |
| uws.scholarLevel | Graduate | en |
| uws.typeOfResource | Text | en |