
Unifying Foundation Models with Decision Making: A Path Towards Autonomous Self-Improving Agents beyond Rewards and Human Supervision

dc.contributor.author: Chandra, Abhranil
dc.date.accessioned: 2025-08-25T13:51:02Z
dc.date.available: 2025-08-25T13:51:02Z
dc.date.issued: 2025-08-25
dc.date.submitted: 2025-08-18
dc.description.abstract: Recent progress in AI has been driven primarily along two axes: scaling large generative Foundation Models (FMs) trained on static internet-scale data, and developing better sequential decision-making (DM) and reinforcement learning (RL) algorithms that enable experiential learning. Each paradigm alone has its flaws, but together they offer a scalable recipe for general intelligence. When AlphaGo made its famous "Move 37", it reaffirmed RL's efficacy as a paradigm for bootstrapping intelligence from scratch through interactive learning, optimizing for goal-directed behavior, self-improvement, and the emergence of novel superhuman abilities. However, RL is hard to scale beyond narrow tasks or simulated environments, given the scarcity of real-world decision-centric data, the sparsity of feedback, hard-to-design rewards, and the difficulty of scaling to larger models. Conversely, recent generative FMs pretrained on static internet-scale text and image data excel at broad, high-level knowledge acquisition but fail to acquire robust internal world models, which limits their agency, their ability to plan and reason well, and their extrapolation beyond the training data. In this thesis we focus on combining these complementary paradigms, both building and using FMs for DM and developing DM and RL tools for improving FMs, treating DM/RL as a way to fine-tune and optimize general-purpose pretrained models to elicit better decisions beyond the training data, rather than as a way to bootstrap intelligence from scratch. Such broadly capable systems can then power agents that perceive, reason, and act robustly, completing embodied tasks in physical settings and serving as autonomous task-completion and knowledge-discovery agents in virtual settings.

First, we introduce VideoAgent, a jointly trained goal-conditioned video-generation policy and self-improving simulator for embodied planning tasks. VideoAgent learns to refine its own generated plans through a novel self-conditioning consistency objective and feedback from pretrained vision-language models (VLMs), without requiring ground-truth action labels or explicit rewards. The model further leverages search to iteratively improve its video plans with inference-time compute, yielding more grounded and physically plausible plans on robotic manipulation tasks.

Second, we develop Reg-ReBoot, a framework for investigating efficient and scalable methods to turn base, non-reasoning LLMs into better reasoners without explicit verified data or rule-based verifiers. We analyze a counterintuitive idea: fine-tuning language models on unverified, and even incorrect, reasoning traces improves their reasoning. We show that large language models (LLMs), owing to their inductive biases, can learn useful reasoning heuristics by averaging over noisy chain-of-thought (CoT) data. Our results on mathematical reasoning benchmarks indicate that noisy synthetic data can efficiently bootstrap performance and reduce reliance on hard-to-curate verified solutions. From these insights we propose a two-stage mid-training pipeline that lowers the barrier to scalable reasoning improvement.

Finally, we address the evaluation bottleneck in generative modeling with ReFeR, a multi-agent, tuning-free evaluation framework. ReFeR uses a hierarchy of pretrained LLMs and VLMs to provide automatic, scalable, and high-quality feedback on both textual and visual generations. The framework not only rivals human-level evaluation accuracy but also produces structured feedback that enables downstream distillation and fine-tuning, and it remains effective even on complex reasoning tasks. Together, through these works, I aim to contribute toward the goal of building autonomous self-improving agents: systems powered by foundation models that leverage test-time compute, generative simulations and world models, and diverse learning signals beyond explicit rewards and human feedback to drive interactive learning and decision making.
dc.identifier.uri: https://hdl.handle.net/10012/22246
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: Reinforcement Learning
dc.subject: Self-Improvement
dc.subject: Sequential Decision Making
dc.subject: Unsupervised Policy Learning
dc.subject: Generative World Models
dc.subject: Planning
dc.subject: Search
dc.subject: Machine Learning
dc.subject: Video Policy
dc.subject: Model-Based Learning
dc.subject: Robot Learning
dc.subject: LLM
dc.subject: Reasoning
dc.subject: Scalable Oversight
dc.title: Unifying Foundation Models with Decision Making: A Path Towards Autonomous Self-Improving Agents beyond Rewards and Human Supervision
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.comment.hidden: I have made all the recommended formatting-related edits and uploaded the new PDF.
uws.contributor.advisor: Fischmeister, Sebastian
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Chandra_Abhranil.pdf
Size: 34.98 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission