
Unifying Foundation Models with Decision Making: A Path Towards Autonomous Self-Improving Agents beyond Rewards and Human Supervision

dc.contributor.author: Chandra, Abhranil
dc.date.accessioned: 2025-08-25T13:51:02Z
dc.date.available: 2025-08-25T13:51:02Z
dc.date.issued: 2025-08-25
dc.date.submitted: 2025-08-18
dc.description.abstract: Recent progress in AI has been driven primarily along two axes: scaling large generative Foundation Models (FMs) trained on static internet-scale data, and developing better sequential decision-making (DM) and reinforcement learning (RL) algorithms that enable experiential learning. Each paradigm alone has its flaws, but together they offer a scalable recipe for general intelligence. When AlphaGo made its famous "Move 37", it reaffirmed RL's efficacy as a paradigm for bootstrapping intelligence from scratch through interactive learning, optimizing for goal-directed behavior, self-improvement, and the emergence of novel superhuman abilities. However, RL is hard to scale beyond narrow tasks or simulated environments, given the scarcity of real-world decision-centric data, the sparsity of feedback, hard-to-design rewards, and the difficulty of scaling to larger models. Conversely, recent generative FMs pretrained on static internet-scale text and image data excel at broad, high-level knowledge acquisition but fail to acquire robust internal world models, which limits their agency, their ability to plan and reason well, and their extrapolation beyond the training data. In this thesis we focus on combining these complementary paradigms, both building and using FMs for DM and developing DM and RL tools for improving FMs, treating DM/RL as a way to fine-tune and optimize general-purpose pretrained models to elicit better decisions beyond the training data, rather than as a way to bootstrap intelligence from scratch. Such broadly capable systems can then power agents that perceive, reason, and act robustly, completing embodied tasks in physical settings and serving as autonomous task-completion and knowledge-discovery agents in virtual settings.

First, we introduce VideoAgent, a jointly trained goal-conditioned video-generation policy and self-improving simulator for embodied planning tasks. VideoAgent learns to refine its own generated plans through a novel self-conditioning consistency objective and feedback from pretrained vision-language models (VLMs), without requiring ground-truth action labels or explicit rewards. The model further leverages search to iteratively improve its video plans with inference-time compute, yielding more grounded and physically plausible plans on robotic manipulation tasks.

Second, we develop Reg-ReBoot, a framework for investigating efficient and scalable methods to turn base, non-reasoning LLMs into better reasoners without explicit verified data or rule-based verifiers. We analyze a counterintuitive idea: fine-tuning language models on unverified, and even incorrect, reasoning traces improves their reasoning. We show that large language models (LLMs), owing to their inductive biases, can learn useful reasoning heuristics by averaging over noisy chain-of-thought (CoT) data. Our results on mathematical reasoning benchmarks indicate that noisy synthetic data can efficiently bootstrap performance and reduce reliance on hard-to-curate verified solutions. From these insights we propose a two-stage mid-training pipeline that lowers the barrier to scalable reasoning improvement.

Finally, we address the evaluation bottleneck in generative modeling with ReFeR, a multi-agent, tuning-free evaluation framework. ReFeR uses a hierarchy of pretrained LLMs and VLMs to provide automatic, scalable, and high-quality feedback on both textual and visual generations. The framework not only rivals human-level evaluation accuracy but also produces structured feedback that enables downstream distillation and fine-tuning, and it remains effective even on complex reasoning tasks. Together, through these works, I aim to contribute toward the goal of building autonomous self-improving agents: systems powered by foundation models that leverage test-time compute, generative simulations and world models, and diverse learning signals beyond explicit rewards and human feedback to drive interactive learning and decision making.
dc.identifier.uri: https://hdl.handle.net/10012/22246
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: Reinforcement Learning
dc.subject: Self-Improvement
dc.subject: Sequential Decision Making
dc.subject: Unsupervised Policy Learning
dc.subject: Generative World Models
dc.subject: Planning
dc.subject: Search
dc.subject: Machine Learning
dc.subject: Video Policy
dc.subject: Model-Based Learning
dc.subject: Robot Learning
dc.subject: LLM
dc.subject: Reasoning
dc.subject: Scalable Oversight
dc.title: Unifying Foundation Models with Decision Making: A Path Towards Autonomous Self-Improving Agents beyond Rewards and Human Supervision
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.comment.hidden: I have made all the recommended formatting-related edits and uploaded the new PDF.
uws.contributor.advisor: Fischmeister, Sebastian
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Chandra_Abhranil.pdf
Size: 34.98 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission