LLM-Based Frameworks for Information Retrieval Evaluation

Upadhyay, Shivani Jayantkumar

dc.contributor.author	Upadhyay, Shivani Jayantkumar
dc.date.accessioned	2026-06-29T14:17:46Z
dc.date.available	2026-06-29T14:17:46Z
dc.date.issued	2026-06-29
dc.date.submitted	2026-06-22
dc.description.abstract	Evaluating information retrieval (IR) systems requires a reference that captures what correct or relevant output looks like, as well as a mechanism for determining whether a system's output matches that reference. For lexical retrieval systems, both requirements are relatively straightforward. Systems rank documents by term overlap, pooling produces a judgment file that covers most documents any system is likely to return, and determining relevance reduces to a simple membership test against that file. This evaluation paradigm relies on the assumption that relevance can be detected through surface-form overlap. When retrieval moves beyond that assumption, the framework begins to break down. Retrieval-augmented generation (RAG) systems strain this setup by synthesising free-form natural language responses from retrieved evidence. A gold answer set constructed before system execution cannot anticipate every correct phrasing, so even semantically correct outputs can fail under lexical matching. Dense retrieval systems encode queries and documents as vectors, so they can retrieve relevant documents that share no vocabulary with the query. Under pooling-based evaluation, these documents never receive human judgments and are instead assigned a default relevance grade of zero. Together, these failures highlight the limits of surface-form evaluation and point to the need for judgment mechanisms that reason directly about meaning. This thesis investigates whether large language models (LLMs) can fill this gap, and contributes three frameworks across successive layers of the evaluation pipeline. The first contribution is an open-source QA evaluation framework that combines chain-of-thought (CoT) prompting with self-consistency decoding using instruction-tuned LLMs. When evaluated across 12 systems on NQ-open, it matches zero-shot GPT-4 in rank correlation with human judgments while using a model more than an order of magnitude smaller, demonstrating that prompting strategy can matter as much as scale. The second contribution is a framework for patching incomplete relevance judgment sets by assigning four-level TREC-style labels to unjudged query-passage pairs via few-shot prompting. When evaluated across five TREC Deep Learning Track collections at removal rates varying from 10% to 90%, it substantially improves system ranking fidelity over the standard practice of treating unjudged documents as non-relevant. The third contribution is UMBRELA, a fully automated open-source relevance assessment framework deployed in the TREC 2024 RAG Track across 301 topics, achieving run-level Kendall's tau >= 0.86 against fully manual assessment. All frameworks are released as open-source tools to support reproducible and scalable IR evaluation.
dc.identifier.uri	https://hdl.handle.net/10012/23678
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.title	LLM-Based Frameworks for Information Retrieval Evaluation
dc.type	Master Thesis
uws-etd.degree	Master of Mathematics
uws-etd.degree.department	David R. Cheriton School of Computer Science
uws-etd.degree.discipline	Computer Science
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0
uws.contributor.advisor	Lin, Jimmy
uws.contributor.affiliation1	Faculty of Mathematics
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en
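
The abstract notes that, under pooling-based evaluation, unjudged documents default to a relevance grade of zero, and that patching those gaps with labels changes the measured effectiveness. The sketch below illustrates that effect with a simple precision-at-k computation. It is not the thesis's implementation: the document IDs, the grades, the `precision_at_k` helper, and the `min_grade` threshold are all hypothetical, and the "patched" labels simply stand in for labels that a judge (human or LLM) might assign.

```python
from typing import Dict, List


def precision_at_k(
    ranking: List[str],
    qrels: Dict[str, int],
    k: int,
    min_grade: int = 1,
) -> float:
    """Fraction of the top-k documents whose grade is at least min_grade.

    Documents missing from qrels are unjudged and default to grade 0,
    which mirrors the standard practice described in the abstract.
    """
    top_k = ranking[:k]
    hits = sum(1 for doc_id in top_k if qrels.get(doc_id, 0) >= min_grade)
    return hits / k


# Hypothetical pooled judgments on a four-level scale (0 to 3).
pooled_qrels = {"d1": 3, "d2": 0, "d3": 2}

# Hypothetical ranking from a system whose results fell outside the pool:
# d7 and d9 were never judged.
ranking = ["d1", "d7", "d3", "d9"]

# Hypothetical patched judgments: the two unjudged documents now have labels.
patched_qrels = {**pooled_qrels, "d7": 2, "d9": 0}

p_default = precision_at_k(ranking, pooled_qrels, k=4)
p_patched = precision_at_k(ranking, patched_qrels, k=4)

print(f"unjudged treated as non-relevant: {p_default:.2f}")
print(f"after patching unjudged pairs:    {p_patched:.2f}")
```

In this toy case the system is penalised for returning `d7` only because no one judged it; once a label exists, its score rises. Whether a grade of 1 or 2 counts as relevant is an assumption here and varies by collection.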

Files

Original bundle

Name: Upadhyay_Shivani.pdf
Size: 3.08 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission

Collections

Theses
Computer Science