LLM-Based Frameworks for Information Retrieval Evaluation

dc.contributor.author: Upadhyay, Shivani Jayantkumar
dc.date.accessioned: 2026-06-29T14:17:46Z
dc.date.available: 2026-06-29T14:17:46Z
dc.date.issued: 2026-06-29
dc.date.submitted: 2026-06-22
dc.description.abstract: Evaluating information retrieval (IR) systems requires a reference that captures what correct or relevant output looks like, as well as a mechanism for determining whether a system's output matches that reference. For lexical retrieval systems, both requirements are relatively straightforward. Systems rank documents by term overlap, pooling produces a judgment file that covers most documents any system is likely to return, and determining relevance reduces to a simple membership test against that file. This evaluation paradigm relies on the assumption that relevance can be detected through surface-form overlap. When retrieval moves beyond that assumption, the framework begins to break down. Retrieval-augmented generation (RAG) systems strain this setup by synthesising free-form natural language responses from retrieved evidence. A gold answer set constructed before system execution cannot anticipate every correct phrasing, so even semantically correct outputs can fail under lexical matching. Dense retrieval systems encode queries and documents as vectors and can therefore retrieve relevant documents that share no vocabulary with the query. Under pooling-based evaluation, these documents never receive human judgments and are instead assigned a default relevance grade of zero. Together, these failures highlight the limits of surface-form evaluation and point to the need for judgment mechanisms that reason directly about meaning. This thesis investigates whether large language models (LLMs) can fill this gap by contributing three frameworks across successive layers of the evaluation pipeline. The first contribution is an open-source QA evaluation framework that combines chain-of-thought (CoT) prompting with self-consistency decoding using instruction-tuned LLMs. When evaluated across 12 systems on NQ-open, it matches zero-shot GPT-4 in rank correlation with human judgments while using a model more than an order of magnitude smaller, demonstrating that prompting strategy can matter as much as scale. The second contribution is a framework for patching incomplete relevance judgment sets by assigning four-level TREC-style labels to unjudged query-passage pairs via few-shot prompting. When evaluated across five TREC Deep Learning Track collections at removal rates ranging from 10% to 90%, it substantially improves system ranking fidelity over the standard practice of treating unjudged documents as non-relevant. The third contribution is UMBRELA, a fully automated open-source relevance assessment framework deployed in the TREC 2024 RAG Track across 301 topics, achieving run-level Kendall's τ ≥ 0.86 against fully manual assessment. All frameworks are released as open-source tools to support reproducible and scalable IR evaluation.
dc.identifier.uri: https://hdl.handle.net/10012/23678
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo (en)
dc.title: LLM-Based Frameworks for Information Retrieval Evaluation
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo (en)
uws-etd.embargo.terms: 0
uws.contributor.advisor: Lin, Jimmy
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed (en)
uws.published.city: Waterloo (en)
uws.published.country: Canada (en)
uws.published.province: Ontario (en)
uws.scholarLevel: Graduate (en)
uws.typeOfResource: Text (en)

Files

Original bundle

Name:
Upadhyay_Shivani.pdf
Size:
3.08 MB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
6.4 KB
Format:
Item-specific license agreed upon to submission