Information Extraction for Low-Resource Schemas

Xu, Justin

dc.contributor.author	Xu, Justin
dc.date.accessioned	2026-06-04T17:19:56Z
dc.date.available	2026-06-04T17:19:56Z
dc.date.issued	2026-06-04
dc.date.submitted	2026-05-04
dc.description.abstract	Information Extraction (IE) is a set of important tasks concerned with creating structured data, such as knowledge graphs, from unstructured data, such as text. The earlier paradigm of IE focused on models with specialized neural network architectures, usually based on transformer encoders. These models typically address a single IE subtask, follow a single schema of entity and relation types, and are trained via supervised learning on large datasets of annotated texts. The current paradigm, called Universal IE (UIE), instead relies on large language models, which can generalize across IE subtasks and to completely unseen schemas but lack other abilities such as entity grounding and calibration. First, we discuss structural consistency, a new measure of robustness in information extraction based on compositionality. We present structural consistency post-training (SCPT), a data augmentation method that boosts structural consistency for a wide range of model architectures. Besides greatly improving robustness, SCPT significantly reduces the amount of labelled data needed to reach a given level of performance when training specialized IE models. Second, we use reasoning-based data augmentation techniques to gather AdaIE, a very large collection of human-annotated information extraction schemas. We diverge from UIE and align the dataset with a new task we call Guided Information Extraction (GIE). GIE emphasizes the tight grounding and schema-following requirements that have been largely neglected in UIE. Our evaluations reveal that state-of-the-art UIE methods can be surpassed by recent commercial large language models (LLMs). Although those LLMs still fall short of human performance on AdaIE, they are rapidly advancing. Overall, we hope that both works will steer the IE research community towards unifying the strengths of the old and new IE paradigms while casting light on their weaknesses.
dc.identifier.uri	https://hdl.handle.net/10012/23544
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.relation.uri	https://github.com/xujustinj/t2g-consistency
dc.relation.uri	https://github.com/adanomad/AdaIE
dc.subject	information extraction
dc.subject	knowledge graphs
dc.subject	large language models
dc.subject	relation extraction
dc.subject	entity extraction
dc.subject	data augmentation
dc.subject	consistency training
dc.subject	dataset
dc.subject	natural language processing
dc.title	Information Extraction for Low-Resource Schemas
dc.type	Master Thesis
uws-etd.degree	Master of Mathematics
uws-etd.degree.department	David R. Cheriton School of Computer Science
uws-etd.degree.discipline	Computer Science
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0
uws.comment.hidden	The GitHub code repositories are currently private due to ongoing or anticipated conference submissions, and they may not be the final resting places of the code. I would like them to be hidden until future notification that the repository is available (or moved to another location). However, if that is not possible, I would prefer that the repositories are not included at all.
uws.contributor.advisor	Poupart, Pascal
uws.contributor.affiliation1	Faculty of Mathematics
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en

Files

Original bundle

Name: Xu_Justin.pdf
Size: 2.18 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission

Collections

Theses
Computer Science