Information Extraction for Low-Resource Schemas

dc.contributor.author: Xu, Justin
dc.date.accessioned: 2026-06-04T17:19:56Z
dc.date.available: 2026-06-04T17:19:56Z
dc.date.issued: 2026-06-04
dc.date.submitted: 2026-05-04
dc.description.abstract: Information Extraction (IE) is a set of tasks central to creating structured data, such as knowledge graphs, from unstructured data such as text. The earlier paradigm of IE focused on models with specialized neural network architectures, usually based on transformer encoders. These models typically address a single IE subtask, follow a single schema of entity and relation types, and are trained via supervised learning on large datasets of annotated text. The current paradigm, called Universal IE (UIE), instead uses large language models (LLMs) that generalize across IE subtasks and to completely unseen schemas, but that lack other abilities such as entity grounding and calibration. We first discuss structural consistency, a new measure of robustness in information extraction based on compositionality. We present structural consistency post-training (SCPT), a data augmentation method that boosts structural consistency for a wide range of model architectures. Besides greatly improving robustness, SCPT significantly reduces the amount of labelled data needed to reach a given level of performance when training specialized IE models. Second, we use reasoning-based data augmentation techniques to gather AdaIE, a very large collection of human-annotated information extraction schemas. We diverge from UIE and align the dataset with a new task we call Guided Information Extraction (GIE). GIE emphasizes the tight grounding and schema-following requirements that have been largely neglected in UIE. Evaluations reveal that state-of-the-art UIE methods can be surpassed by recent commercial LLMs. Although those LLMs still perform below human level on AdaIE, they are rapidly advancing. Overall, we hope that both works will steer the IE research community towards unifying the strengths of the old and new IE paradigms, while casting light on their weaknesses.
dc.identifier.uri: https://hdl.handle.net/10012/23544
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo [en]
dc.relation.uri: https://github.com/xujustinj/t2g-consistency
dc.relation.uri: https://github.com/adanomad/AdaIE
dc.subject: information extraction
dc.subject: knowledge graphs
dc.subject: large language models
dc.subject: relation extraction
dc.subject: entity extraction
dc.subject: data augmentation
dc.subject: consistency training
dc.subject: dataset
dc.subject: natural language processing
dc.title: Information Extraction for Low-Resource Schemas
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.embargo.terms: 0
uws.comment.hidden: The GitHub code repositories are currently private due to ongoing or anticipated conference submissions, and they may not be the final resting places of the code. I would like them to be hidden until future notification that the repository is available (or moved to another location). However, if that is not possible, I would prefer that the repositories are not included at all.
uws.contributor.advisor: Poupart, Pascal
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]

Files

Original bundle

Name: Xu_Justin.pdf
Size: 2.18 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission