Encoding FHIR Medical Data for Transformers

dc.contributor.author: Yu, Trevor
dc.date.accessioned: 2025-04-29T17:49:50Z
dc.date.available: 2025-04-29T17:49:50Z
dc.date.issued: 2025-04-29
dc.date.submitted: 2025-04-15
dc.description.abstract: The open source Fast Healthcare Interoperability Resources (FHIR) data standard is increasingly being adopted as a format for representing and communicating medical data. FHIR represents various types of medical data as resources, which have a standardized JSON structure. FHIR offers interoperability and can be used for electronic medical record storage and, more recently, machine learning analytics. A recent trend in machine learning is the development of large foundation models trained on large volumes of unstructured data. Transformers are a deep neural network architecture for sequence modelling and have been used to build foundation models for natural language processing. Text is input to transformers as a sequence of tokens; tokenization algorithms break text into discrete chunks, called tokens. Using language tokenizers on FHIR JSON data is inefficient, producing several hundred text tokens per resource. Patient records may contain several thousand resources, which in total exceeds the number of tokens that most text transformers can handle. Additionally, discrete encoding of numeric and time data may not be appropriate for these continuous quantities. In this thesis, I design a tokenization method that operates on data using the open source Health Level 7 FHIR standard. This method takes JSON returned from a FHIR server query and assigns tokens to chunks of JSON based on FHIR data structures. The FHIR tokens can be used to train transformer models, and the methodology to train FHIR transformer models on sequence classification and masked language modelling tasks is presented. The performance of this method is validated on the open source MIMIC-IV FHIR dataset for length-of-stay (LOS) prediction and mortality prediction (MP) tasks. In addition, I explore methods for encoding numerical and time-delta values using continuous vector encodings rather than assigning discrete tokens to values. I also explore using compression methods to reduce the long sequence lengths. Previous works using MIMIC-IV have reported their performance on the LOS and MP tasks using XGBoost models, which use bespoke feature encodings. The results show that the FHIR transformer performs better than an XGBoost model on the LOS task but worse on the MP task. None of the continuous encoding methods perform significantly better than the discrete encoding methods, but they are not worse either. Compression methods improve both accuracy and inference speed on long sequences. Since performance is task dependent, future research should validate this method on other datasets and tasks. MIMIC-IV is too small to show the benefits of pre-training, but if a larger dataset can be obtained, the methodology developed in this work could be applied to creating a large FHIR foundation model.
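The following is a minimal sketch of the two ideas the abstract describes, not the thesis's actual implementation: flattening a FHIR resource into structural chunks with one vocabulary token per chunk, and a continuous vector encoding of numeric values as an alternative to discrete value tokens. The names FhirVocab, chunk_fhir_resource, and continuous_value_encoding, the chunking granularity, and the sinusoidal encoding are all assumptions made for illustration.

import json
import math

class FhirVocab:
    # Toy vocabulary mapping structural chunks to token ids.
    # In practice the vocabulary would be fixed before training.
    def __init__(self):
        self.token_to_id = {"[PAD]": 0, "[MASK]": 1, "[CLS]": 2}

    def encode(self, chunk: str) -> int:
        if chunk not in self.token_to_id:
            self.token_to_id[chunk] = len(self.token_to_id)
        return self.token_to_id[chunk]

def chunk_fhir_resource(resource: dict) -> list[str]:
    # Flatten a FHIR resource into path=value chunks, one token each,
    # instead of tokenizing the raw JSON text character by character.
    chunks = []
    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for item in node:
                walk(item, path)
        else:
            chunks.append(f"{path}={node}")
    walk(resource, "")
    return chunks

def continuous_value_encoding(value: float, dim: int = 8) -> list[float]:
    # One possible continuous alternative to a discrete value token:
    # a sinusoidal vector encoding of the raw number (an assumption,
    # not necessarily the encoding explored in the thesis).
    return [math.sin(value / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(value / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

# A heart-rate Observation reduces to a handful of chunk tokens,
# versus several hundred text tokens from a language tokenizer.
observation = json.loads("""{
  "resourceType": "Observation",
  "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4"}]},
  "valueQuantity": {"value": 72, "unit": "beats/minute"}
}""")
vocab = FhirVocab()
print([vocab.encode(c) for c in chunk_fhir_resource(observation)])
print(continuous_value_encoding(72.0))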
dc.identifier.uri: https://hdl.handle.net/10012/21680
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.relation.uri: https://physionet.org/content/mimic-iv-fhir/1.0/
dc.subject: MEDICINE::Physiology and pharmacology::Physiology::Medical informatics
dc.subject: Deep learning
dc.subject: Fast Healthcare Interoperability Resources (FHIR)
dc.title: Encoding FHIR Medical Data for Transformers
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Systems Design Engineering
uws-etd.degree.discipline: Systems Design Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 1 year
uws.contributor.advisor: Tripp, Bryan
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Yu_Trevor.pdf
Size: 3.71 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission