
Acceleration of Integer Transformer Models Via Structured Resource Management Using FPGAs

dc.contributor.author: Elayat, Omar
dc.date.accessioned: 2025-08-19T19:15:26Z
dc.date.available: 2025-08-19T19:15:26Z
dc.date.issued: 2025-08-19
dc.date.submitted: 2025-08-14
dc.description.abstract: The widespread adoption of Large Language Models (LLMs) in various applications has pushed the demand for efficient hardware acceleration beyond the capabilities of traditional platforms. Due to their highly parallel architecture and ease of deployment, Field Programmable Gate Arrays (FPGAs) are widely used to accelerate LLMs. However, FPGA on-chip memory resources remain too limited to accommodate trained models. While existing FPGA-based solutions have demonstrated promising throughput and energy efficiency, they often rely on abundant fabric resources, assume high-bandwidth devices that are not suitable for deployment at the edge, or employ highly customized acceleration architectures that do not scale with advances in LLM architectures. This thesis addresses these challenges by proposing a novel on-chip resource manager architecture for integer encoder-based transformer inference, with a focus on Bidirectional Encoder Representations from Transformers (BERT) models. We target resource-constrained FPGAs with limited memory bandwidth and show that significant performance improvements can be achieved through structured operation scheduling and resource sharing. The proposed resource-shared infrastructure is also designed to be modular, allowing newly introduced computation blocks to be integrated into the accelerator without major modifications or additional off-chip data movement. Demonstrated on a fully quantized integer-only variant of the BERT model as a representative workload, the proposed system achieves a 2.32x latency improvement over the baseline custom accelerator, 1.17x over the Jetson Orin Nano GPU, and at least 23.63x over a CPU. The design is validated on two FPGAs: the PYNQ-Z1 as a low-end proof-of-concept and the KV260 as a mid-range deployment target.
dc.identifier.uri: https://hdl.handle.net/10012/22203
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: FPGA
dc.subject: LLMs
dc.subject: Accelerator
dc.title: Acceleration of Integer Transformer Models Via Structured Resource Management Using FPGAs
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 1 year
uws.contributor.advisor: Gaudet, Vincent
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
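
To make the abstract's reference to a fully quantized, integer-only BERT variant concrete, the sketch below (not taken from the thesis) illustrates symmetric int8 quantization and an integer-only matrix multiply with requantization in Python. The function names, scales, and matrix sizes are illustrative assumptions; on an FPGA, the floating-point rescale would typically be replaced by a fixed-point multiply and shift.

import numpy as np

def quantize(x, num_bits=8):
    # Symmetric per-tensor quantization: map floats to int8 with a single scale.
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    scale = float(np.max(np.abs(x))) / qmax
    scale = scale if scale > 0 else 1.0                 # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def int_matmul_requant(a_q, a_scale, b_q, b_scale, out_scale):
    # Accumulate in int32 (the integer MACs an FPGA would implement),
    # then rescale the product back to int8 for the next layer.
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    requant = np.round(acc * (a_scale * b_scale) / out_scale)
    return np.clip(requant, -127, 127).astype(np.int8)

# Illustrative usage: quantize two small matrices, then multiply them
# using integer arithmetic only (apart from the scale bookkeeping).
a_q, a_s = quantize(np.random.randn(4, 8).astype(np.float32))
b_q, b_s = quantize(np.random.randn(8, 3).astype(np.float32))
c_q = int_matmul_requant(a_q, a_s, b_q, b_s, out_scale=0.1)
print(c_q)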

Files

Original bundle
Name: Elayat_Omar.pdf
Size: 2.39 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission