
Acceleration of Integer Transformer Models Via Structured Resource Management Using FPGAs

dc.contributor.author: Elayat, Omar
dc.date.accessioned: 2025-08-19T19:15:26Z
dc.date.available: 2025-08-19T19:15:26Z
dc.date.issued: 2025-08-19
dc.date.submitted: 2025-08-14
dc.description.abstract: The widespread adoption of Large Language Models (LLMs) in various applications has pushed the demand for efficient hardware acceleration beyond the capabilities of traditional platforms. Due to their highly parallel architecture and ease of deployment, Field Programmable Gate Arrays (FPGAs) are widely used to accelerate LLMs. However, FPGA on-chip memory resources remain too limited to accommodate trained models. While existing FPGA-based solutions have demonstrated promising throughput and energy efficiency, they often rely on abundant fabric resources, assume high-bandwidth devices that are not suitable for deployment at the edge, or employ highly customized acceleration architectures that do not scale with advances in LLM architectures. This thesis addresses these challenges by proposing a novel on-chip resource manager architecture for integer encoder-based transformer inference, with a focus on Bidirectional Encoder Representations from Transformers (BERT) models. We target resource-constrained FPGAs with limited memory bandwidth and show that significant performance improvements can be achieved through structured operation scheduling and resource sharing. The proposed resource-shared infrastructure is also designed to be modular, allowing newly introduced computation blocks to be integrated into the accelerator without major modifications or additional off-chip data movement. Demonstrated on a fully quantized integer-only variant of the BERT model as a representative workload, the proposed system achieves a 2.32x latency improvement over the baseline custom accelerator, 1.17x over the Jetson Orin Nano GPU, and at least 23.63x over a CPU. The design is validated on two FPGAs: the PYNQ-Z1 as a low-end proof-of-concept and the KV260 as a mid-range deployment target.
dc.identifier.uri: https://hdl.handle.net/10012/22203
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: FPGA
dc.subject: LLMs
dc.subject: Accelerator
dc.title: Acceleration of Integer Transformer Models Via Structured Resource Management Using FPGAs
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 1 year
uws.contributor.advisor: Gaudet, Vincent
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
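
To make the abstract's reference to a fully quantized, integer-only BERT variant concrete, the sketch below (not taken from the thesis) illustrates symmetric int8 quantization and an integer-only matrix multiply with requantization in Python. The function names, scales, and matrix sizes are illustrative assumptions; on an FPGA, the floating-point rescale would typically be replaced by a fixed-point multiply and shift.

import numpy as np

def quantize(x, num_bits=8):
    # Symmetric per-tensor quantization: map floats to int8 with a single scale.
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    scale = float(np.max(np.abs(x))) / qmax
    scale = scale if scale > 0 else 1.0                 # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def int_matmul_requant(a_q, a_scale, b_q, b_scale, out_scale):
    # Accumulate in int32 (the integer MACs an FPGA would implement),
    # then rescale the product back to int8 for the next layer.
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    requant = np.round(acc * (a_scale * b_scale) / out_scale)
    return np.clip(requant, -127, 127).astype(np.int8)

# Illustrative usage: quantize two small matrices, then multiply them
# using integer arithmetic only (apart from the scale bookkeeping).
a_q, a_s = quantize(np.random.randn(4, 8).astype(np.float32))
b_q, b_s = quantize(np.random.randn(8, 3).astype(np.float32))
c_q = int_matmul_requant(a_q, a_s, b_q, b_s, out_scale=0.1)
print(c_q)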

Files

Original bundle
Name: Elayat_Omar.pdf
Size: 2.39 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission