Exploiting Zero-Entropy Data for Efficient Deduplication
Advisor: Al-Kiswany, Samer
Publisher: University of Waterloo
Abstract
As the volume of digital data continues to grow rapidly, efficient data reduction techniques such as deduplication are essential for managing storage and bandwidth. A key step in deduplication is file chunking, typically performed with Content-Defined Chunking (CDC) algorithms. While these algorithms have been studied on random data, their performance on zero-entropy data, i.e., data containing long sequences of identical bytes, has not been explored. Such zero-entropy data are common in real-world datasets and pose a challenge for CDC-based deduplication systems.
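To make the failure mode concrete, the following is a minimal sketch of a Gear-style rolling-hash chunker in the spirit of FastCDC; the gear table, mask value, and size bounds are illustrative assumptions, not the specific algorithms evaluated in this thesis.

    import random

    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random table
    MASK = (1 << 13) - 1               # cut when low 13 bits are zero (~8 KiB avg)
    MIN_SIZE, MAX_SIZE = 2048, 65536   # assumed chunk-size bounds

    def chunk(data: bytes):
        """Yield (offset, length) pairs for content-defined chunk boundaries."""
        start, fp = 0, 0
        for i, b in enumerate(data):
            # Gear rolling hash: shift in one random value per input byte.
            fp = ((fp << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            # Cut when the fingerprint matches the mask pattern, subject to
            # the minimum and maximum chunk sizes.
            if (size >= MIN_SIZE and (fp & MASK) == 0) or size >= MAX_SIZE:
                yield start, size
                start, fp = i + 1, 0
        if start < len(data):
            yield start, len(data) - start

In this sketch, a long run of identical bytes drives the fingerprint's low bits to a constant, so the content-defined cut condition either fires at every position or never; the chunker then falls back to fixed minimum- or maximum-size cuts, which illustrates the kind of degenerate behaviour zero-entropy data can induce in hash-based CDC.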
This thesis studies the impact of zero-entropy data on the performance of state-of-the-art hash-based and hashless CDC algorithms. The results show that existing algorithms, particularly hash-based ones, are inefficient at detecting and handling zero-entropy blocks, especially small ones, which reduces space savings. To address this issue, I propose ZERO (Zero-Entropy Region Optimization), a system that integrates into the deduplication pipeline. ZERO identifies and extracts zero-entropy blocks before chunking, compresses them with Run-Length Encoding (RLE), and stores their metadata for later reconstruction. ZERO improves deduplication space savings by up to 29% without impacting throughput.
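As a rough illustration of that pipeline, here is a minimal sketch of run extraction and reconstruction; the function names, the Run record, and the MIN_RUN threshold are assumptions made for illustration, not the thesis implementation.

    from dataclasses import dataclass

    MIN_RUN = 1024  # assumed threshold for treating a run as zero-entropy

    @dataclass
    class Run:
        offset: int  # position of the run in the original stream
        byte: int    # the repeated byte value
        length: int  # run length (the RLE encoding of the region)

    def extract_runs(data: bytes):
        """Split data into (residual_bytes, runs) prior to chunking."""
        residual, runs = bytearray(), []
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            if j - i >= MIN_RUN:
                runs.append(Run(offset=i, byte=data[i], length=j - i))
            else:
                residual.extend(data[i:j])
            i = j
        return bytes(residual), runs

    def reconstruct(residual: bytes, runs: list) -> bytes:
        """Rebuild the original stream from residual bytes plus run metadata."""
        out, pos = bytearray(), 0
        for run in sorted(runs, key=lambda r: r.offset):
            take = run.offset - len(out)       # residual bytes before this run
            out.extend(residual[pos:pos + take])
            pos += take
            out.extend(bytes([run.byte]) * run.length)
        out.extend(residual[pos:])
        return bytes(out)

After extraction, the residual stream with the long runs removed is what would be handed to the CDC stage, while each Run record carries enough metadata to reinsert its region during reconstruction.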