Bidirectional TopK Sparsification for Distributed Learning

dc.contributor.advisor: De Sterck, Hans
dc.contributor.advisor: Liu, Jun
dc.contributor.author: Zou, William
dc.date.accessioned: 2022-05-27T17:10:36Z
dc.date.available: 2022-05-27T17:10:36Z
dc.date.issued: 2022-05-27
dc.date.submitted: 2022-05-19
dc.description.abstract: Training large neural networks requires a large amount of time. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to its top K components before sending it to other nodes. Some authors have adapted topK gradient sparsification to the parameter-server framework by applying topK compression in both the worker-to-server and server-to-worker directions, as opposed to only the worker-to-server direction. Current intuition and analysis suggest that adding this extra compression degrades the convergence of the model. We provide a simple counterexample in which iterating with bidirectional topK SGD converges better than iterating with unidirectional topK SGD. We explain this example with the theoretical framework developed by Alistarh et al., remove a critical assumption the authors made in their non-convex convergence analysis of topK SGD, and show that bidirectional topK SGD can achieve the same convergence bound as unidirectional topK SGD under assumptions that are potentially easier to satisfy. We experimentally evaluate unidirectional topK SGD against bidirectional topK SGD and show that, with careful tuning, models trained with bidirectional topK SGD perform just as well as models trained with unidirectional topK SGD. Finally, we provide empirical evidence that the amount of communication saved by adding server-to-worker topK compression is almost linear in the number of workers.
dc.identifier.uri: http://hdl.handle.net/10012/18335
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.relation.uri: code repository used for experiments: https://github.com/wyxzou/Federated-Learning-PyTorch
dc.relation.uri: MNIST dataset: http://yann.lecun.com/exdb/mnist/
dc.relation.uri: Fashion MNIST dataset: https://github.com/zalandoresearch/fashion-mnist
dc.relation.uri: CIFAR10 dataset: https://www.cs.toronto.edu/~kriz/cifar.html
dc.subject: gradient compression
dc.subject: distributed learning
dc.subject: analysis of stochastic gradient descent
dc.title: Bidirectional TopK Sparsification for Distributed Learning
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: Applied Mathematics
uws-etd.degree.discipline: Data Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: De Sterck, Hans
uws.contributor.advisor: Liu, Jun
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
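The abstract above describes topK sparsification applied in the worker-to-server direction (unidirectional) and, additionally, in the server-to-worker direction (bidirectional) of a parameter-server setup. The following is a minimal, illustrative sketch of a single aggregation round under both schemes. It is not the thesis implementation (see the linked code repository for that): the function names top_k and server_round, the use of NumPy, the toy gradients, and the omission of error feedback and learning-rate details are all assumptions made only for illustration.

```python
# Minimal sketch of one parameter-server round with topK sparsification.
# Hypothetical names and toy data; not the thesis code.
import numpy as np

def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]  # indices of the k largest magnitudes
    out[idx] = vec[idx]
    return out

def server_round(worker_grads, k, bidirectional=True):
    """Aggregate worker gradients with worker-to-server topK compression,
    optionally compressing the server-to-worker broadcast as well."""
    # Each worker sends only its top-k gradient components to the server.
    compressed = [top_k(g, k) for g in worker_grads]
    avg = np.mean(compressed, axis=0)
    # Unidirectional: the dense average is broadcast back to all workers.
    # Bidirectional: the broadcast itself is truncated to its top-k entries.
    return top_k(avg, k) if bidirectional else avg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=10) for _ in range(4)]  # toy gradients from 4 workers
    print("unidirectional:", server_round(grads, k=3, bidirectional=False))
    print("bidirectional: ", server_round(grads, k=3, bidirectional=True))
```

In the bidirectional case the server broadcasts only k components instead of the full averaged gradient, which is the source of the server-to-worker communication saving the abstract refers to; the averaged gradient is generally denser than any single worker's top-k update, so this saving grows with the number of workers.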

Files

Original bundle
Name: Zou_William.pdf
Size: 1.47 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission