Bidirectional TopK Sparsification for Distributed Learning

dc.contributor.advisor: De Sterck, Hans
dc.contributor.advisor: Liu, Jun
dc.contributor.author: Zou, William
dc.date.accessioned: 2022-05-27T17:10:36Z
dc.date.available: 2022-05-27T17:10:36Z
dc.date.issued: 2022-05-27
dc.date.submitted: 2022-05-19
dc.description.abstract: Training large neural networks requires a large amount of time. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to its top K components before sending it to other nodes. Some authors have adapted topK gradient sparsification to the parameter-server framework by applying topK compression in both the worker-to-server and server-to-worker directions, as opposed to only the worker-to-server direction. Current intuition and analysis suggest that adding this extra compression degrades the convergence of the model. We provide a simple counterexample in which iterating with bidirectional topK SGD converges better than iterating with unidirectional topK SGD. We explain this example with the theoretical framework developed by Alistarh et al., remove a critical assumption the authors made in their non-convex convergence analysis of topK SGD, and show that bidirectional topK SGD can achieve the same convergence bound as unidirectional topK SGD under assumptions that are potentially easier to satisfy. We experimentally evaluate unidirectional topK SGD against bidirectional topK SGD and show that, with careful tuning, models trained with bidirectional topK SGD perform just as well as models trained with unidirectional topK SGD. Finally, we provide empirical evidence that the amount of communication saved by adding server-to-worker topK compression is almost linear in the number of workers.
dc.identifier.uri: http://hdl.handle.net/10012/18335
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.relation.uri: code repository used for experiments: https://github.com/wyxzou/Federated-Learning-PyTorch
dc.relation.uri: MNIST dataset: http://yann.lecun.com/exdb/mnist/
dc.relation.uri: Fashion MNIST dataset: https://github.com/zalandoresearch/fashion-mnist
dc.relation.uri: CIFAR10 dataset: https://www.cs.toronto.edu/~kriz/cifar.html
dc.subject: gradient compression
dc.subject: distributed learning
dc.subject: analysis of stochastic gradient descent
dc.title: Bidirectional TopK Sparsification for Distributed Learning
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: Applied Mathematics
uws-etd.degree.discipline: Data Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: De Sterck, Hans
uws.contributor.advisor: Liu, Jun
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
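The abstract above describes topK sparsification applied in the worker-to-server direction (unidirectional) and, additionally, in the server-to-worker direction (bidirectional) of a parameter-server setup. The following is a minimal, illustrative sketch of a single aggregation round under both schemes. It is not the thesis implementation (see the linked code repository for that): the function names top_k and server_round, the use of NumPy, the toy gradients, and the omission of error feedback and learning-rate details are all assumptions made only for illustration.

```python
# Minimal sketch of one parameter-server round with topK sparsification.
# Hypothetical names and toy data; not the thesis code.
import numpy as np

def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]  # indices of the k largest magnitudes
    out[idx] = vec[idx]
    return out

def server_round(worker_grads, k, bidirectional=True):
    """Aggregate worker gradients with worker-to-server topK compression,
    optionally compressing the server-to-worker broadcast as well."""
    # Each worker sends only its top-k gradient components to the server.
    compressed = [top_k(g, k) for g in worker_grads]
    avg = np.mean(compressed, axis=0)
    # Unidirectional: the dense average is broadcast back to all workers.
    # Bidirectional: the broadcast itself is truncated to its top-k entries.
    return top_k(avg, k) if bidirectional else avg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=10) for _ in range(4)]  # toy gradients from 4 workers
    print("unidirectional:", server_round(grads, k=3, bidirectional=False))
    print("bidirectional: ", server_round(grads, k=3, bidirectional=True))
```

In the bidirectional case the server broadcasts only k components instead of the full averaged gradient, which is the source of the server-to-worker communication saving the abstract refers to; the averaged gradient is generally denser than any single worker's top-k update, so this saving grows with the number of workers.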

Files

Original bundle
Name: Zou_William.pdf
Size: 1.47 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission