Sequence Parallelism: Long Sequence Training from System Perspective

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You


Abstract
Transformers achieve promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithmic perspective. In this work, we propose sequence parallelism, a memory-efficient parallelism that addresses this issue from a system perspective instead. Our approach is compatible with most existing parallelisms (e.g., data, pipeline, and tensor parallelism), which means sequence parallelism makes 4D parallelism possible. More importantly, we no longer require a single device to hold the whole sequence. Moreover, combined with efficient attention of linear complexity, our sequence parallelism enables us to train transformers with infinitely long sequences. Specifically, we split the input sequence into multiple chunks and feed each chunk into its corresponding device (i.e., GPU). To compute the attention output, we integrate ring-style communication with the self-attention calculation and propose Ring Self-Attention (RSA). Experiments show that sequence parallelism scales well with batch size and sequence length. Compared with tensor parallelism, our approach achieves a 13.7× larger maximum batch size and a 3.0× longer maximum sequence length when scaling up to 64 NVIDIA P100 GPUs. With efficient attention, sequence parallelism can handle sequences of over 114K tokens, which is over 27× longer than existing efficient-attention works that hold the whole sequence on a single device.
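The following is a minimal, single-process sketch of the Ring Self-Attention idea described in the abstract: the sequence is split into chunks, one per "device", and key/value chunks circulate around a ring so each device can attend its local queries over the full sequence. The function names, the NumPy simulation of devices as list entries, and the final consistency check are illustrative assumptions, not the paper's implementation; a real system would distribute chunks across GPUs and overlap ring point-to-point communication with computation, and would not materialize full score rows as done here for readability.

```python
# Hypothetical sketch of Ring Self-Attention (RSA); "devices" are simulated as list slots.
import numpy as np

def ring_self_attention(q_chunks, k_chunks, v_chunks):
    """q_chunks, k_chunks, v_chunks: lists with one (chunk_len, d) array per device."""
    num_devices = len(q_chunks)
    d = q_chunks[0].shape[-1]
    scores = [[] for _ in range(num_devices)]   # per-device score blocks
    values = [[] for _ in range(num_devices)]   # matching value blocks
    k_ring, v_ring = list(k_chunks), list(v_chunks)
    for _ in range(num_devices):
        for dev in range(num_devices):
            # Each device scores its local queries against the K/V chunk it currently holds.
            scores[dev].append(q_chunks[dev] @ k_ring[dev].T / np.sqrt(d))
            values[dev].append(v_ring[dev])
        # Ring step: every device passes its K/V chunk to the next device.
        k_ring = k_ring[-1:] + k_ring[:-1]
        v_ring = v_ring[-1:] + v_ring[:-1]
    outputs = []
    for dev in range(num_devices):
        s = np.concatenate(scores[dev], axis=-1)        # (chunk_len, full_seq_len)
        p = np.exp(s - s.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)              # softmax over the whole sequence
        v = np.concatenate(values[dev], axis=0)         # (full_seq_len, d)
        outputs.append(p @ v)                           # local chunk's attention output
    return outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, d, devices = 16, 8, 4
    q, k, v = (rng.standard_normal((seq, d)) for _ in range(3))
    chunks = lambda x: np.split(x, devices)
    out = np.concatenate(ring_self_attention(chunks(q), chunks(k), chunks(v)))
    # Reference: standard full-sequence attention computed on one "device".
    s = q @ k.T / np.sqrt(d)
    p = np.exp(s - s.max(axis=-1, keepdims=True)); p /= p.sum(axis=-1, keepdims=True)
    assert np.allclose(out, p @ v)
```

Because each device only ever stores its own query chunk plus one K/V chunk at a time (the score concatenation above is purely for illustration), the per-device memory footprint no longer grows with the full sequence length, which is what allows batch size and sequence length to scale with the number of devices.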
Anthology ID:
2023.acl-long.134
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
2391–2404
URL:
https://aclanthology.org/2023.acl-long.134
DOI:
10.18653/v1/2023.acl-long.134
Cite (ACL):
Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2023. Sequence Parallelism: Long Sequence Training from System Perspective. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Sequence Parallelism: Long Sequence Training from System Perspective (Li et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.134.pdf
Video:
https://aclanthology.org/2023.acl-long.134.mp4