arxiv:2405.18400

Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

Published on May 28, 2024

Abstract

Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing k drafts to the user requires running an expensive language model k times. To alleviate the computation cost of running k inference passes, we propose Superposed Decoding, a new decoding algorithm that generates k drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the k drafts as input to the next decoding step of the language model. At every inference step we combine the k drafts with the top-k tokens to get k^2 new drafts and cache the k most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that k drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least 2.44× faster for k ≥ 3. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Code and more examples are open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.
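To make the decoding loop described in the abstract concrete, below is a minimal, illustrative sketch in PyTorch. It is not the authors' implementation (see the linked repository for that): the function name superposed_decode, the uniform mixing weights, and the replacement of the paper's n-gram interpolation filter with plain log-likelihood ranking are all simplifying assumptions made for illustration. The key idea it demonstrates is that each step feeds a single weighted superposition of the k drafts' most recent token embeddings through the model, so one forward pass serves all k drafts.

```python
# Illustrative sketch of Superposed Decoding (not the authors' code).
# Assumes a Hugging Face-style causal LM; the n-gram interpolation from
# the paper is omitted and candidates are ranked by log-likelihood only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def superposed_decode(model, tokenizer, prompt, k=3, steps=20):
    device = next(model.parameters()).device
    embed = model.get_input_embeddings()
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    prompt_emb = embed(prompt_ids)  # (1, T, d)

    # Each draft is (generated token ids, cumulative log-probability).
    drafts = [([], 0.0) for _ in range(k)]
    weights = torch.full((k,), 1.0 / k, device=device)  # uniform mixing weights

    for _ in range(steps):
        if drafts[0][0]:
            # Superpose the k drafts' generated tokens position by position.
            gen_ids = torch.tensor([ids for ids, _ in drafts], device=device)  # (k, L)
            gen_emb = embed(gen_ids)                                           # (k, L, d)
            mixed = (weights[:, None, None] * gen_emb).sum(0, keepdim=True)    # (1, L, d)
            inputs = torch.cat([prompt_emb, mixed], dim=1)
        else:
            inputs = prompt_emb

        # One forward pass serves all k drafts.
        logits = model(inputs_embeds=inputs).logits[0, -1]
        logprobs = F.log_softmax(logits, dim=-1)
        top_lp, top_ids = logprobs.topk(k)

        # Expand each draft with each of the top-k tokens (k^2 candidates),
        # then keep the k highest-scoring distinct candidates.
        candidates, seen = [], set()
        for ids, score in drafts:
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                new_ids = ids + [tok]
                if tuple(new_ids) not in seen:
                    seen.add(tuple(new_ids))
                    candidates.append((new_ids, score + lp))
        drafts = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

    return [prompt + tokenizer.decode(ids) for ids, _ in drafts]


# Example usage (any causal LM works; GPT-2 shown purely for illustration):
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
for draft in superposed_decode(lm, tok, "The quick brown fox", k=3, steps=10):
    print(draft)
```

Note that this sketch recomputes the full superposed sequence each step for simplicity; a practical implementation would reuse the KV cache so that each step costs a single incremental forward pass, which is what yields the reported speedup over running the model k times.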
