tecture to have a higher latency than the decoder-
only architecture. They also find the Transformer architecture to be marginally better than the LSTM architecture (Hochreiter and Schmidhuber, 1997). Motivated by these findings, we employ a decoder-only, Transformer-based architecture for building our autocomplete model. Trajanovski et al. (2021) leverage word-based autocomplete models to provide email and chat suggestions.
In this work, we focus on building autocomplete
models for broad prompts from domains such as
Wikipedia, where user prompt patterns can have quite low frequency (e.g., a prompt about Bruce Vilanch (Oscars writer) occurs only 6 times).
Unlike our prompt completion task, the query autocompletion task is a well-researched problem (Bar-Yossef and Kraus, 2011; Cai and de Rijke, 2016; Wang et al., 2020; Gog et al., 2020), where the goal is to complete the user's query, e.g., a search query. Since user queries are generally short, query autocomplete models need not track long-range dependencies to understand the user's intent. In contrast, tracking such dependencies is a requirement in our prompt completion setting, as the user prompt can be arbitrarily long, e.g., sentences or paragraphs.
ChatGPT (OpenAI, 2023b) and GPT-4 (OpenAI, 2023a) are recent dialogue models that have garnered great attention from the AI community for their ability to converse with human-like capabilities. The data used to train these models is not disclosed by the authors. Since their training data could well include the test sets we study and a train-test overlap analysis cannot be performed, we cannot make a fair comparison between our work and these 'closed' AI models (Rogers et al., 2023). Models such as Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), and GPT-4-LLM (Peng et al., 2023), which claim to perform similarly to ChatGPT with a few billion parameters, are usually finetuned on outputs from ChatGPT or GPT-4. Hence, these models cannot be fairly compared with our work either.
Efficient Deep Learning. The exponential growth in the size of Transformer-based autoregressive language models (e.g., to 175B parameters (Brown et al., 2020)) has given rise to a strong need to make these models efficient, so they can be used on commodity devices such as laptops, tablets, and mobile phones, which have resource constraints such as peak memory utilization and latency, while yielding the best performance under those constraints. To this end, there
has been extensive research on building efficient
Transformer models that are smaller, faster, and bet-
ter, as summarized thoroughly by Tay et al. (2020)
and Menghani (2021). Our work is focused on im-
proving the efficiency of a natural language gener-
ation task (e.g., autocomplete), which has received
less attention from an efficiency perspective. Wang
et al. (2021) observe that 73% of the overall latency
of autoregressive language models goes to memory-intensive data movement operations (e.g., splitting
heads, transpose, reshape) and conclude that these
models are memory intensive. Since lower-end
edge platforms have tighter memory constraints
than latency constraints (Cai et al.,2020), we fo-
cus on improving the accuracy-memory trade-off
of autocomplete models.
3 Autocomplete – Fundamentals
Problem. Given a text sequence $x = (x_1, \ldots, x_{|x|})$ (user input) with tokens from a fixed vocabulary, $x_i \in \mathcal{V}$, the goal of the autocomplete task is to generate a completion $\hat{x}_{k+1:N}$ such that the resulting sequence $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ resembles a sample from $p^*$, where $p^*(x)$ denotes the reference distribution. $x$ can be arbitrarily large (e.g., paragraphs), while $\hat{x}_{k+1:N}$ is generally short (e.g., three words). Each token $x_k$ can be a word, character, or subword. The vocabulary $\mathcal{V}$ contains the unique tokens from the dataset $\mathcal{D}$, which consists of a finite set of text sequences drawn from $p^*$.
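To make the task concrete, the following is a minimal, illustrative sketch of the autocomplete loop; `next_token` is a hypothetical stand-in for any autoregressive model over $\mathcal{V}$, not our actual model:

```python
# Illustrative sketch of the autocomplete task defined above.
# `next_token` is a hypothetical stand-in for any autoregressive
# language model over the vocabulary V.
from typing import Callable, List

def autocomplete(prefix: List[str],
                 next_token: Callable[[List[str]], str],
                 max_new_tokens: int = 3,
                 eos: str = "<eos>") -> List[str]:
    """Extend the user prefix (x_1, ..., x_k) with a short
    completion (x_{k+1}, ..., x_N), as in the problem statement."""
    sequence = list(prefix)
    for _ in range(max_new_tokens):
        token = next_token(sequence)   # model proposes the next token
        if token == eos:               # stop early at end of sequence
            break
        sequence.append(token)
    return sequence[len(prefix):]      # return only the completion
```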
Data. Most datasets in the autocomplete literature come from domains with focused prompts (e.g., emails (Chen et al., 2019; Trajanovski et al., 2021) and chat messages (Trajanovski et al., 2021)).
In this work, we target the autocomplete task on datasets with broad prompts (e.g., Wikipedia) that contain many low-frequency prompt patterns (e.g., the prompt EACL 2023 conference). Autocomplete models trained to answer broad prompts can be used to assist users in completing documents such as essays, reports, and letters.
Metrics. The commonly used metric for evaluating the quality of an autocomplete model is ExactMatch@$N$ (Rajpurkar et al., 2016), which measures the percentage of predicted suggestions whose first $N$ words exactly match the first $N$ words of the ground-truth suggestion. ExactMatch@Overall (Chen et al., 2019) is a weighted average of the ExactMatch scores for all subsequence lengths up to $K$.
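For reference, the sketch below computes ExactMatch@$N$ as described above; the uniform averaging in `exact_match_overall` is a simplifying assumption, since the actual per-length weights follow Chen et al. (2019):

```python
from typing import List

def exact_match_at_n(preds: List[str], truths: List[str], n: int) -> float:
    """Percentage of predictions whose first n words exactly match
    the first n words of the ground-truth suggestion."""
    hits = 0
    for pred, truth in zip(preds, truths):
        p, t = pred.split()[:n], truth.split()[:n]
        hits += int(len(t) == n and p == t)
    return 100.0 * hits / len(preds)

def exact_match_overall(preds: List[str], truths: List[str], k: int) -> float:
    """Average ExactMatch@n over n = 1..K. NOTE: uniform weights are
    an assumption here; Chen et al. (2019) define the actual weighting."""
    scores = [exact_match_at_n(preds, truths, n) for n in range(1, k + 1)]
    return sum(scores) / len(scores)
```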
For our setting, larger n-grams
are increasingly difficult to predict for both word