Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints

Ganesh Jawahar♣∗, Subhabrata Mukherjee♠, Debadeepta Dey♠,
Muhammad Abdul-Mageed♣♢, Laks V.S. Lakshmanan♣, Caio Cesar Teodoro Mendes♠,
Gustavo Henrique de Rosa♠, Shital Shah♠
♣University of British Columbia, ♠Microsoft, ♢MBZUAI
ganeshjwhr@gmail.com, {laks,amuham01}@cs.ubc.ca,
{Subhabrata.Mukherjee,dedey,caiocesart,gderosa,shitals}@microsoft.com
Abstract
Autocomplete is a task in which the user inputs a piece of text, termed a prompt, on which the model conditions to generate a semantically coherent continuation. Existing works on this task have primarily focused on datasets (e.g., email, chat) with high-frequency user prompt patterns (or focused prompts), where word-based language models have been quite effective. In this work, we study the more challenging setting of low-frequency user prompt patterns (or broad prompts, e.g., a prompt about the 93rd Academy Awards) and demonstrate the effectiveness of character-based language models. We study this problem under memory-constrained settings (e.g., edge devices and smartphones), where the character-based representation is effective in reducing the overall model size (in terms of parameters). We use the WikiText-103 benchmark to simulate broad prompts and demonstrate that character models rival word models in exact match accuracy for the autocomplete task when controlled for model size. For instance, we show that a 20M parameter character model performs similarly to an 80M parameter word model in the vanilla setting. We further propose novel methods to improve character models by incorporating inductive bias in the form of compositional information and representation transfer from large word models. Datasets and code used in this work are available at https://github.com/UBC-NLP/char_autocomplete.
1 Introduction
Autocomplete models are conditioned on user-written prompts or text to generate semantically coherent continuations. For example, given the user input "Filmmaker George Lucas used Tikal as a", a semantically coherent continuation can be "filming location" (Example 1). Autocomplete models can dramatically reduce keystrokes and improve users' productivity in a wide range of applications, including email, chat, and document authoring. Some typical challenges in building a real-time autocomplete model include: (i) processing arbitrary-length user input (e.g., paragraphs), (ii) handling low-frequency user prompt patterns (or broad prompts, which typically cover a wider vocabulary, as in Example 1), and (iii) satisfying memory constraints of the target device (such as a cap on peak memory utilization).
∗Part of this work was done during an internship at Microsoft.
Despite the importance of the task, there has been limited research on autocomplete. Existing works such as Smart Compose (Chen et al., 2019) and Trajanovski et al. (2021) train autoregressive language models on emails and chats, where user prompt patterns tend to be high frequency. That is, the prompts are focused prompts (e.g., a prompt about office standups) that typically cover a narrower vocabulary. All these models are trained at the word level, which leads to two issues: (i) input/output embedding parameters (the less compressible component of the Transformer model (Shen et al., 2020)^1) occupy a significant share (e.g., more than 77%) of the parameter budget due to the large vocabulary size, and (ii) a tendency to memorize high-frequency prompt patterns, resulting in poor generalization on low-frequency ones.
n-gram          unigram   bigram   trigram
WikiText-103      95.44     84.35     60.63
Reddit            86.41     77.04     54.36
Table 1: Percentage of unique out-of-vocabulary (OOV) n-grams in the test set of WikiText-103 (broad prompts) vs. Reddit (focused prompts) datasets.
In this paper, we focus on the autocomplete task for broad prompts from domains such as Wikipedia, where user prompt patterns often have low frequency (e.g., a prompt about the 93rd Academy Awards). For instance, from Table 1, we observe that WikiText-103 (broad prompts) contains at least 10% more unique out-of-vocabulary (OOV) n-grams compared to the Reddit dataset (focused prompts). This makes our task more challenging than the conventional settings considered in prior work, which do one of the following: (i) adopt word-based models that are good at memorizing high-frequency patterns for focused prompts, or (ii) rely on conventional language modeling, which is not geared towards generating precise, short-horizon continuations (see Section 4).
^1 Shen et al. (2020) study the effects of quantization on different components of the Transformer model and its performance on various NLP tasks. They find that the embedding layer is more sensitive to quantization than other components and requires more bits to keep the performance loss acceptable.
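To make the statistic in Table 1 concrete, the sketch below shows one plausible way to compute the percentage of unique test-set n-grams that never appear in the training set. It is illustrative only: the file names and whitespace tokenization are our assumptions, and the paper's exact counting protocol may differ.

```python
from itertools import islice

def ngrams(tokens, n):
    """Yield consecutive n-grams (as tuples) from a list of tokens."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def unique_oov_ngram_pct(train_tokens, test_tokens, n):
    """Percentage of unique test-set n-grams that never occur in the training set."""
    train_set = set(ngrams(train_tokens, n))
    test_set = set(ngrams(test_tokens, n))
    oov = sum(1 for g in test_set if g not in train_set)
    return 100.0 * oov / len(test_set)

# Hypothetical usage with whitespace-tokenized WikiText-103 splits.
train_tokens = open("wiki.train.tokens").read().split()
test_tokens = open("wiki.test.tokens").read().split()
for n in (1, 2, 3):
    print(f"{n}-gram OOV%: {unique_oov_ngram_pct(train_tokens, test_tokens, n):.2f}")
```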
Furthermore, we study this problem for practical applications under memory-constrained settings. Lower-end edge platforms (e.g., a Raspberry Pi with 256MB of memory (Cai et al., 2020)) have memory constraints that are more limiting than latency constraints for supporting various on-device models. Also, given that autoregressive language models are memory-bound (Wang et al., 2021), we focus on improving the accuracy-memory trade-off for the autocomplete task on broad prompts. Our work is complementary to existing work on model compression, including pruning (Gordon et al., 2020), quantization (Han et al., 2016), and distillation (Sanh et al., 2019), which primarily focuses on natural language understanding tasks (e.g., text classification). In contrast to these works, we study the effectiveness of character-based language models for a natural language generation task (autocomplete).
In this paper, we focus on two research questions. RQ1: How do character-based autocomplete models compare against their word-based counterparts under memory constraints? RQ2: How can character-based autocomplete models be improved with no negative impact on memory? We answer RQ1 by showing that, compared to word models, character models (i) contribute 96% fewer parameters in the embedding layer due to a much smaller vocabulary, (ii) work well on low-frequency (or broad) prompt patterns (e.g., a 21% accuracy improvement by using a 20M character model over a 20M word model, see Figure 2(a)), and (iii) result in high savings in peak memory utilization (e.g., 4.7% memory savings by using a 20M character model over a 20M word model, see Figure 2(b)). When controlled for model size (number of parameters), we find that smaller character models (e.g., 20M parameters) perform similarly to large word models (e.g., 80M parameters). We answer RQ2 by developing novel methods to improve the accuracy of character models which, unlike previous work, have minimal impact on memory usage. These methods introduce inductive bias in the form of compositional information and representation transfer from large word models (our best method). We show that the best method achieves 1.12% and 27.3% relative accuracy improvements over the vanilla character and vanilla word models, respectively, with no impact on memory usage. We discuss the limitations of our work in Section 8 and defer the analysis of the accuracy-latency trade-off to future work, focusing only on memory-constrained settings here.
Our major contributions are as follows: (1) To
the best of our knowledge, this is the first study
of the autocomplete task for broad prompts in a
memory-constrained setting. (2) We perform an
extensive comparison of character and word mod-
els across diverse architectures and demonstrate
the advantage of character models over large word
models for the autocomplete task on dimensions
like peak memory utilization and model parame-
ters. (3) We introduce novel methods leveraging
inductive bias to further improve the accuracy of
character models with minimal impact on memory
usage.
2 Related Work
Our work leverages advances in neural language
models, autocompletion, and efficient deep learn-
ing.
Neural Language Models. The autocomplete models we study in this work utilize Transformer-based (Vaswani et al., 2017) autoregressive neural language models as the backbone. Compared to word models, character models lag behind in language modeling performance when controlled for model size (Al-Rfou et al., 2019; Choe et al., 2019) and have high computational complexity due to long sequence lengths (Tay et al., 2022). In this work, we focus on deploying models on lower-end edge platforms (e.g., Raspberry Pi), where memory, as opposed to latency, is the major bottleneck.
Autocomplete Task. Despite the pervasiveness of autocomplete models, there is limited research in the academic community on the autocomplete task. Gmail Smart Compose (Chen et al., 2019) is a popular word-based autocomplete model for email suggestions. Its authors find the encoder-decoder architecture to have higher latency than the decoder-only architecture. They also find the Transformer architecture to be marginally better than the LSTM architecture (Hochreiter and Schmidhuber, 1997). Motivated by these findings, we employ a decoder-only, Transformer-based architecture for building our autocomplete model. Trajanovski et al. (2021) leverage word-based autocomplete models for providing email and chat suggestions.
In this work, we focus on building autocomplete models for broad prompts from domains such as Wikipedia, where user prompt patterns can be quite low frequency (e.g., a prompt about Bruce Vilanch (Oscars writer), which occurs only 6 times). Unlike our prompt completion task, the query autocompletion task is a well-researched problem (Bar-Yossef and Kraus, 2011; Cai and de Rijke, 2016; Wang et al., 2020; Gog et al., 2020), where the goal is to complete the user's query, e.g., a search query. Since user queries are generally short, query autocomplete models need not track long-range dependencies to understand the user's intent. In contrast, this is a requirement in our prompt completion setting, as the user prompt can be arbitrarily long, e.g., sentences or paragraphs.
ChatGPT (OpenAI, 2023b) and GPT-4 (OpenAI, 2023a) are recent dialogue models that have garnered great attention from the AI community for their ability to converse with human-like capability. The data used to train these models are not disclosed by the authors. Since it is entirely possible for their training data to include the test sets we study in this work, and a train-test overlap analysis cannot be performed, we cannot make a fair comparison of our work with these 'closed' AI models (Rogers et al., 2023). Models such as Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), and GPT-4-LLM (Peng et al., 2023), which claim to perform similarly to ChatGPT with a few billion parameters, are usually finetuned on outputs from ChatGPT or GPT-4. Hence, these models cannot be fairly compared with our work either.
Efficient Deep Learning. Exponential growth in the size of Transformer-based autoregressive language models (e.g., 175B parameters (Brown et al., 2020)) has given rise to a strong need to make these models efficient so they can be used on commodity devices like laptops, tablets, and mobile phones, which have various resource constraints such as peak memory utilization and latency, while yielding the best performance under those constraints. To this end, there has been extensive research on building efficient Transformer models that are smaller, faster, and better, as summarized thoroughly by Tay et al. (2020) and Menghani (2021). Our work focuses on improving the efficiency of a natural language generation task (autocomplete), which has received less attention from an efficiency perspective. Wang et al. (2021) observe that 73% of the overall latency of autoregressive language models goes to memory-intensive data movement operations (e.g., splitting heads, transpose, reshape) and conclude that these models are memory intensive. Since lower-end edge platforms have tighter memory constraints than latency constraints (Cai et al., 2020), we focus on improving the accuracy-memory trade-off of autocomplete models.
3 Autocomplete – Fundamentals
Problem. Given a text sequence $x = (x_1, \ldots, x_{|x|})$ (user input) with tokens from a fixed vocabulary, $x_i \in \mathcal{V}$, the goal of the autocomplete task is to generate a completion $\hat{x}_{k+1:N}$ such that the resulting sequence $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ resembles a sample from $p$, where $p(x)$ denotes the reference distribution. $x$ can be arbitrarily long (e.g., paragraphs), while $\hat{x}_{k+1:N}$ is generally short (e.g., three words). Each token $x_k$ can be a word, character, or subword. The vocabulary $\mathcal{V}$ contains the unique tokens from the dataset $\mathcal{D}$, a finite set of text sequences drawn from $p$.
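As a concrete (purely illustrative) instantiation of this notation, the snippet below shows how Example 1 looks under a word-level versus a character-level choice of tokens; the token counts are ours, not taken from the paper.

```python
prompt = "Filmmaker George Lucas used Tikal as a"
suggestion = " filming location"

# Word-level view: V contains words, x_1..x_k is the prompt,
# and \hat{x}_{k+1:N} is the generated continuation.
word_prompt = prompt.split()          # k = 7 prompt tokens
word_suggestion = suggestion.split()  # N - k = 2 generated tokens

# Character-level view: V contains a few hundred symbols, so the same
# prompt and continuation span many more tokens.
char_prompt = list(prompt)            # k = 38 prompt tokens
char_suggestion = list(suggestion)    # N - k = 17 generated tokens

print(len(word_prompt), len(word_suggestion), len(char_prompt), len(char_suggestion))
```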
Data. Most datasets in the autocomplete literature come from domains with focused prompts (e.g., emails (Chen et al., 2019; Trajanovski et al., 2021), chat messages (Trajanovski et al., 2021)). In this work, we target the autocomplete task on datasets with broad prompts (e.g., Wikipedia), which contain many low-frequency prompt patterns (e.g., the prompt EACL 2023 conference). Autocomplete models trained to answer broad prompts can be used to assist users in completing documents such as essays, reports, and letters.
Metrics. The commonly used metric for evaluating the quality of an autocomplete model is ExactMatch@N (Rajpurkar et al., 2016), which measures the percentage of the first N words in the predicted suggestion that exactly match the first N words in the ground truth suggestion. ExactMatch@Overall (Chen et al., 2019) is a weighted average of the ExactMatch for all subsequence lengths up to K. For our setting, larger n-grams are increasingly difficult to predict for both word and character models, as shown in Figure 3. Hence we set K to 3. Since the exact match metric strictly looks for a full match of the subsequence, it is a hard metric to improve on, especially for broad prompts. One can utilize a less stringent metric such as PartialMatch (Trajanovski et al., 2021), which measures the percentage of characters in the first N words of the predicted suggestion that exactly match those of the ground truth suggestion. However, PartialMatch might not adequately penalize grammatical incorrectness in the predicted suggestion. Trajanovski et al. (2021) also utilize metrics that require interactions from real users, which are difficult to acquire in practice. Given that the user-based metrics and the PartialMatch metric correlate strongly with ExactMatch in all the experiments carried out by Trajanovski et al. (2021), we use the exact match metric to quantify the performance of the autocomplete model in this work. We further perform a human evaluation to compare the naturalness and user acceptability of the suggestions generated by different models.^2
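As a rough illustration of these metrics, the sketch below computes ExactMatch@N and a character-level PartialMatch@N for a single prediction. It is our simplification: the paper's implementations (tokenization, the weighting in ExactMatch@Overall, and the normalization of PartialMatch) may differ.

```python
def exact_match_at_n(pred, target, n):
    """1.0 if the first n words of pred exactly match the first n words of target."""
    p, t = pred.split()[:n], target.split()[:n]
    return float(len(t) == n and p == t)

def partial_match_at_n(pred, target, n):
    """Fraction of character positions in the first n target words matched by pred."""
    p = " ".join(pred.split()[:n])
    t = " ".join(target.split()[:n])
    if not t:
        return 0.0
    return sum(a == b for a, b in zip(p, t)) / len(t)

print(exact_match_at_n("filming location in", "filming location for", 2))  # 1.0
print(partial_match_at_n("filming site", "filming location", 2))           # 0.5
```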
Model. We adopt the Transformer architecture, specifically Transformer-XL (Dai et al., 2019), for our autocomplete model. We choose Transformer-XL for two reasons: (i) as Dai et al. (2019) show, the model achieves strong results on word- and character-based language modeling benchmarks, and (ii) the model can handle long text sequences (e.g., 1600 word tokens or 3800 character tokens), which is crucial for handling arbitrarily long user inputs ($x$).
Training. We train a decoder-only Transformer-XL model that conditions on user input to generate the suggestion autoregressively. The parameters $\theta$ of the autocomplete model $p_\theta(x)$ can be optimized using the standard language modeling objective.
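Concretely, the standard language modeling objective is next-token cross-entropy over the training sequences. The following is a generic PyTorch-style sketch, not the paper's Transformer-XL training code (in particular, the segment-level recurrence and memory cache of Transformer-XL are omitted, and `model` is a placeholder that returns per-position vocabulary logits):

```python
import torch
import torch.nn.functional as F

def lm_loss(model, token_ids):
    """Next-token cross-entropy: predict token t from tokens < t.

    token_ids: LongTensor of shape (batch, seq_len), already mapped to
    vocabulary indices (words or characters). `model(inputs)` is assumed
    to return logits of shape (batch, seq_len, vocab_size).
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                    # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time
        targets.reshape(-1),
    )
```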
Inference. During inference, the model $p_\theta(x)$ takes the user input $x_{1:k} \sim p$ and generates the suggestion $\hat{x}_{k+1:N} \sim p_\theta(\cdot \mid x_{1:k})$ such that $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ resembles a sample from $p$. In this work, we use greedy search and select the token that receives the highest probability as the generated token; that is, $\hat{x}_t = \arg\max p_\theta(x_t \mid x_1, \ldots, x_{t-1})$. As shown in Appendix A.5 (see Figure 7), beam search performs poorly on our task, and the trends we observe in the next section do not depend on the choice of decoding algorithm. For simplicity, we assume the autocomplete model generates exactly one suggestion $\hat{x}_{k+1:N}$.
^2 For our final comparison, however, we report PartialMatch vs. ExactMatch (Table 2). We do not experiment with ranking metrics (e.g., mean reciprocal rank) since our autocomplete model produces just a single suggestion.
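The greedy decoding rule above can be realized with a simple autoregressive loop. The sketch below is generic (it ignores the Transformer-XL memory cache and any stopping criterion such as "stop after three words"), and `model` is again a placeholder returning per-position vocabulary logits:

```python
import torch

@torch.no_grad()
def greedy_complete(model, prompt_ids, max_new_tokens=30):
    """Greedily append up to max_new_tokens tokens to the tokenized prompt.

    prompt_ids: LongTensor of shape (1, k) holding the user prompt x_{1:k}.
    Returns only the generated suffix \hat{x}_{k+1:N}.
    """
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                          # (1, len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1)    # highest-probability token
        ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)
    return ids[:, prompt_ids.size(1):]
```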
4 Character vs. Word Model
Existing autocomplete models are primarily word-based, i.e., the representation choice for $x_k$ is a word. Word-based autocomplete models have the following properties: (i) they invest most of the parameters (e.g., more than 77%) of the overall parameter budget in the embedding layer, which is less compressible using standard techniques such as quantization (Shen et al., 2020), and (ii) they can memorize high-frequency prompt patterns and perform well on datasets with focused prompts (e.g., Reddit posts). In this work, we focus on autocompletion for broad prompts, and we aim to keep the parameter allocation to the embedding layer as small as possible, thereby improving the overall memory footprint. To this end, we choose character as the representation and study the memory-accuracy trade-off of character-based models on the autocomplete task for broad prompts. Character-based autocomplete models have several desirable properties compared to their word-based counterparts, as they (i) invest far fewer parameters (e.g., less than 4%) of the parameter budget in the embedding layer and invest most parameters in other, highly compressible Transformer components such as the self-attention network, feedforward network, and softmax layer; (ii) perform well on datasets with broad prompts (as we will show); and (iii) provide a better trade-off between accuracy and memory (model size and peak memory utilization). To demonstrate these properties, we perform extensive experiments on the WikiText-103 benchmark (Merity et al., 2017) (unless stated otherwise). This benchmark contains about 100M tokens from Wikipedia and serves to simulate broad prompts. Since we focus on improving the memory footprint of autocomplete models, we do not experiment with subword models, which introduce a large number of token embeddings in the embedding layer (e.g., 50K) compared to their character-based counterpart. In other words, we focus only on character models, which keep the parameter allocation to the embedding layer as small as possible, thereby improving the overall memory footprint.
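To see why the embedding layer dominates a word model's budget, consider a back-of-the-envelope calculation. The numbers below (hidden size, vocabulary sizes, and a single untied embedding table) are our illustrative assumptions rather than the paper's exact configurations, which use adaptive embedding and adaptive softmax:

```python
def embedding_params(vocab_size, d_model):
    """Parameters in a single vocab_size x d_model embedding table."""
    return vocab_size * d_model

d_model = 256                          # illustrative hidden size
word_vocab, char_vocab = 260_000, 200  # rough WikiText-103 word vocab vs. character set

word_emb = embedding_params(word_vocab, d_model)  # ~66.6M parameters
char_emb = embedding_params(char_vocab, d_model)  # ~0.05M parameters

print(f"word embeddings: {word_emb/1e6:.1f}M "
      f"(~{100 * word_emb / 80e6:.0f}% of an 80M-parameter word model)")
print(f"char embeddings: {char_emb/1e6:.3f}M "
      f"(~{100 * char_emb / 20e6:.2f}% of a 20M-parameter character model)")
```

Even at this coarse level, a full-vocabulary word embedding consumes the large majority of an 80M-parameter budget, while a character embedding is a negligible fraction of a 20M-parameter budget, in line with the >77% and <4% figures above.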
Component-Wise Parameter Breakdown. The Transformer-XL model can be broken down into four components: (i) adaptive embedding, (ii) self-attention, (iii) feedforward network, and (iv) adaptive softmax.