tecture to have a higher latency than the decoder-
only architecture. They also find the Transformer architecture to be marginally better than the LSTM architecture (Hochreiter and Schmidhuber, 1997). Motivated by these findings, we employ a decoder-only, Transformer-based architecture for building our autocomplete model. Trajanovski et al. (2021) leverage word-based autocomplete models to provide email and chat suggestions.
In this work, we focus on building autocomplete
models for broad prompts from domains such as
Wikipedia, where user prompt patterns can have quite low frequency (e.g., a prompt about Bruce Vilanch (Oscars writer) occurs only 6 times).
Unlike our prompt completion task, the query autocompletion task is a well-researched problem (Bar-Yossef and Kraus, 2011; Cai and de Rijke, 2016; Wang et al., 2020; Gog et al., 2020), where the goal is to complete the user's query, e.g., a search query. Since user queries are generally short, query autocomplete models need not track long-range dependencies to understand the user's intent. In contrast, tracking such dependencies is a requirement in our prompt completion setting, as the user prompt can be arbitrarily long, e.g., sentences or paragraphs.
ChatGPT (OpenAI, 2023b) and GPT-4 (OpenAI, 2023a) are recent dialogue models that have garnered great attention from the AI community for their ability to converse with human-like capabilities. The data used to train these models is not disclosed by the authors. Since their training data could well include the test sets we study and a train-test overlap analysis cannot be performed, we cannot make a fair comparison between our work and these 'closed' AI models (Rogers et al., 2023). Models such as Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), and GPT-4-LLM (Peng et al., 2023), which claim to perform similarly to ChatGPT with a few billion parameters, are usually finetuned on outputs from ChatGPT or GPT-4. Hence, these models cannot be fairly compared with our work either.
Efficient Deep Learning. The exponential growth in the size of Transformer-based autoregressive language models (e.g., to 175B parameters (Brown et al., 2020)) has given rise to a strong need to make these models efficient, so they can be used on commodity devices such as laptops, tablets, and mobile phones, which have resource constraints such as peak memory utilization and latency, while yielding the best performance under those constraints. To this end, there
has been extensive research on building efficient
Transformer models that are smaller, faster, and bet-
ter, as summarized thoroughly by Tay et al. (2020)
and Menghani (2021). Our work is focused on im-
proving the efficiency of a natural language gener-
ation task (e.g., autocomplete), which has received
less attention from an efficiency perspective. Wang
et al. (2021) observe that 73% of the overall latency
of autoregressive language models goes to memory-intensive data movement operations (e.g., splitting
heads, transpose, reshape) and conclude that these
models are memory intensive. Since lower-end
edge platforms have tighter memory constraints
than latency constraints (Cai et al.,2020), we fo-
cus on improving the accuracy-memory trade-off
of autocomplete models.
3 Autocomplete – Fundamentals
Problem. Given a text sequence $x = (x_1, \ldots, x_{|x|})$ (user input) with tokens from a fixed vocabulary, $x_i \in \mathcal{V}$, the goal of the autocomplete task is to generate a completion $\hat{x}_{k+1:N}$ such that the resulting sequence $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ resembles a sample from $p^*$, where $p^*(x)$ denotes the reference distribution. $x$ can be arbitrarily large (e.g., paragraphs), while $\hat{x}_{k+1:N}$ is generally short (e.g., three words). Each token $x_k$ can be a word, character, or subword. The vocabulary $\mathcal{V}$ contains the unique tokens from the dataset $\mathcal{D}$, which consists of a finite set of text sequences drawn from $p^*$.
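To make the task concrete, the following is a minimal, illustrative sketch of the autocomplete loop; `next_token` is a hypothetical stand-in for any autoregressive model over $\mathcal{V}$, not our actual model:

```python
# Illustrative sketch of the autocomplete task defined above.
# `next_token` is a hypothetical stand-in for any autoregressive
# language model over the vocabulary V.
from typing import Callable, List

def autocomplete(prefix: List[str],
                 next_token: Callable[[List[str]], str],
                 max_new_tokens: int = 3,
                 eos: str = "<eos>") -> List[str]:
    """Extend the user prefix (x_1, ..., x_k) with a short
    completion (x_{k+1}, ..., x_N), as in the problem statement."""
    sequence = list(prefix)
    for _ in range(max_new_tokens):
        token = next_token(sequence)   # model proposes the next token
        if token == eos:               # stop early at end of sequence
            break
        sequence.append(token)
    return sequence[len(prefix):]      # return only the completion
```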
Data. Most datasets in the autocomplete literature come from domains with focused prompts (e.g., emails (Chen et al., 2019; Trajanovski et al., 2021) and chat messages (Trajanovski et al., 2021)).
In this work, we target the autocomplete task on datasets with broad prompts (e.g., Wikipedia) that contain many low-frequency prompt patterns (e.g., the prompt EACL 2023 conference). Autocomplete models trained to answer broad prompts can be used to assist users in completing documents such as essays, reports, and letters.
Metrics. The commonly used metric for evaluating the quality of an autocomplete model is ExactMatch@$N$ (Rajpurkar et al., 2016), which measures the percentage of predicted suggestions whose first $N$ words exactly match the first $N$ words of the ground-truth suggestion. ExactMatch@Overall (Chen et al., 2019) is a weighted average of the ExactMatch scores for all subsequence lengths up to $K$.
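For reference, the sketch below computes ExactMatch@$N$ as described above; the uniform averaging in `exact_match_overall` is a simplifying assumption, since the actual per-length weights follow Chen et al. (2019):

```python
from typing import List

def exact_match_at_n(preds: List[str], truths: List[str], n: int) -> float:
    """Percentage of predictions whose first n words exactly match
    the first n words of the ground-truth suggestion."""
    hits = 0
    for pred, truth in zip(preds, truths):
        p, t = pred.split()[:n], truth.split()[:n]
        hits += int(len(t) == n and p == t)
    return 100.0 * hits / len(preds)

def exact_match_overall(preds: List[str], truths: List[str], k: int) -> float:
    """Average ExactMatch@n over n = 1..K. NOTE: uniform weights are
    an assumption here; Chen et al. (2019) define the actual weighting."""
    scores = [exact_match_at_n(preds, truths, n) for n in range(1, k + 1)]
    return sum(scores) / len(scores)
```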
For our setting, larger n-grams
are increasingly difficult to predict for both word