Understanding Domain Learning in Language Models
Through Subpopulation Analysis
Zheng Zhao Yftah Ziser Shay B. Cohen
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh, EH8 9AB
{zheng.zhao,yftah.ziser}@ed.ac.uk, scohen@inf.ed.ac.uk
Abstract

We investigate how different domains are encoded in modern neural network architectures. We analyze the relationship between natural language domains, model size, and the amount of training data used. The primary analysis tool we develop is based on subpopulation analysis with Singular Vector Canonical Correlation Analysis (SVCCA), which we apply to Transformer-based language models (LMs). We compare the latent representations of such a language model at its different layers from a pair of models: a model trained on multiple domains (an experimental model) and a model trained on a single domain (a control model). Through our method, we find that increasing the model capacity impacts how domain information is stored in upper and lower layers differently. In addition, we show that larger experimental models simultaneously embed domain-specific information as if they were conjoined control models. These findings are confirmed qualitatively, demonstrating the validity of our method.
1 Introduction
Pre-trained language models (PLMs) have become an essential modeling component for state-of-the-art natural language processing (NLP) models. They process text into latent representations in such a way that allows an NLP practitioner to seamlessly use these representations for prediction problems of various degrees of difficulty (Wang et al., 2018, 2019). The opaqueness in obtaining these representations has been an important research topic in the NLP community. PLMs, and more generally, neural models, are currently studied to understand their process and behavior in obtaining their latent representations. These PLMs are often trained on large datasets, with inputs originating from different sources. In this paper, we further develop our understanding of how neural networks obtain their latent representations and study the effect of learning from various domains on the characteristics of the corresponding latent representations.

Figure 1: An example of a visualization used with our subpopulation analysis tool. (a) Experimental model; (b) Control model. The experimental model, which includes all domain data, separates in its latent representations words related to the Books domain from general words. The control model, on the other hand, mixes them together.
Texts come from various domains that differ in their writing styles, authors, and topics (Plank, 2016). In this work, we follow a simple definition of a domain as a corpus of documents sharing a common topic. We rely on a simple tool of subpopulation analysis to compare and contrast latent representations obtained with and without a specific domain. Our analysis relies on constructing two types of models: experimental models, from multi-domain data, and control models, from single-domain data. Figure 1 describes an example in which this analysis is applied to study the way embeddings for domain-specific words cluster together in the experimental and control model representations.

We believe training in an implicit multi-domain setup is widespread and often overlooked. For example, SQuAD (Rajpurkar et al., 2016), a widely used question-answering dataset composed of Wikipedia articles from multiple domains, is often referred to as a single-domain dataset in domain adaptation works for simplicity (Hazen et al., 2019; Shakeri et al., 2020; Yue et al., 2021). This scenario is also common in text summarization, where
many datasets consist of a bundle of domains for news articles (Grusky et al., 2018), academic papers (Cohan et al., 2018; Fonseca et al., 2022), and do-it-yourself (DIY) guides (Cohen et al., 2021). While models that learn from multiple domains are frequently used, their nature and behavior have hardly been explored.
Our work sheds light on the way state-of-the-art multi-domain models encode domain-specific information. We focus on two main aspects highly relevant for many training procedures: model capacity and data size. We discover that model capacity, indicated by the number of its parameters, strongly impacts the amount of domain-specific information multi-domain models store. This property might explain the performance gains of larger models (Devlin et al., 2019; Raffel et al., 2020; Clark et al., 2020; Srivastava et al., 2022). While this paper focuses on studying the effect of domains on latent representations, the subpopulation analysis tool could be used for studying other NLP setups, such as multitask and multimodal learning.¹

¹ Our code is available at: https://github.com/zsquaredz/subpopulation_analysis
2 Methodology
For an integer $n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. Our analysis tool assumes a distribution $p(X)$ from which a set of examples $X = \{x^{(i)} \mid i \in [n]\}$ is drawn. It also assumes a family of binary indicators $\pi_1, \ldots, \pi_d$ such that $\pi_i(x)$ indicates whether the example $x$ satisfies a certain subpopulation attribute $i$. For example, in this paper we focus on domain analysis, so $\pi_5$ could indicate if an example belongs to a Books domain. We denote by $X_{\pi_i}$ the set $\{x^{(j)} \mid \pi_i(x^{(j)}) = 1\}$, the subset of $X$ that satisfies attribute $i$. Unlike standard diagnostic classifier methods (Belinkov et al., 2017a,b; Giulianelli et al., 2018), rather than building a model to predict the attribute, we perform subpopulation analysis by training a set of models: $E$, trained from $X$ (the experimental model), and $C_i$, trained from $X_{\pi_i}$ (the control model). We borrow the terminology of "experimental" and "control" from experimental design such as in clinical trials (Hinkelmann and Kempthorne, 2007). The experimental model corresponds to the experimental (or "treatment" in the case of medical trials) group in such trials, and the control model corresponds to the control group. Unlike a standard experimental design, rather than comparing a function (such as squared difference) between the outcomes of the two groups to calculate a statistic with an underlying distribution, we instead calculate similarity values between the representations of the two models. Our analysis is also related to Representational Similarity Analysis (Dimsdale-Zucker and Ranganath, 2018), aimed at studying similarities (across different experimental settings) between activation levels in brain neurons.
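As a minimal illustration (not the implementation used in our experiments), the following Python sketch shows how the subsets $X_{\pi_i}$ and the training sets for $E$ and the $C_i$ might be assembled; the `domain` field on each example and the specific domain names are assumptions made purely for illustration.

```python
# Minimal sketch: constructing training sets for the experimental model E
# and the control models C_i. Each example is assumed to be a dict with a
# "domain" field (an illustrative assumption).

DOMAINS = ["Books", "Clothing", "Electronics", "Home", "Movies"]

def make_indicator(domain):
    """Return pi_i: x -> 1 if example x belongs to `domain`, else 0."""
    return lambda x: int(x["domain"] == domain)

def build_training_sets(X):
    """X is the full multi-domain sample drawn from p(X)."""
    indicators = {d: make_indicator(d) for d in DOMAINS}
    experimental_set = list(X)  # E is trained on all of X
    control_sets = {            # C_i is trained on X_{pi_i} only
        d: [x for x in X if pi(x) == 1] for d, pi in indicators.items()
    }
    return experimental_set, control_sets
```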
Through their latent representations, the set of models $C_i$ represent the information that is captured about $p(X)$ from the relevant subpopulation of data. By comparing the different models to each other, we can learn what information is captured in the latent representations when a subset of the data is used, and whether this information is different from the one captured when the whole set of data is used. With a proper control for model size and subpopulation sizes, we can determine the relationship between the different attributes $\pi_i$ and the corresponding representations in different model components. The remaining question now is: how do we compare these representations? Here, we follow previous work (Saphra and Lopez, 2019; Bau et al., 2019; Kudugunta et al., 2019), and apply Singular Vector Canonical Correlation Analysis (SVCCA; Raghu et al. 2017) to the latent representations of the experimental and control models.
We assume that each example $x^{(i)}$ is associated with a latent representation $h^{(i)}_j$ given by $C_j$. For example, this could be the representation in the embedding layer for the input example, or the representation in the final pre-output layer. We define $H_j$ to be a set of latent representations $H_j = \{h^{(k)}_j \mid k \in [n]\}$ for model $C_j$. We define $H_j^{\pi_i} = \{h^{(k)}_j \mid \pi_i(x^{(k)}) = 1\}$, the latent representations of $C_j$ for which attribute $i$ fires. Similarly, we define $H_0$ for the model $E$. We calculate the SVCCA value between subsets of $H_0$ and subsets of $H_j$ for $j \geq 1$. The SVCCA procedure in this case is as follows:

- Performing Singular Value Decomposition (SVD) on the matrix forms of $H_0$ and $H_j$ (matching the representations in each through the index of the example $x^{(i)}$ from which they originate). We use the lowest number of principal directions that preserves 99% of the variance in the data to project the latent representations.

- Performing Canonical Correlation Analysis (CCA; Hardoon et al. 2004) between the projections of the latent representations from the SVD step, and calculating the average correlation value, denoted by $\rho_{0j}$.
The SVD step, which may seem redundant, is actually crucial, as it has been shown that low-variance directions in neural network representations are primarily noise (Raghu et al., 2017; Frankle and Carbin, 2019). The magnitude of $\rho_{0j}$ indicates the level of overlap between the latent representations of the two models (Saphra and Lopez, 2019).
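The two-step procedure above can be written compactly. The sketch below is a simplified NumPy re-implementation intended only to illustrate the computation (it is not the exact code used for our experiments): it keeps the smallest number of singular directions preserving 99% of the variance, then averages the canonical correlations obtained from a QR-based CCA.

```python
import numpy as np

def svcca(H0, Hj, var_threshold=0.99):
    """SVCCA between two representation matrices of shape (n_examples, dim).

    Rows must be aligned: row k of H0 and row k of Hj come from the same
    input example x^(k).
    """
    def svd_reduce(H):
        Hc = H - H.mean(axis=0, keepdims=True)
        U, S, Vt = np.linalg.svd(Hc, full_matrices=False)
        # Smallest number of directions preserving `var_threshold` of the variance.
        var = np.cumsum(S ** 2) / np.sum(S ** 2)
        k = int(np.searchsorted(var, var_threshold)) + 1
        return Hc @ Vt[:k].T  # project onto the top-k principal directions

    A = svd_reduce(H0)
    B = svd_reduce(Hj)

    # CCA via QR: the singular values of Qa^T Qb are the canonical correlations.
    Qa, _ = np.linalg.qr(A - A.mean(axis=0))
    Qb, _ = np.linalg.qr(B - B.mean(axis=0))
    corrs = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    corrs = np.clip(corrs, 0.0, 1.0)
    return float(corrs.mean())  # rho_0j: average canonical correlation
```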
In the rest of this paper, we use the tool of subpopulation analysis with $E$/$C_i$ as above for the case of domain learning in neural networks. We note that each time we use this tool, the following decisions need to be made: (a) what training set we use for each $E$ and $C_i$; (b) the subset of $H_j$ for $j \geq 0$ for which we perform the similarity analysis; (c) the component in the model from which we take the latent representations. For (c), the component can be, for example, a layer. Indeed, for most of our experiments, we use the first and last layer to create the latent representation sets, as they stand in stark contrast to each other in their behavior (see §4). We provide an illustration of our proposed pipeline in Figure 2. We are particularly interested in studying the effect of two aspects of learning: dataset size and model capacity.
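For decision (c), per-layer representations can be read off a Transformer LM by requesting all hidden states. The sketch below uses the Hugging Face transformers API and mean-pools over non-padding tokens to obtain one vector per example; the pooling choice, batch size, and checkpoint path are illustrative assumptions rather than a description of our exact extraction procedure.

```python
import torch
from transformers import BertTokenizerFast, BertModel

def layer_representations(model_dir, sentences, layer, device="cpu"):
    """Mean-pooled hidden states from one layer, shape (n_sentences, hidden_dim).

    `layer` = 0 gives the embedding layer and `layer` = 12 the last layer of a
    12-layer BERT. Mean pooling over non-padding tokens is an illustrative
    choice, not necessarily the exact setup used in the paper.
    """
    tokenizer = BertTokenizerFast.from_pretrained(model_dir)
    model = BertModel.from_pretrained(model_dir, output_hidden_states=True)
    model = model.to(device).eval()
    reps = []
    with torch.no_grad():
        for i in range(0, len(sentences), 32):
            batch = tokenizer(sentences[i:i + 32], padding=True, truncation=True,
                              return_tensors="pt").to(device)
            hidden = model(**batch).hidden_states[layer]   # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)   # (B, T, 1)
            pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens
            reps.append(pooled.cpu())
    return torch.cat(reps).numpy()
```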
The case of domains: In this paper, we define a domain as a corpus of documents with a common topic. Since a single massive web-crawled corpus used to pre-train language models usually contains many domains, we examine to what extent domain-specific information is encoded in the pre-trained model learned on this corpus. Such domain membership is indicated by our attribute functions $\pi_i$. For example, we may use $\pi_5(x)$ to indicate whether $x$ is an input example from the Books domain. Given this notion of a domain, we can readily use subpopulation analysis through experimental and control models to analyze the effect on neural representations of learning from multiple domains or a single domain.
3 Experimental Setup
Data: We use the Amazon Reviews dataset (Ni et al., 2019), a dataset that facilitates research in tasks like sentiment analysis (Zhang et al., 2020), aspect-based sentiment analysis, and recommendation systems (Wang et al., 2020). The reviews in this dataset are explicitly divided into different product categories that serve as domains, which makes it a natural testbed for many multi-domain studies. A noteworthy example of a research field that heavily relies on this dataset is domain adaptation (Blitzer et al., 2007; Ziser and Reichart, 2018; Du et al., 2020; Lekhtman et al., 2021; Long et al., 2022), which is the task of learning robust models across different domains, closely related to our research.² We sort the domains by their review counts and pick the top five, which results in the Books, Clothing Shoes and Jewelry, Electronics, Home and Kitchen, and Movies and TV domains. To further validate our data quality, we use the 5-core subset of the data, ensuring that all reviewed items have at least five reviews authored by reviewers who wrote at least five reviews.

Figure 2: A diagram explaining the analysis we perform. At the top, during training, we create two sets of models from constrained datasets (based on different $\pi_i$) and a dataset that is not constrained. The result of this training is two sets of models: the experimental model ($E$) and the control models ($C_i$). To perform the similarity analysis, we compute latent representations from a common test set for both models, and then run SVCCA (bottom).

² We use the latest version of the dataset, consisting of reviews from 1996 up to 2019.
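The domain and review selection just described (top-five categories, 5-core subset, a fixed number of reviews per domain) can be sketched roughly as follows; the file names, the `reviewText` field, and the sample size are assumptions for illustration and should be adapted to the actual data release.

```python
import gzip
import json
import random

# Illustrative sketch: per-category 5-core files of the Amazon Reviews dataset
# (Ni et al., 2019) are assumed to be local *.json.gz files; paths are placeholders.
CATEGORY_FILES = {
    "Books": "Books_5.json.gz",
    "Clothing_Shoes_and_Jewelry": "Clothing_Shoes_and_Jewelry_5.json.gz",
    "Electronics": "Electronics_5.json.gz",
    "Home_and_Kitchen": "Home_and_Kitchen_5.json.gz",
    "Movies_and_TV": "Movies_and_TV_5.json.gz",
}

def sample_reviews(path, k, seed=42):
    """Load review texts from a gzipped JSON-lines file and sample k of them.

    For very large categories a streaming/reservoir sample would be preferable;
    this version keeps everything in memory for simplicity.
    """
    with gzip.open(path, "rt") as f:
        texts = [json.loads(line).get("reviewText", "") for line in f]
    random.Random(seed).shuffle(texts)
    return texts[:k]

corpus = {d: sample_reviews(p, k=10_000) for d, p in CATEGORY_FILES.items()}
```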
A representative dataset sample is presented in Table 1.

Table 1: A representative sample of review snippets.

Books: . . . the book didn't have a proper ending but rather a rushed attempt to conclude the story and put everyone away neatly . . .

Clothing: . . . clearly of awful quality, the design and paint was totally wrong, the mask was short and stumpy as well as slightly deformed and bent to the left . . .

Home: . . . there are no handles, and the plastic gets too hot to hold, so you have to awkwardly pour by the top . . .

We consider the different domains within the Amazon Reviews dataset as lexical domains, i.e., domains that share a similar textual structure and functionality but differ with respect to their vocabulary. For example, the review snippet from the Books domain contains an aspect ("ending") for which a negative sentiment is conveyed ("didn't have a proper"). Similarly, we find an aspect ("handle") with a corresponding conveyed sentiment ("too hot") for the Home domain. We can see this shared pattern across all domains, with different aspects and sentiment terms. We would not expect this to be the case for other datasets, which might have different differentiators for domains. For example, Amazon reviews and Wikipedia pages on the Books domain may have a similar vocabulary; however, a review is more likely to convey sentiment toward a particular book, and a Wikipedia article is more likely to focus on describing the book. Thus, the Amazon Reviews dataset is an ideal testbed for our analysis.
In addition to the Amazon Reviews dataset, we experimented on the WikiSum dataset (Cohen et al., 2021) to further validate our findings. The WikiSum dataset is a coherent-paragraph summarization dataset based on the WikiHow website.³ WikiHow consists of do-it-yourself (DIY) guides for the general public, and is thus written using simple English and ranges over many domains. Similar to Amazon Reviews, we also pick the top five domains for our experiments: Education, Food, Health, Home, and Pets. Since the dataset is designed for summarization, we concatenate the document and summary together for our MLM task. We present the results for this dataset at the end of §4.

³ https://www.wikihow.com
Task: We study the language modeling task to better understand the nature of multi-domain learning. More precisely, we experiment with the masked language modeling (MLM) task, whose training objective is to randomly mask some of the tokens from the input and then predict the masked words based on their context. We focus on the MLM task as it is a prevalent pre-training task for many standard models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), that serve as building blocks for many downstream tasks. Using examples from a set of pre-defined domains, we train a BERT model from scratch to fully control our experiment and isolate the effect of different domains. This is crucial since a pre-trained BERT model is already trained on multiple domains, which makes it hard to draw correct conclusions from such a model through our analysis. Moreover, recent studies (Magar and Schwartz, 2022; Brown et al., 2020) showed the risk of exposure of large language models to test data in the pre-training phase, also known as data contamination.
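To give a concrete picture of the MLM objective, the sketch below sets up the standard random-token masking with the transformers data collator. The tokenizer checkpoint and the 15% masking probability are the common BERT defaults and are assumptions here, not details confirmed by this paper.

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

# Standard BERT-style MLM masking: a fraction of tokens is selected, and each
# selected token is replaced by [MASK] (80%), a random token (10%), or left
# unchanged (10%). The 0.15 rate below is the usual default (an assumption).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

batch = tokenizer(["the book didn't have a proper ending"],
                  return_tensors="pt")
masked = collator([{k: v[0] for k, v in batch.items()}])
# `masked["input_ids"]` typically contains [MASK] tokens; `masked["labels"]`
# holds the original ids at masked positions and -100 elsewhere.
```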
Model: We use the BERT-BASE (Devlin et al., 2019) architecture for all of our experiments. We train two types of models: the experimental model $E$, trained on all five domains with the MLM objective, and the control models $C_i$ for $i \in [5]$, each trained on the $i$th domain. We are particularly interested in the effect of two aspects on the model representation: model capacity and data size. We take the BERT-BASE size as 100% capacity. BERT-BASE has 768-dimensional vectors for each layer, adding up to a total of 110M parameters. We also experiment with reduced model capacities of 75%, 50%, 25%, and 10% by reducing the dimension of the hidden layers. We follow the design choices of Devlin et al. (2019), e.g., 12 layers with 12 attention heads per layer. We set the base training data size (100%) for $E$ to be 50K reviews, composed of 10K reviews per domain. Each $C_i$ is trained on single-domain data containing 10K reviews; $E$ and $C_i$ share all the examples of domain $i$. To study the effect of data size on model representation, we take subsets from the data split and create smaller datasets: a 10% split and a 50% split. We also create a 200% split to simulate the case with abundant data. We provide additional details about our training procedure in Appendix A.
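As an illustration of how reduced-capacity variants can be instantiated, the sketch below builds an untrained BERT-style MLM model with a scaled hidden dimension using transformers. How the capacity percentages map to concrete hidden sizes (and keeping the 4x feed-forward ratio) is our own assumption for illustration; it is not a published configuration.

```python
from transformers import BertConfig, BertForMaskedLM

def make_mlm_model(capacity=1.0, base_hidden=768, heads=12, layers=12):
    """Build a randomly initialized BERT-style MLM model with a scaled hidden size.

    The hidden size is rounded so that it stays divisible by the number of
    attention heads; the capacity-to-dimension mapping is illustrative only.
    """
    hidden = max(heads, int(round(base_hidden * capacity / heads)) * heads)
    config = BertConfig(
        hidden_size=hidden,
        num_hidden_layers=layers,
        num_attention_heads=heads,
        intermediate_size=4 * hidden,  # keep the usual 4x feed-forward ratio
    )
    return BertForMaskedLM(config)

model_full = make_mlm_model(1.0)  # roughly BERT-BASE sized
model_half = make_mlm_model(0.5)  # hidden size 384
```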
4 Experiments and Results
Our research questions (RQs) examine how domain-specific information is encoded in $E$ by calculating its SVCCA score with $C_i$ for a specific $i$. For a given domain, we use a held-out test set to obtain the experimental and control model representations that serve as input to the SVCCA method.
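Putting the pieces together, each research question reduces to a per-domain comparison loop like the sketch below, which reuses the illustrative `layer_representations` and `svcca` helpers from the earlier sketches; the checkpoint paths and domain names are placeholders, not the paths used in our experiments.

```python
# Hypothetical model directories; names are placeholders.
DOMAINS = ["Books", "Clothing", "Electronics", "Home", "Movies"]
EXPERIMENTAL_DIR = "checkpoints/experimental"
CONTROL_DIR = "checkpoints/control_{domain}"

def domain_scores(test_sets, layer):
    """SVCCA(E, C_i) per domain, computed on a shared held-out test set per domain."""
    scores = {}
    for domain in DOMAINS:
        sentences = test_sets[domain]
        h_e = layer_representations(EXPERIMENTAL_DIR, sentences, layer)
        h_c = layer_representations(CONTROL_DIR.format(domain=domain), sentences, layer)
        scores[domain] = svcca(h_e, h_c)
    return scores
```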
Intuitively, a high SVCCA score between $E$ and $C_i$ indicates that $E$ stores domain-specific information for domain $i$, as $C_i$ was trained solely on data from domain $i$. A low SVCCA score between $E$ and $C_i$ could mean one of two things: (a) $E$ can generalize to data from domain $i$ without explicitly storing domain-specific information about it, or (b) $E$ can