
Books: . . . the book didn’t have a proper ending but rather a rushed attempt to conclude the story and put everyone away neatly . . .
Clothing: . . . clearly of awful quality, the design and paint was totally wrong, the mask was short and stumpy as well as slightly deformed and bent to the left . . .
Home: . . . there are no handles, and the plastic gets too hot to hold, so you have to awkwardly pour by the top . . .

Table 1: A representative sample of review snippets.
have a proper”). Similarly, we find an aspect (“handle”) with a corresponding conveyed sentiment (“too hot”) for the Home domain. We see this shared pattern across all domains, with different aspects and sentiment terms. We would not expect this to be the case for other datasets, which might have different differentiators between domains. For example, Amazon reviews and Wikipedia pages in the Books domain may share a similar vocabulary; however, a review is more likely to convey sentiment toward a particular book, while a Wikipedia article is more likely to focus on describing the book. Thus, the Amazon Reviews dataset is an ideal testbed for our analysis.
In addition to the Amazon Reviews dataset, we experimented on the WikiSum dataset (Cohen et al., 2021) to further validate our findings. The WikiSum dataset is a coherent paragraph summarization dataset based on the WikiHow website (https://www.wikihow.com). WikiHow consists of do-it-yourself (DIY) guides for the general public and is thus written in simple English and ranges over many domains. Similar to Amazon Reviews, we also pick the top five domains for our experiments: Education, Food, Health, Home, and Pets. Since the dataset is designed for summarization, we concatenate the document and summary together for our MLM task. We present the results for this dataset at the end of § 4.
Task
We study the language modeling task to better understand the nature of multi-domain learning. More precisely, we experiment with the masked language modeling (MLM) task, whose training objective is to randomly mask some of the tokens in the input and predict the masked words based on their context. We focus on the MLM task as it is a prevalent pre-training task for many standard models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), that serve as building blocks for many downstream tasks. Using examples from a set of pre-defined domains, we train a BERT model from scratch to fully control our experiment and isolate the effect of different domains. This is crucial since a pre-trained BERT model has already been trained on multiple domains, making it hard to draw correct conclusions from an analysis of such a model. Moreover, recent studies (Magar and Schwartz, 2022; Brown et al., 2020) showed the risk of exposing large language models to test data during the pre-training phase, also known as data contamination.
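To make the objective concrete, the following is a minimal sketch of the masking step. It assumes the standard BERT recipe of masking roughly 15% of the tokens (Devlin et al., 2019) and simplifies it by always substituting [MASK] (Devlin et al. additionally use random or unchanged replacements for a fraction of masked positions); the helper name and hyperparameters are illustrative, not the exact training setup.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0)):
    """Randomly replace tokens with [MASK]; the model is trained to recover them.

    Returns (masked_tokens, labels), where labels hold the original token at
    masked positions and None elsewhere, so the loss is computed only on the
    masked slots.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)      # target: predict the original token
        else:
            masked.append(tok)
            labels.append(None)     # position ignored by the MLM loss
    return masked, labels

# Example usage on a review snippet from Table 1:
# mask_tokens("the plastic gets too hot to hold".split())
```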
Model
We use the BERT_BASE (Devlin et al., 2019) architecture for all of our experiments. We train two types of models: the experimental model E, trained on all five domains with the MLM objective, and the control models C_i for i ∈ [5], each trained on the i-th domain only. We are particularly interested in the effect of two factors on the model representation: model capacity and data size. We define 100% capacity as the BERT_BASE size: BERT_BASE has 768-dimensional vectors in each layer, adding up to a total of 110M parameters. We also experiment with reduced model capacities of 75%, 50%, 25%, and 10% by reducing the dimension of the hidden layers. We follow the design choices of Devlin et al. (2019), e.g., 12 layers with 12 attention heads per layer. We set the base training data size (100%) for E to 50K reviews, composed of 10K reviews per domain. Each C_i is trained on single-domain data containing 10K reviews; E and C_i share all the examples of domain i. To study the effect of data size on the model representation, we take subsets of the data split and create smaller datasets: a 10% split and a 50% split. We also create a 200% split to simulate the case of abundant data. We provide additional details about our training procedure in Appendix A.
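As a concrete illustration of the capacity reduction, below is a minimal sketch using the HuggingFace transformers library. The text only states that the hidden-layer dimensions are reduced; scaling the feed-forward size proportionally and rounding the hidden size to a multiple of the number of attention heads are our assumptions, and the helper name is hypothetical.

```python
from transformers import BertConfig, BertForMaskedLM

def reduced_bert(capacity: float, num_heads: int = 12, num_layers: int = 12):
    """Build a randomly initialized BERT MLM model with hidden size scaled by `capacity`.

    The scaled hidden size is rounded down to a multiple of the head count,
    since BERT requires hidden_size % num_attention_heads == 0.
    """
    base_hidden = 768
    hidden = max(num_heads, (int(base_hidden * capacity) // num_heads) * num_heads)
    config = BertConfig(
        hidden_size=hidden,
        num_hidden_layers=num_layers,
        num_attention_heads=num_heads,
        intermediate_size=4 * hidden,   # keep the usual 4x feed-forward ratio
    )
    return BertForMaskedLM(config)

model_full = reduced_bert(1.0)   # 100% capacity, i.e., BERT_BASE (~110M params)
model_half = reduced_bert(0.5)   # 50% capacity: 384-dimensional hidden layers
```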
4 Experiments and Results
Our research questions (RQs) examine how domain-specific information is encoded in E by calculating its SVCCA score with C_i for a specific i. For a given domain, we use a held-out test set to obtain the experimental and control model representations, which serve as input to the SVCCA method.
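For reference, the following is a minimal sketch of one way to compute an SVCCA-style score between two such representation matrices: each matrix is first reduced with SVD to the directions explaining most of its variance, CCA is then applied to the reduced representations, and the score is the mean canonical correlation. The variance threshold, the choice of layer, and the helper names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca_score(X, Y, var_kept=0.99):
    """X, Y: (n_examples, n_neurons) activations of one layer from two models."""
    def svd_reduce(A):
        A = A - A.mean(axis=0)                        # center each neuron
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
        keep = int(np.searchsorted(ratio, var_kept)) + 1
        return U[:, :keep] * s[:keep]                 # project onto top directions

    Xr, Yr = svd_reduce(X), svd_reduce(Y)
    k = min(Xr.shape[1], Yr.shape[1])
    Xc, Yc = CCA(n_components=k, max_iter=2000).fit_transform(Xr, Yr)
    # SVCCA score: mean correlation over the k canonical component pairs.
    corrs = [np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(k)]
    return float(np.mean(corrs))

# score = svcca_score(reps_E, reps_Ci)  # high -> E stores domain-i information
```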
Intuitively, a high SVCCA score between E and C_i indicates that E stores domain-specific information for domain i, as C_i was trained solely on data from domain i. A low SVCCA score between E and C_i could mean one of two things: a) E can generalize to data from d_i without explicitly storing domain-specific information about it, or b) E can