
Books: . . . the book didn’t have a proper ending but rather a rushed attempt to conclude the story and put everyone away neatly . . .
Clothing: . . . clearly of awful quality, the design and paint was totally wrong, the mask was short and stumpy as well as slightly deformed and bent to the left . . .
Home: . . . there are no handles, and the plastic gets too hot to hold, so you have to awkwardly pour by the top . . .

Table 1: A representative sample of review snippets.
have a proper”). Similarly, we find an aspect (“handle”) with a corresponding conveyed sentiment (“too hot”) for the Home domain. We see this shared pattern across all domains, with different aspects and sentiment terms. We would not expect this to be the case for other datasets, which might have different differentiators between domains. For example, Amazon reviews and Wikipedia pages in the Books domain may share a similar vocabulary; however, a review is more likely to convey sentiment toward a particular book, while a Wikipedia article is more likely to focus on describing the book. Thus, the Amazon Reviews dataset is an ideal testbed for our analysis.
In addition to the Amazon Reviews dataset, we experimented on the WikiSum dataset (Cohen et al., 2021) to further validate our findings. The WikiSum dataset is a coherent paragraph summarization dataset based on the WikiHow website (https://www.wikihow.com). WikiHow consists of do-it-yourself (DIY) guides for the general public and is thus written in simple English and ranges over many domains. Similar to Amazon Reviews, we also pick the top five domains for our experiments: Education, Food, Health, Home, and Pets. Since the dataset is designed for summarization, we concatenate the document and summary together for our MLM task. We present the results for this dataset at the end of § 4.
Task
We study the language modeling task to better understand the nature of multi-domain learning. More precisely, we experiment with the masked language modeling (MLM) task, whose training objective is to randomly mask some of the tokens in the input and predict the masked words based on their context. We focus on the MLM task as it is a prevalent pre-training task for many standard models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), that serve as building blocks for many downstream tasks. Using examples from a set of pre-defined domains, we train a BERT model from scratch to fully control our experiment and isolate the effect of different domains. This is crucial since a pre-trained BERT model has already been trained on multiple domains, making it hard to draw correct conclusions from an analysis of such a model. Moreover, recent studies (Magar and Schwartz, 2022; Brown et al., 2020) showed the risk of exposing large language models to test data during the pre-training phase, also known as data contamination.
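To make the objective concrete, the following is a minimal sketch of the masking step. It assumes the standard BERT recipe of masking roughly 15% of the tokens (Devlin et al., 2019) and simplifies it by always substituting [MASK] (Devlin et al. additionally use random or unchanged replacements for a fraction of masked positions); the helper name and hyperparameters are illustrative, not the exact training setup.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0)):
    """Randomly replace tokens with [MASK]; the model is trained to recover them.

    Returns (masked_tokens, labels), where labels hold the original token at
    masked positions and None elsewhere, so the loss is computed only on the
    masked slots.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)      # target: predict the original token
        else:
            masked.append(tok)
            labels.append(None)     # position ignored by the MLM loss
    return masked, labels

# Example usage on a review snippet from Table 1:
# mask_tokens("the plastic gets too hot to hold".split())
```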
Model
We use the BERT_BASE (Devlin et al., 2019) architecture for all of our experiments. We train two types of models: the experimental model E, trained on all five domains with the MLM objective, and the control models C_i for i ∈ [5], each trained on the i-th domain only. We are particularly interested in the effect of two factors on the model representation: model capacity and data size. We define 100% capacity as the BERT_BASE size: BERT_BASE has 768-dimensional vectors in each layer, adding up to a total of 110M parameters. We also experiment with reduced model capacities of 75%, 50%, 25%, and 10% by reducing the dimension of the hidden layers. We follow the design choices of Devlin et al. (2019), e.g., 12 layers with 12 attention heads per layer. We set the base training data size (100%) for E to 50K reviews, composed of 10K reviews per domain. Each C_i is trained on single-domain data containing 10K reviews; E and C_i share all the examples of domain i. To study the effect of data size on the model representation, we take subsets of the data split and create smaller datasets: a 10% split and a 50% split. We also create a 200% split to simulate the case of abundant data. We provide additional details about our training procedure in Appendix A.
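As a concrete illustration of the capacity reduction, below is a minimal sketch using the HuggingFace transformers library. The text only states that the hidden-layer dimensions are reduced; scaling the feed-forward size proportionally and rounding the hidden size to a multiple of the number of attention heads are our assumptions, and the helper name is hypothetical.

```python
from transformers import BertConfig, BertForMaskedLM

def reduced_bert(capacity: float, num_heads: int = 12, num_layers: int = 12):
    """Build a randomly initialized BERT MLM model with hidden size scaled by `capacity`.

    The scaled hidden size is rounded down to a multiple of the head count,
    since BERT requires hidden_size % num_attention_heads == 0.
    """
    base_hidden = 768
    hidden = max(num_heads, (int(base_hidden * capacity) // num_heads) * num_heads)
    config = BertConfig(
        hidden_size=hidden,
        num_hidden_layers=num_layers,
        num_attention_heads=num_heads,
        intermediate_size=4 * hidden,   # keep the usual 4x feed-forward ratio
    )
    return BertForMaskedLM(config)

model_full = reduced_bert(1.0)   # 100% capacity, i.e., BERT_BASE (~110M params)
model_half = reduced_bert(0.5)   # 50% capacity: 384-dimensional hidden layers
```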
4 Experiments and Results
Our research questions (RQs) examine how domain-specific information is encoded in E by calculating its SVCCA score with C_i for a specific i. For a given domain, we use a held-out test set to obtain the experimental and control model representations, which serve as input to the SVCCA method.
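For reference, the following is a minimal sketch of one way to compute an SVCCA-style score between two such representation matrices: each matrix is first reduced with SVD to the directions explaining most of its variance, CCA is then applied to the reduced representations, and the score is the mean canonical correlation. The variance threshold, the choice of layer, and the helper names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca_score(X, Y, var_kept=0.99):
    """X, Y: (n_examples, n_neurons) activations of one layer from two models."""
    def svd_reduce(A):
        A = A - A.mean(axis=0)                        # center each neuron
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
        keep = int(np.searchsorted(ratio, var_kept)) + 1
        return U[:, :keep] * s[:keep]                 # project onto top directions

    Xr, Yr = svd_reduce(X), svd_reduce(Y)
    k = min(Xr.shape[1], Yr.shape[1])
    Xc, Yc = CCA(n_components=k, max_iter=2000).fit_transform(Xr, Yr)
    # SVCCA score: mean correlation over the k canonical component pairs.
    corrs = [np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(k)]
    return float(np.mean(corrs))

# score = svcca_score(reps_E, reps_Ci)  # high -> E stores domain-i information
```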
Intuitively, a high SVCCA score between E and C_i indicates that E stores domain-specific information for domain i, as C_i was trained solely on data from domain i. A low SVCCA score between E and C_i could mean one of two things: a) E can generalize to data from d_i without explicitly storing domain-specific information about it, or b) E can