Can Demographic Factors Improve Text Classification?
Revisiting Demographic Adaptation in the Age of Transformers
Chia-Chien Hung1,5, Anne Lauscher2, Dirk Hovy3,
Simone Paolo Ponzetto1 and Goran Glavaš4
1Data and Web Science Group, University of Mannheim, Germany
2Data Science Group, University of Hamburg, Germany
3MilaNLP, Bocconi University, Italy 4CAIDAS, University of Würzburg, Germany
5NEC Laboratories Europe GmbH, Heidelberg, Germany
{chia-chien.hung, ponzetto}@uni-mannheim.de
anne.lauscher@uni-hamburg.de, dirk.hovy@unibocconi.it
goran.glavas@uni-wuerzburg.de
Abstract
Demographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating demographic factors can consistently improve performance for various NLP tasks with traditional NLP models. In this work, we investigate whether these previous findings still hold with state-of-the-art pretrained Transformer-based language models (PLMs). We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers (e.g., domain-specific or geographic knowledge). We adapt the language representations for the demographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning for adaptation, where we couple language modeling objectives with the prediction of demographic classes. Our results, when employing a multilingual PLM, show substantial gains in task performance across four languages (English, German, French, and Danish), which is consistent with the results of previous work. However, controlling for confounding factors – primarily domain and language proficiency of Transformer-based PLMs – shows that downstream performance gains from our demographic adaptation do not actually stem from demographic knowledge. Our results indicate that demographic specialization of PLMs, while holding promise for positive societal impact, still represents an unsolved problem for (modern) NLP.
1 Introduction
Demographic factors like social class, education, income, age, or gender categorize people into specific groups or populations. At the same time, demographic factors both shape and are reflected in our language (e.g., Trudgill, 2000; Eckert and McConnell-Ginet, 2013). A large body of work has focused on modeling demographic language variation, especially the correlations between words and demographic factors (Bamman et al., 2014; Garimella et al., 2017; Welch et al., 2020, inter alia). In a similar vein, Volkova et al. (2013) and Hovy (2015) demonstrated that explicitly incorporating demographic information in language representations improves performance on downstream NLP tasks, e.g., topic classification or sentiment analysis. However, these observations rely on approaches that leverage gender-specific lexica to specialize word embeddings and text encoders (e.g., recurrent networks) that have not been pretrained for (general-purpose) language understanding. To date, the benefits of demographic specialization have not been tested with Transformer-based (Vaswani et al., 2017) pretrained language models (PLMs), which have been shown to excel on the vast majority of NLP tasks and even surpass human performance in some cases (Wang et al., 2018).

More recent studies focus mainly on monolingual English datasets and introduce demographic features in task-specific fine-tuning (Voigt et al., 2018; Buechel et al., 2018), which limits the benefits of demographic knowledge to the tasks at hand. In this work, we investigate the (task-agnostic) demographic specialization of PLMs, aiming to impart the associations between demographic categories and linguistic phenomena into the PLMs' parameters. If successful, such specialization could benefit any downstream NLP task in which demographic factors (i.e., demographically conditioned language phenomena) matter. For this, we adopt intermediate training paradigms that have been proven effective for the specialization of PLMs for other types of knowledge, e.g., in domain, language, and geographic adaptation (Glavaš et al., 2020; Hung et al., 2022a; Hofmann et al., 2022). To this effect, we perform (i) continued language modeling on text
corpora produced by a demographic group and (ii) dynamic multi-task learning (Kendall et al., 2018), wherein we combine language modeling with the prediction of demographic categories.

We evaluate the effectiveness of the demographic PLM specialization on both intrinsic (demographic category prediction) and extrinsic (sentiment classification and topic detection) evaluation tasks across four languages: English, German, French, and Danish, using a multilingual corpus of reviews (Hovy et al., 2015) annotated with demographic information. In line with earlier findings (Hovy, 2015), our initial experiments based on a multilingual PLM (mBERT; Devlin et al., 2019) render demographic specialization effective: we observe gains in most tasks and settings. Through a set of controlled experiments, where we (1) adapt with in-domain language modeling alone, without leveraging demographic information, (2) demographically specialize monolingual PLMs of evaluation languages, (3) carry out a meta-regression analysis over dimensions that drive the performance, and (4) analyze the topology of the representation spaces of demographically specialized PLMs, we show, however, that most of the original gains can be attributed to confounding effects of language and/or domain specialization.

Our findings indicate that specialization approaches, proven effective for other types of knowledge, fail to adequately instill demographic knowledge into PLMs, making demographic specialization of NLP models an open problem in the age of large pretrained Transformers. Our research code is publicly available at: https://github.com/umanlp/SocioAdapt.
2 Demographic Adaptation
Our goal is to inject demographic knowledge through intermediate PLM training in a task-agnostic manner. To achieve this goal, we train the PLM in a dynamic multi-task learning setup (Kendall et al., 2018), in which we couple masked language modeling (MLM-ing) with predicting the demographic category – gender or age group – of the text author. Such a multi-task learning setup is designed to force the PLM to learn associations between language constructs and demographic groups, if these associations are salient in the training corpora.
Masked Language Modeling (MLM). Following successful work on pretraining via language modeling for domain adaptation (Gururangan et al., 2020; Hung et al., 2022a), we investigate the effect of running standard MLM-ing on the text corpora of a specific demographic dimension (e.g., gender-related corpora). We compute the MLM loss $\mathcal{L}_{mlm}$ in the common way, as the negative log-likelihood of the true token probability.
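As a rough illustration, the sketch below shows what such a continued MLM step could look like with HuggingFace Transformers. It is a minimal sketch under our own assumptions: the corpus file name, output directory, and epoch count are placeholders, not the exact configuration used in our experiments.

```python
# Minimal sketch: continued MLM on a demographic sub-corpus with HuggingFace
# Transformers. File names and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# E.g., reviews written by one demographic group (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "reviews_female.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Dynamic masking of 15% of the tokens; the MLM loss is the negative
# log-likelihood of the true tokens at the masked positions.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-demographic",
                           per_device_train_batch_size=32,
                           num_train_epochs=30),  # early stopping omitted for brevity
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
```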
Demographic Category Prediction. In the multi-task learning setup, the representation of the input text, as output by the Transformer, is additionally fed into a classification head that predicts the corresponding demographic category: age (below 35 and above 45¹) and gender (female and male). The demographic prediction loss $\mathcal{L}_{dem}$ is computed as the standard binary cross-entropy loss.

We experiment with two different ways of predicting the demographic category of the text: (i) from the transformed representation of the sequence start token ([CLS]) and (ii) from the contextualized representations of each masked token. We hypothesized that the former variant, in which we predict the demographic class from the [CLS] token representation, would establish links between more complex demographically conditioned linguistic phenomena (e.g., syntactic patterns or patterns of compositional semantics that a demographic group might exhibit), whereas the latter – predicting the demographic class from representations of masked tokens – is more likely to establish simpler lexical links, i.e., capture the vocabulary differences between the demographic groups.

¹ As suggested by Hovy (2015), the split for the age ranges results in roughly equally-sized data sets for each sub-group and is non-contiguous, avoiding fuzzy boundaries.
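To make the two variants concrete, here is a simplified PyTorch sketch of the corresponding prediction heads. The module and variable names are our own illustration, not taken from the released code, and the surrounding model wiring is omitted.

```python
import torch
import torch.nn as nn

class DemographicHead(nn.Module):
    """Illustrative sketch of the two demographic-prediction variants."""

    def __init__(self, hidden_size: int, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward_seq(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # DS-Seq: predict the demographic class from the [CLS] (first token)
        # representation of the whole sequence.
        cls_repr = hidden_states[:, 0, :]                  # (batch, hidden)
        return self.classifier(cls_repr)                   # (batch, num_classes)

    def forward_tok(self, hidden_states: torch.Tensor,
                    mask_positions: torch.Tensor) -> torch.Tensor:
        # DS-Tok: predict the demographic class from the contextualized
        # representation of every masked token.
        batch_idx, tok_idx = mask_positions.nonzero(as_tuple=True)
        masked_repr = hidden_states[batch_idx, tok_idx]    # (num_masked, hidden)
        return self.classifier(masked_repr)                # (num_masked, num_classes)
```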
Multi-Task Learning. Since both losses can be computed from the same input instances, we opt for joint multi-task learning (MTL) and resort to dynamic MTL based on the homoscedastic uncertainty of the losses, wherein the loss variances are used to balance the contributions of the tasks (Kendall et al., 2018). The intuition is that more effective MTL occurs if we dynamically assign less importance to more uncertain tasks, as opposed to assigning uniform task weights throughout the whole training. Homoscedastic uncertainty weighting in MTL has been effective in different NLP settings (Lauscher et al., 2018; Hofmann et al., 2022). In our scenario, $\mathcal{L}_{mlm}$ and $\mathcal{L}_{dem}$ are measured on different scales, which means the model would favor (i.e., be more confident in) one objective over the other. The confidence level of the model's prediction for each task also changes throughout the training progress: this makes dynamic weighting desirable. We dynamically prioritize the tasks via homoscedastic uncertainties $\sigma_t$:

$$\tilde{\mathcal{L}}_t = \frac{1}{2\sigma_t^2}\mathcal{L}_t + \log \sigma_t, \quad (1)$$

where $\sigma_t^2$ is the variance of the task-specific loss over training instances, quantifying the uncertainty of the task $t \in \{mlm, dem\}$. In practice, we train the network to predict the log variance, $\eta_t := \log \sigma_t^2$, since it is more numerically stable than regressing the variance $\sigma_t^2$, as the log avoids divisions by zero. The adjusted losses are then computed as:

$$\tilde{\mathcal{L}}_t = \frac{1}{2}\left(e^{-\eta_t}\mathcal{L}_t + \eta_t\right). \quad (2)$$

The final loss we minimize is the sum of the two uncertainty-adjusted losses: $\tilde{\mathcal{L}}_{mlm} + \tilde{\mathcal{L}}_{dem}$.
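For concreteness, a minimal PyTorch sketch of this uncertainty-weighted loss combination (our own illustration of Eq. (2), not the released implementation) could look as follows.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Dynamic MTL weighting via homoscedastic uncertainty (Kendall et al., 2018).

    eta_mlm and eta_dem are learned log-variances (eta_t = log sigma_t^2)
    for the MLM and demographic-prediction losses, respectively.
    """

    def __init__(self):
        super().__init__()
        self.eta_mlm = nn.Parameter(torch.zeros(()))
        self.eta_dem = nn.Parameter(torch.zeros(()))

    def forward(self, loss_mlm: torch.Tensor, loss_dem: torch.Tensor) -> torch.Tensor:
        # Adjusted loss per task: 0.5 * (exp(-eta_t) * L_t + eta_t), cf. Eq. (2).
        adj_mlm = 0.5 * (torch.exp(-self.eta_mlm) * loss_mlm + self.eta_mlm)
        adj_dem = 0.5 * (torch.exp(-self.eta_dem) * loss_dem + self.eta_dem)
        return adj_mlm + adj_dem
```

The η parameters are optimized jointly with the PLM weights, so the effective task weights adapt as the per-task uncertainties change during training.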
3 Experimental Setup
Here we describe evaluation tasks and provide details on the data used for demographic specialization and downstream evaluation.
Evaluation Tasks. We follow Hovy (2015) and measure the effects of demographic specialization of PLMs on three text-classification tasks, coupling intrinsic demographic attribute classification (AC) with two extrinsic text classification tasks: sentiment analysis (SA) and topic detection (TD). As an intrinsic evaluation task, AC directly tests if the intermediate demographic specialization results in a PLM that can be more effectively fine-tuned to predict the same demographic classes used in the intermediate specialization: the PLMs (vanilla PLM and our demographically specialized counterpart) are fine-tuned in a supervised fashion to predict the demographic class (gender or age) of the text author. SA is a ternary classification task in which reviews with ratings of 1, 3, and 5 stars represent instances of the negative, neutral, and positive class, respectively. TD classifies texts into 5 different topic categories. We report the F1-measure for each task following Hovy (2015).
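Purely as an illustration of the task setup, the snippet below maps star ratings to the ternary sentiment labels described above and scores predictions with F1; the macro-averaging mode is our assumption, not a detail stated here.

```python
# Illustrative only: mapping star ratings to ternary sentiment labels and
# scoring predictions with F1 (macro-averaging is an assumption).
from sklearn.metrics import f1_score

rating_to_label = {1: "negative", 3: "neutral", 5: "positive"}

y_true = [rating_to_label[r] for r in [1, 5, 3, 5, 1]]
y_pred = ["negative", "positive", "positive", "positive", "neutral"]

print(f1_score(y_true, y_pred, average="macro"))
```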
Data. We carry out our core experimentation on the multilingual demographically labeled dataset of reviews (Hovy et al., 2015), created from the internationally popular user review website Trustpilot.² For comparison and consistency, we work with exactly the same data portions as Hovy (2015): collections that cover (1) the two most prominent demographic dimensions – gender and age, with two categories each (gender: male or female; age: below 35 or above 45³) – and (2) five countries (four languages): United States (US), Denmark, Germany, France, and United Kingdom (UK).

To avoid any information leakage, we ensure – for each country-demographic dimension collection (e.g., US, gender) – that there is zero overlap between the portions we select for intermediate demographic specialization and the portions used for downstream fine-tuning and evaluation (for AC, SA, and TD). For TD, we aim to eliminate the confounding effect of demographically-conditioned label distributions (e.g., female authors wrote reviews for clothing stores more frequently than male authors; vice versa for electronics & technology). To this effect, we select, for each country, reviews from the five most frequent topics and sample the same number of reviews in each topic for both demographic groups (i.e., male and female for gender; below 35 and above 45 for age). For the intrinsic AC task (i.e., fine-tuning to predict either the gender or age category), we report the results for two different review collections: the first is the set of reviews that have, besides the demographic classes, been annotated with sentiment labels (we refer to this as AC-SA), and the second is the set of reviews that have topic labels (i.e., product/service category; we refer to this portion as AC-TD). For these fine-tuning and evaluation datasets, we make sure that the two demographic classes (male and female for gender, below 35 and above 45 for age) are equally represented in each dataset portion (train, development, and test). Table 1 displays the numbers of reviews for each country, demographic aspect, and dataset portion (specialization vs. fine-tuning).

² https://www.trustpilot.com/
³ As suggested by Hovy (2015), the split for the age ranges results in roughly equally-sized data sets for each sub-group and is non-contiguous, avoiding fuzzy boundaries.
For intermediate specialization of the multilingual model, we randomly sample 100K instances per demographic group from the gender specialization portion and 50K instances each from the texts reserved for age specialization, concatenated across all 5 countries. For the specialization of monolingual PLMs, we randomly sample the same number of instances but from the specialization portions of a single country. Following the established procedure (e.g., Devlin et al., 2019; Liu et al., 2019), we dynamically mask 15% of the tokens in the demographic specialization portions for MLM.

                               gender                                          age
Country   Language   Spec. F     Spec. M     SA, AC-SA   TD, AC-TD   Spec. <35   Spec. >45   SA, AC-SA   TD, AC-TD
Denmark   Danish     1,596,816   2,022,349   250,485     120,805     833,657     494,905     75,300      44,815
France    French     489,778     614,495     67,305      55,570      40,448      36,182      6,570       6,120
Germany   German     210,718     284,399     28,920      30,580      66,342      47,308      5,865       8,040
UK        English    1,665,167   1,632,894   156,630     183,995     231,905     274,528     26,325      22,095
US        English    575,951     778,877     72,270      61,585      124,924     70,015      6,495       12,090

Table 1: Number of instances in different portions of the Trustpilot dataset (Hovy et al., 2015) used in our experiments. For each country (Denmark, France, Germany, UK, and US), we report the size of the specialization and fine-tuning portions, the latter for each of the two extrinsic tasks: Sentiment Analysis (SA) and Topic Detection (TD). Note that we use the same SA and TD reviews for the intrinsic AC tasks of predicting the demographic categories (denoted AC-SA and AC-TD, respectively). Numbers are shown separately for the two demographic dimensions: gender and age. For the fine-tuning datasets (for SA/AC-SA and TD/AC-TD), we indicate the number of instances in each category (which is the same for both categories: F and M for gender, <35 and >45 for age). We split the fine-tuning datasets randomly into train, validation, and test portions in a 60/20/20 ratio.
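The balanced sampling described above could be sketched roughly as follows; this is our own simplified illustration, and the field names (e.g., "gender") are hypothetical rather than the actual dataset schema.

```python
# Illustrative sketch of balanced per-group sampling for the specialization
# corpora (field names and sizes are placeholders).
import random
from collections import defaultdict

def sample_balanced(reviews, group_field, per_group, seed=42):
    """Sample the same number of reviews from each demographic group.

    reviews: list of dicts, e.g. {"text": "...", "gender": "F", "age_group": "<35"}
    group_field: "gender" or "age_group"
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for review in reviews:
        by_group[review[group_field]].append(review)
    sampled = []
    for group, items in by_group.items():
        sampled.extend(rng.sample(items, per_group))
    return sampled

# E.g., 100K reviews per gender group for the multilingual specialization corpus:
# specialization_corpus = sample_balanced(all_reviews, "gender", per_group=100_000)
```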
Pre-trained language models. Given that we experiment with Trustpilot data in four different languages, in our core experiments we resorted to multilingual BERT (mBERT)⁴ (Devlin et al., 2019) as the starting PLM. This allows us to merge the (fairly large) specialization portions of Trustpilot in different languages (see Table 1) and run a single multilingual demographic specialization procedure on the combined multilingual review corpus. We then fine-tune the demographically specialized mBERT and evaluate downstream task performance separately for each of the five countries (using the train, development, and test portions of the respective country). We report the results for two different variants of our dynamic multi-task demographic specialization (DS): (1) when the demographic category is predicted from representations of masked tokens (DS-Tok) and (2) when we predict the demographic category from the encoding of the whole sequence (i.e., the review; this version is denoted DS-Seq). We compare these demographically specialized PLM variants against two baselines: the vanilla PLM and a PLM specialized on the same review corpora as our MTL variants but only via MLM-ing (i.e., without providing the demographic signal).

⁴ We load the bert-base-multilingual-cased weights from HuggingFace Transformers.
Training and Optimization. In demographic specialization training, we fix the maximum sequence length to 128 subword tokens. We train for 30 epochs in batches of 32 instances and search for the optimal learning rate among the following values: $\{5\cdot10^{-5}, 1\cdot10^{-5}, 1\cdot10^{-6}\}$. We apply early stopping based on the development set performance: we stop if the joint MTL loss does not improve for 3 epochs. For downstream fine-tuning and evaluation, we train for a maximum of 20 epochs in batches of 32. We search for the optimal learning rate among the following values: $\{5\cdot10^{-5}, 1\cdot10^{-5}, 5\cdot10^{-6}, 1\cdot10^{-6}\}$ and apply early stopping based on the validation set performance (patience: 5 epochs). We use AdamW (Loshchilov and Hutter, 2019) as the optimization algorithm.
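As a rough sketch of one downstream fine-tuning configuration (using HuggingFace's TrainingArguments; the output directory, dataset wiring, and exact argument choices are our own illustration and may differ from the released code):

```python
# Illustrative sketch of one configuration from the fine-tuning search space.
from transformers import EarlyStoppingCallback, TrainingArguments

learning_rates = [5e-5, 1e-5, 5e-6, 1e-6]    # search space described above

args = TrainingArguments(
    output_dir="finetune-sa",                # placeholder name
    per_device_train_batch_size=32,
    num_train_epochs=20,
    learning_rate=learning_rates[1],
    evaluation_strategy="epoch",             # argument name may vary across versions
    save_strategy="epoch",
    load_best_model_at_end=True,             # keep the checkpoint best on the dev set
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)

# These would be passed to a Trainer (default optimizer: AdamW) together with
# the (demographically specialized) PLM and the task-specific train/dev datasets.
```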
4 Results and Discussion
We first discuss the results of multilingual demographic specialization with mBERT as the PLM (§4.1). We then provide a series of control experiments in which we isolate the effects that contribute to performance gains of demographically specialized PLMs (§4.2).

4.1 Multilingual Specialization Results

Table 2 shows the results of gender- and age-specialized mBERT variants – DS-Seq and DS-Tok – on gender and age classification (AC-SA and AC-TD) as intrinsic tasks, together with sentiment analysis (SA) and topic detection (TD) as extrinsic evaluation tasks, for each of the five countries encompassed by the Trustpilot datasets (Hovy et al., 2015). The performance of DS-Seq and DS-Tok is compared against the PLM baselines that have not been exposed to demographic information: vanilla mBERT and mBERT with additional MLM-ing on the same Trustpilot data on which DS-Seq and DS-Tok were trained.

Our demographically specialized models generally outperform the vanilla mBERT across the