Can Demographic Factors Improve Text Classification?
Revisiting Demographic Adaptation in the Age of Transformers
Chia-Chien Hung1,5, Anne Lauscher2, Dirk Hovy3,
Simone Paolo Ponzetto1 and Goran Glavaš4
1Data and Web Science Group, University of Mannheim, Germany
2Data Science Group, University of Hamburg, Germany
3MilaNLP, Bocconi University, Italy 4CAIDAS, University of Würzburg, Germany
5NEC Laboratories Europe GmbH, Heidelberg, Germany
{chia-chien.hung, ponzetto}@uni-mannheim.de
anne.lauscher@uni-hamburg.de, dirk.hovy@unibocconi.it
goran.glavas@uni-wuerzburg.de
Abstract
Demographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating demographic factors can consistently improve performance for various NLP tasks with traditional NLP models. In this work, we investigate whether these previous findings still hold with state-of-the-art pretrained Transformer-based language models (PLMs). We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers (e.g., domain-specific or geographic knowledge). We adapt the language representations for the demographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning for adaptation, where we couple language modeling objectives with the prediction of demographic classes. Our results, when employing a multilingual PLM, show substantial gains in task performance across four languages (English, German, French, and Danish), which is consistent with the results of previous work. However, controlling for confounding factors – primarily domain and language proficiency of Transformer-based PLMs – shows that downstream performance gains from our demographic adaptation do not actually stem from demographic knowledge. Our results indicate that demographic specialization of PLMs, while holding promise for positive societal impact, still represents an unsolved problem for (modern) NLP.
1 Introduction
Demographic factors like social class, education, income, age, or gender categorize people into specific groups or populations. At the same time, demographic factors both shape and are reflected in our language (e.g., Trudgill, 2000; Eckert and McConnell-Ginet, 2013). A large body of work has focused on modeling demographic language variation, especially the correlations between words and demographic factors (Bamman et al., 2014; Garimella et al., 2017; Welch et al., 2020, inter alia). In a similar vein, Volkova et al. (2013) and Hovy (2015) demonstrated that explicitly incorporating demographic information in language representations improves performance on downstream NLP tasks, e.g., topic classification or sentiment analysis. However, these observations rely on approaches that leverage gender-specific lexica to specialize word embeddings and text encoders (e.g., recurrent networks) that have not been pretrained for (general-purpose) language understanding. To date, the benefits of demographic specialization have not been tested with Transformer-based (Vaswani et al., 2017) pretrained language models (PLMs), which have been shown to excel on the vast majority of NLP tasks and even surpass human performance in some cases (Wang et al., 2018).

More recent studies focus mainly on monolingual English datasets and introduce demographic features in task-specific fine-tuning (Voigt et al., 2018; Buechel et al., 2018), which limits the benefits of demographic knowledge to the tasks at hand. In this work, we investigate the (task-agnostic) demographic specialization of PLMs, aiming to impart the associations between demographic categories and linguistic phenomena into the PLMs' parameters. If successful, such specialization could benefit any downstream NLP task in which demographic factors (i.e., demographically conditioned language phenomena) matter. For this, we adopt intermediate training paradigms that have been proven effective for the specialization of PLMs for other types of knowledge, e.g., in domain, language, and geographic adaptation (Glavaš et al., 2020; Hung et al., 2022a; Hofmann et al., 2022). To this effect, we perform (i) continued language modeling on text
corpora produced by a demographic group and (ii) dynamic multi-task learning (Kendall et al., 2018), wherein we combine language modeling with the prediction of demographic categories.

We evaluate the effectiveness of the demographic PLM specialization on both intrinsic (demographic category prediction) and extrinsic (sentiment classification and topic detection) evaluation tasks across four languages: English, German, French, and Danish, using a multilingual corpus of reviews (Hovy et al., 2015) annotated with demographic information. In line with earlier findings (Hovy, 2015), our initial experiments based on a multilingual PLM (mBERT; Devlin et al., 2019) render demographic specialization effective: we observe gains in most tasks and settings. Through a set of controlled experiments, where we (1) adapt with in-domain language modeling alone, without leveraging demographic information, (2) demographically specialize monolingual PLMs of evaluation languages, (3) carry out a meta-regression analysis over dimensions that drive the performance, and (4) analyze the topology of the representation spaces of demographically specialized PLMs, we show, however, that most of the original gains can be attributed to confounding effects of language and/or domain specialization.

Our findings indicate that specialization approaches, proven effective for other types of knowledge, fail to adequately instill demographic knowledge into PLMs, making demographic specialization of NLP models an open problem in the age of large pretrained Transformers. Our research code is publicly available at: https://github.com/umanlp/SocioAdapt.
2 Demographic Adaptation
Our goal is to inject demographic knowledge through intermediate PLM training in a task-agnostic manner. To achieve this goal, we train the PLM in a dynamic multi-task learning setup (Kendall et al., 2018), in which we couple masked language modeling (MLM-ing) with predicting the demographic category – gender or age group – of the text author. Such a multi-task learning setup is designed to force the PLM to learn associations between language constructs and demographic groups, if these associations are salient in the training corpora.
Masked Language Modeling (MLM). Following successful work on pretraining via language modeling for domain adaptation (Gururangan et al., 2020; Hung et al., 2022a), we investigate the effect of running standard MLM-ing on the text corpora of a specific demographic dimension (e.g., gender-related corpora). We compute the MLM loss $\mathcal{L}_{mlm}$ in the common way, as the negative log-likelihood of the true token probability.
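As a rough illustration, the sketch below shows what such a continued MLM step could look like with HuggingFace Transformers. It is a minimal sketch under our own assumptions: the corpus file name, output directory, and epoch count are placeholders, not the exact configuration used in our experiments.

```python
# Minimal sketch: continued MLM on a demographic sub-corpus with HuggingFace
# Transformers. File names and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# E.g., reviews written by one demographic group (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "reviews_female.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Dynamic masking of 15% of the tokens; the MLM loss is the negative
# log-likelihood of the true tokens at the masked positions.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-demographic",
                           per_device_train_batch_size=32,
                           num_train_epochs=30),  # early stopping omitted for brevity
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
```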
Demographic Category Prediction. In the multi-task learning setup, the representation of the input text, as output by the Transformer, is additionally fed into a classification head that predicts the corresponding demographic category: age (below 35 and above 45¹) and gender (female and male). The demographic prediction loss $\mathcal{L}_{dem}$ is computed as the standard binary cross-entropy loss.

We experiment with two different ways of predicting the demographic category of the text: (i) from the transformed representation of the sequence start token ([CLS]) and (ii) from the contextualized representations of each masked token. We hypothesized that the former variant, in which we predict the demographic class from the [CLS] token representation, would establish links between more complex demographically conditioned linguistic phenomena (e.g., syntactic patterns or patterns of compositional semantics that a demographic group might exhibit), whereas the latter – predicting the demographic class from representations of masked tokens – is more likely to establish simpler lexical links, i.e., capture the vocabulary differences between the demographic groups.

¹ As suggested by Hovy (2015), the split for the age ranges results in roughly equally-sized data sets for each sub-group and is non-contiguous, avoiding fuzzy boundaries.
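To make the two variants concrete, here is a simplified PyTorch sketch of the corresponding prediction heads. The module and variable names are our own illustration, not taken from the released code, and the surrounding model wiring is omitted.

```python
import torch
import torch.nn as nn

class DemographicHead(nn.Module):
    """Illustrative sketch of the two demographic-prediction variants."""

    def __init__(self, hidden_size: int, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward_seq(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # DS-Seq: predict the demographic class from the [CLS] (first token)
        # representation of the whole sequence.
        cls_repr = hidden_states[:, 0, :]                  # (batch, hidden)
        return self.classifier(cls_repr)                   # (batch, num_classes)

    def forward_tok(self, hidden_states: torch.Tensor,
                    mask_positions: torch.Tensor) -> torch.Tensor:
        # DS-Tok: predict the demographic class from the contextualized
        # representation of every masked token.
        batch_idx, tok_idx = mask_positions.nonzero(as_tuple=True)
        masked_repr = hidden_states[batch_idx, tok_idx]    # (num_masked, hidden)
        return self.classifier(masked_repr)                # (num_masked, num_classes)
```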
Multi-Task Learning. Since both losses can be computed from the same input instances, we opt for joint multi-task learning (MTL) and resort to dynamic MTL based on the homoscedastic uncertainty of the losses, wherein the loss variances are used to balance the contributions of the tasks (Kendall et al., 2018). The intuition is that more effective MTL occurs if we dynamically assign less importance to more uncertain tasks, as opposed to assigning uniform task weights throughout the whole training. Homoscedastic uncertainty weighting in MTL has been effective in different NLP settings (Lauscher et al., 2018; Hofmann et al., 2022). In our scenario, $\mathcal{L}_{mlm}$ and $\mathcal{L}_{dem}$ are measured on different scales, which means the model would favor (i.e., be more confident in) one objective over the other. The confidence level of the model's prediction for each task also changes throughout the training progress: this makes dynamic weighting desirable. We dynamically prioritize the tasks via homoscedastic uncertainties $\sigma_t$:

$$\tilde{\mathcal{L}}_t = \frac{1}{2\sigma_t^2}\mathcal{L}_t + \log \sigma_t, \quad (1)$$

where $\sigma_t^2$ is the variance of the task-specific loss over training instances, quantifying the uncertainty of the task $t \in \{mlm, dem\}$. In practice, we train the network to predict the log variance, $\eta_t := \log \sigma_t^2$, since it is more numerically stable than regressing the variance $\sigma_t^2$, as the log avoids divisions by zero. The adjusted losses are then computed as:

$$\tilde{\mathcal{L}}_t = \frac{1}{2}\left(e^{-\eta_t}\mathcal{L}_t + \eta_t\right). \quad (2)$$

The final loss we minimize is the sum of the two uncertainty-adjusted losses: $\tilde{\mathcal{L}}_{mlm} + \tilde{\mathcal{L}}_{dem}$.
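For concreteness, a minimal PyTorch sketch of this uncertainty-weighted loss combination (our own illustration of Eq. (2), not the released implementation) could look as follows.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Dynamic MTL weighting via homoscedastic uncertainty (Kendall et al., 2018).

    eta_mlm and eta_dem are learned log-variances (eta_t = log sigma_t^2)
    for the MLM and demographic-prediction losses, respectively.
    """

    def __init__(self):
        super().__init__()
        self.eta_mlm = nn.Parameter(torch.zeros(()))
        self.eta_dem = nn.Parameter(torch.zeros(()))

    def forward(self, loss_mlm: torch.Tensor, loss_dem: torch.Tensor) -> torch.Tensor:
        # Adjusted loss per task: 0.5 * (exp(-eta_t) * L_t + eta_t), cf. Eq. (2).
        adj_mlm = 0.5 * (torch.exp(-self.eta_mlm) * loss_mlm + self.eta_mlm)
        adj_dem = 0.5 * (torch.exp(-self.eta_dem) * loss_dem + self.eta_dem)
        return adj_mlm + adj_dem
```

The η parameters are optimized jointly with the PLM weights, so the effective task weights adapt as the per-task uncertainties change during training.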
3 Experimental Setup
Here we describe evaluation tasks and provide details on the data used for demographic specialization and downstream evaluation.
Evaluation Tasks. We follow Hovy (2015) and measure the effects of demographic specialization of PLMs on three text-classification tasks, coupling intrinsic demographic attribute classification (AC) with two extrinsic text classification tasks: sentiment analysis (SA) and topic detection (TD). As an intrinsic evaluation task, AC directly tests if the intermediate demographic specialization results in a PLM that can be more effectively fine-tuned to predict the same demographic classes used in the intermediate specialization: the PLMs (vanilla PLM and our demographically specialized counterpart) are fine-tuned in a supervised fashion to predict the demographic class (gender or age) of the text author. SA is a ternary classification task in which reviews with ratings of 1, 3, and 5 stars represent instances of the negative, neutral, and positive class, respectively. TD classifies texts into 5 different topic categories. We report the F1-measure for each task following Hovy (2015).
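Purely as an illustration of the task setup, the snippet below maps star ratings to the ternary sentiment labels described above and scores predictions with F1; the macro-averaging mode is our assumption, not a detail stated here.

```python
# Illustrative only: mapping star ratings to ternary sentiment labels and
# scoring predictions with F1 (macro-averaging is an assumption).
from sklearn.metrics import f1_score

rating_to_label = {1: "negative", 3: "neutral", 5: "positive"}

y_true = [rating_to_label[r] for r in [1, 5, 3, 5, 1]]
y_pred = ["negative", "positive", "positive", "positive", "neutral"]

print(f1_score(y_true, y_pred, average="macro"))
```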
Data. We carry out our core experimentation on the multilingual demographically labeled dataset of reviews (Hovy et al., 2015), created from the internationally popular user review website Trustpilot.² For comparison and consistency, we work with exactly the same data portions as Hovy (2015): collections that cover (1) the two most prominent demographic dimensions – gender and age, with two categories each (gender: male or female; age: below 35 or above 45³) – and (2) five countries (four languages): United States (US), Denmark, Germany, France, and United Kingdom (UK).

To avoid any information leakage, we ensure – for each country-demographic dimension collection (e.g., US, gender) – that there is zero overlap between the portions we select for intermediate demographic specialization and the portions used for downstream fine-tuning and evaluation (for AC, SA, and TD). For TD, we aim to eliminate the confounding effect of demographically-conditioned label distributions (e.g., female authors wrote reviews for clothing stores more frequently than male authors; vice versa for electronics & technology). To this effect, we select, for each country, reviews from the five most frequent topics and sample the same number of reviews in each topic for both demographic groups (i.e., male and female for gender; below 35 and above 45 for age). For the intrinsic AC task (i.e., fine-tuning to predict either the gender or age category), we report the results for two different review collections: the first is the set of reviews that have, besides the demographic classes, been annotated with sentiment labels (we refer to this as AC-SA), and the second is the set of reviews that have topic labels (i.e., product/service category; we refer to this portion as AC-TD). For these fine-tuning and evaluation datasets, we make sure that the two demographic classes (male and female for gender, below 35 and above 45 for age) are equally represented in each dataset portion (train, development, and test). Table 1 displays the numbers of reviews for each country, demographic aspect, and dataset portion (specialization vs. fine-tuning).

² https://www.trustpilot.com/
³ As suggested by Hovy (2015), the split for the age ranges results in roughly equally-sized data sets for each sub-group and is non-contiguous, avoiding fuzzy boundaries.
For intermediate specialization of the multilingual model, we randomly sample 100K instances per demographic group from the gender specialization portion and 50K instances each from the texts reserved for age specialization, concatenated across all 5 countries. For the specialization of monolingual PLMs, we randomly sample the same number of instances but from the specialization portions of a single country. Following the established procedure (e.g., Devlin et al., 2019; Liu et al., 2019), we dynamically mask 15% of the tokens in the demographic specialization portions for MLM.

                               gender                                          age
Country   Language   Spec. F     Spec. M     SA, AC-SA   TD, AC-TD   Spec. <35   Spec. >45   SA, AC-SA   TD, AC-TD
Denmark   Danish     1,596,816   2,022,349   250,485     120,805     833,657     494,905     75,300      44,815
France    French     489,778     614,495     67,305      55,570      40,448      36,182      6,570       6,120
Germany   German     210,718     284,399     28,920      30,580      66,342      47,308      5,865       8,040
UK        English    1,665,167   1,632,894   156,630     183,995     231,905     274,528     26,325      22,095
US        English    575,951     778,877     72,270      61,585      124,924     70,015      6,495       12,090

Table 1: Number of instances in different portions of the Trustpilot dataset (Hovy et al., 2015) used in our experiments. For each country (Denmark, France, Germany, UK, and US), we report the size of the specialization and fine-tuning portions, the latter for each of the two extrinsic tasks: Sentiment Analysis (SA) and Topic Detection (TD). Note that we use the same SA and TD reviews for the intrinsic AC tasks of predicting the demographic categories (denoted AC-SA and AC-TD, respectively). Numbers are shown separately for the two demographic dimensions: gender and age. For the fine-tuning datasets (for SA/AC-SA and TD/AC-TD), we indicate the number of instances in each category (which is the same for both categories: F and M for gender, <35 and >45 for age). We split the fine-tuning datasets randomly into train, validation, and test portions in a 60/20/20 ratio.
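The balanced sampling described above could be sketched roughly as follows; this is our own simplified illustration, and the field names (e.g., "gender") are hypothetical rather than the actual dataset schema.

```python
# Illustrative sketch of balanced per-group sampling for the specialization
# corpora (field names and sizes are placeholders).
import random
from collections import defaultdict

def sample_balanced(reviews, group_field, per_group, seed=42):
    """Sample the same number of reviews from each demographic group.

    reviews: list of dicts, e.g. {"text": "...", "gender": "F", "age_group": "<35"}
    group_field: "gender" or "age_group"
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for review in reviews:
        by_group[review[group_field]].append(review)
    sampled = []
    for group, items in by_group.items():
        sampled.extend(rng.sample(items, per_group))
    return sampled

# E.g., 100K reviews per gender group for the multilingual specialization corpus:
# specialization_corpus = sample_balanced(all_reviews, "gender", per_group=100_000)
```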
Pre-trained language models. Given that we experiment with Trustpilot data in four different languages, in our core experiments we resorted to multilingual BERT (mBERT)⁴ (Devlin et al., 2019) as the starting PLM. This allows us to merge the (fairly large) specialization portions of Trustpilot in different languages (see Table 1) and run a single multilingual demographic specialization procedure on the combined multilingual review corpus. We then fine-tune the demographically specialized mBERT and evaluate downstream task performance separately for each of the five countries (using the train, development, and test portions of the respective country). We report the results for two different variants of our dynamic multi-task demographic specialization (DS): (1) when the demographic category is predicted from representations of masked tokens (DS-Tok) and (2) when we predict the demographic category from the encoding of the whole sequence (i.e., the review; this version is denoted DS-Seq). We compare these demographically specialized PLM variants against two baselines: the vanilla PLM and a PLM specialized on the same review corpora as our MTL variants but only via MLM-ing (i.e., without providing the demographic signal).

⁴ We load the bert-base-multilingual-cased weights from HuggingFace Transformers.
Training and Optimization. In demographic specialization training, we fix the maximum sequence length to 128 subword tokens. We train for 30 epochs in batches of 32 instances and search for the optimal learning rate among the following values: $\{5\cdot10^{-5}, 1\cdot10^{-5}, 1\cdot10^{-6}\}$. We apply early stopping based on the development set performance: we stop if the joint MTL loss does not improve for 3 epochs. For downstream fine-tuning and evaluation, we train for a maximum of 20 epochs in batches of 32. We search for the optimal learning rate among the following values: $\{5\cdot10^{-5}, 1\cdot10^{-5}, 5\cdot10^{-6}, 1\cdot10^{-6}\}$ and apply early stopping based on the validation set performance (patience: 5 epochs). We use AdamW (Loshchilov and Hutter, 2019) as the optimization algorithm.
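As a rough sketch of one downstream fine-tuning configuration (using HuggingFace's TrainingArguments; the output directory, dataset wiring, and exact argument choices are our own illustration and may differ from the released code):

```python
# Illustrative sketch of one configuration from the fine-tuning search space.
from transformers import EarlyStoppingCallback, TrainingArguments

learning_rates = [5e-5, 1e-5, 5e-6, 1e-6]    # search space described above

args = TrainingArguments(
    output_dir="finetune-sa",                # placeholder name
    per_device_train_batch_size=32,
    num_train_epochs=20,
    learning_rate=learning_rates[1],
    evaluation_strategy="epoch",             # argument name may vary across versions
    save_strategy="epoch",
    load_best_model_at_end=True,             # keep the checkpoint best on the dev set
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)

# These would be passed to a Trainer (default optimizer: AdamW) together with
# the (demographically specialized) PLM and the task-specific train/dev datasets.
```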
4 Results and Discussion
We first discuss the results of multilingual demographic specialization with mBERT as the PLM (§4.1). We then provide a series of control experiments in which we isolate the effects that contribute to performance gains of demographically specialized PLMs (§4.2).

4.1 Multilingual Specialization Results

Table 2 shows the results of gender- and age-specialized mBERT variants – DS-Seq and DS-Tok – on gender and age classification (AC-SA and AC-TD) as intrinsic tasks, together with sentiment analysis (SA) and topic detection (TD) as extrinsic evaluation tasks, for each of the five countries encompassed by the Trustpilot datasets (Hovy et al., 2015). The performance of DS-Seq and DS-Tok is compared against the PLM baselines that have not been exposed to demographic information: vanilla mBERT and mBERT with additional MLM-ing on the same Trustpilot data on which DS-Seq and DS-Tok were trained.

Our demographically specialized models generally outperform the vanilla mBERT across the