
Discovering Differences in the Representation of
People using Contextualized Semantic Axes
Li Lucy, Divya Tadimeti, and David Bamman
University of California, Berkeley
{lucy3_li, dtadimeti, dbamman}@berkeley.edu
Abstract
A common paradigm for identifying semantic differences across social and temporal contexts is the use of static word embeddings and their distances. In particular, past work has compared embeddings against “semantic axes” that represent two opposing concepts. We extend this paradigm to BERT embeddings, and construct contextualized axes that mitigate the pitfall where antonyms have neighboring representations. We validate and demonstrate these axes on two people-centric datasets: occupations from Wikipedia, and multi-platform discussions in extremist, men’s communities over fourteen years. In both studies, contextualized semantic axes can characterize differences among instances of the same word type. In the latter study, we show that references to women and the contexts around them have become more detestable over time.
Warning: This paper contains content that may be offensive or upsetting.
1 Introduction
Quantifying and describing the nature of language differences is key to measuring the impact of social and cultural factors on text. Past work has compared English embeddings for people to adjectives or concepts (Garg et al., 2018; Mendelsohn et al., 2020; Charlesworth et al., 2022), or projected embeddings against axes representing contrasting attributes (Turney and Littman, 2003; An et al., 2018; Kozlowski et al., 2019; Field and Tsvetkov, 2019; Mathew et al., 2020; Kwak et al., 2021; Lucy and Bamman, 2021b; Fraser et al., 2021; Grand et al., 2022). Static representations for the same word can also be juxtaposed across corpora that reflect different time periods (Gonen et al., 2020; Hamilton et al., 2016). This paradigm of using embedding distances to uncover socially meaningful patterns has also transferred over to studies that measure biases in contextualized embeddings, such as Wolfe and Caliskan (2021)’s finding that BERT embeddings of less frequent minority names are closer to words related to unpleasantness.
Figure 1: An axis (e.g. beautiful–ugly) is constructed using embeddings of adjectives in selected contexts. These contexts are predictive of synonyms, but not antonyms, of the target adjective during masked language modeling. Token-level embeddings for people are then projected onto this axis.
The use of “semantic axes” is enticing in that it offers an interpretable measurement of word differences beyond a single similarity value (Turney and Littman, 2003; An et al., 2018; Kozlowski et al., 2019; Kwak et al., 2021). Words are projected onto axes where the poles represent antonymous concepts (such as beautiful–ugly), and the projected embedding’s location along the axis indicates how similar it is to either concept. Semantic axes constructed using static, type-based embeddings have been used to analyze socially meaningful differences, such as words’ associations with class (Kozlowski et al., 2019), or gender stereotypes in narratives (Huang et al., 2021; Lucy and Bamman, 2021b).
Our work investigates the extension and application of semantic axes to contextualized embeddings. We present a novel approach for constructing semantic axes with English BERT embeddings (Figure 1). These axes are built to encourage self-consistency, where antonymous poles are less conflated with each other. They are able to capture semantic differences across word types as well as variation in a single word across contexts. Their ability to differentiate contexts makes them suitable for studying how a word changes across domains or across individual sentences. These axes are also more self-consistent and coherent than ones created using GloVe and other baseline approaches.
arXiv:2210.12170v1 [cs.CL] 21 Oct 2022
We demonstrate the use of contextualized axes on two datasets: occupations from Wikipedia, and people discussed in misogynistic online communities. We use the former as a case where terms appear in definitional contexts, and characteristics of people are well-known. In the latter longitudinal, cross-platform case study, we examine lexical choices made by communities whose attitudes towards women tend to be salient and extreme. We chose this set of online communities as a substantive use case of our method, in light of recent attention in web science on analyzing online extremism and hate at scale (e.g. Ribeiro et al., 2021b,a; Aliapoulios et al., 2021). There, we analyze language change and variation along axes through a sociolinguistic lens, emphasizing that speakers use language that reflects their social identities and beliefs (CH-Wang and Jurgens, 2021; Huffaker and Calvert, 2017; Card et al., 2016; Lakoff and Ferguson, 2006).
Our code, vocabularies, and other resources can be found in our GitHub repo: https://github.com/lucy3/context_semantic_axes.
2 Constructing semantic axes
Static embeddings. Several formulae for calculating the similarity of a target word to two sets of pole words have been proposed in prior work on static semantic axes. These differ in whether they take the difference between a target word’s similarities to each pole (Turney and Littman, 2003), calculate a target word’s similarity to the difference between pole averages (An et al., 2018; Kwak et al., 2021), or calculate a target word’s similarity to the average of several word pair differences that represent the same antonymous relationship (Kozlowski et al., 2019). We build on the approach of An et al. (2018) and Kwak et al. (2021), because it does not require us to curate multiple paired antonyms for each axis, and it draws out the difference between two concepts before a target word is compared to them, rather than after. We define an axis V containing antonymous sets of adjective vectors, S_l = {l_1, l_2, l_3, ..., l_n} and S_r = {r_1, r_2, r_3, ..., r_m}, as the following:

V = (1/n) Σ_{i=1}^{n} l_i - (1/m) Σ_{j=1}^{m} r_j
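This construction can be sketched in a few lines of numpy. The helper names below (`axis_vector`, `axis_similarity`) are ours, and the sketch assumes the pole embeddings have already been collected into arrays:

```python
import numpy as np

def axis_vector(left_pole, right_pole):
    """V = mean of left-pole vectors minus mean of right-pole vectors."""
    return np.mean(left_pole, axis=0) - np.mean(right_pole, axis=0)

def axis_similarity(word_vec, V):
    """Cosine similarity to the axis; positive values lean toward the left pole."""
    return float(np.dot(word_vec, V) / (np.linalg.norm(word_vec) * np.linalg.norm(V)))
```

A word vector near the left pole’s centroid then projects positively onto V, and one near the right pole projects negatively.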
Relying on single-word poles for axes can be unstable to the choice of each word (An et al., 2018; Antoniak and Mimno, 2021). An et al. (2018) creates a pole’s set of words using the nearest neighbors of a seed word, which may risk conflating unintended meanings or antonymous neighbors (Mrkšić et al., 2016; Sedoc et al., 2017). For example, one axis uses the opposite seed words green and experienced, but green’s nearest neighbors include red rather than inexperienced. Instead of using this nearest neighbors approach, we construct poles using WordNet antonym relations. Each end of an axis aggregates synonymous and similar lemmas in WordNet synsets, which are expanded using the similar to relation (Miller, 1992).
Our type-based embedding baseline, GLOVE, uses 300-dimensional GloVe vectors pretrained on Wikipedia and Gigaword (Pennington et al., 2014). We only keep poles where both sides have at least three adjectives that appear in the GloVe vocabulary, and we also exclude acronyms, which are often more ambiguous in meaning. We start with 723 axes, where poles have on average 9.63 adjectives each.
Contextualized embeddings. Static embeddings, however, present a number of limitations. Such embeddings cannot easily handle polysemy or homonymy (Wiedemann et al., 2019), and even when they are trained on different social or temporal contexts, they require additional steps to be aligned (Gonen et al., 2020). Context-specific embeddings also need enough training examples of target words to create usable representations. These limitations prevent the analysis of token-based semantic variation, such as measuring how one mention of a word is more or less beautiful than another. Our main contribution of contextualized axes uses the same WordNet-based formulation as our GloVe baseline. Rather than each word in S_l or S_r being represented by a single GloVe embedding, we obtain BERT embeddings over multiple occurrences of each adjective. We use BERT-base, as this model is small enough for efficient application on large datasets and is popular in previous work on semantic change and differences (e.g. Hu et al., 2019; Lucy and Bamman, 2021a; Giulianelli et al., 2020; Zhou et al., 2022; Coll Ardanuy et al., 2020; Martinc et al., 2020). It is also used in tutorials for researchers outside of NLP, which means it has high potential use in computational social science and cultural analytics (Mimno et al., 2022).
For contextualized axes, we obtain a potential pool of contexts for adjectives sampled over all of Wikipedia from December 21, 2021, preprocessed using Attardi (2015)’s text extractor. This sample contains up to 1000 sentences, or contexts, that contain each adjective, and we avoid contexts that are too short (under 10 tokens) or too long (over 150 tokens).1
We experiment with two methods of obtaining contextualized BERT embeddings for each adjective: a random “default” (BERT-DEFAULT) and one where contexts are picked based on word probabilities (BERT-PROB). For BERT-DEFAULT, we take a random sample of 100 contextualized embeddings across the adjectives in each pole. Since words can be nearest neighbors with their antonyms in semantic space (Mrkšić et al., 2016; Sedoc et al., 2017), our main approach, BERT-PROB, aggregates word embeddings over contexts that highlight contrasting meanings of axes’ poles.
To select contexts, we mask out the target adjective in each of its 1000 sentences, and have BERT-base predict the probabilities of synonyms and antonyms for that masked token. We remove contexts where the average probability of antonyms is greater than that of synonyms, sort by average synonym probability, and take the top 100 contexts. One limitation of our approach is that predictions are restricted to adjectives that can be represented by one wordpiece token. If none of the words on a pole of an axis appear in BERT’s vocabulary, we back off to BERT-DEFAULT to represent that axis.
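The filtering step above can be sketched as follows. This sketch assumes the masked-token probabilities for each candidate adjective have already been computed with BERT’s masked language modeling head; the function and argument names are ours:

```python
def select_contexts(contexts, synonyms, antonyms, k=100):
    """contexts: list of (sentence, probs) pairs, where probs maps each
    candidate adjective to its predicted probability at the masked slot.
    Keeps contexts that favor synonyms over antonyms, sorted by average
    synonym probability, truncated to the top k."""
    kept = []
    for sentence, probs in contexts:
        syn = sum(probs.get(w, 0.0) for w in synonyms) / len(synonyms)
        ant = sum(probs.get(w, 0.0) for w in antonyms) / len(antonyms)
        if ant <= syn:  # drop contexts where antonyms are more probable
            kept.append((syn, sentence))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in kept[:k]]
```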
For each axis type, we also have versions where words’ embeddings are z-scored, which has been shown to improve BERT’s alignment with humans’ word similarity judgements (Timkey and van Schijndel, 2021). For z-scoring, we calculate mean and standard deviation BERT embeddings from a sample of around 370k whole words from Wikipedia. As recommended by Bommasani et al. (2020), we use mean pooling over wordpieces to produce word representations when necessary, and we extend this approach to create bigram representations as well. These embeddings are a concatenation of the last four layers of BERT, as these tend to capture more context-specific information (Ethayarajh, 2019).
1 This length cutoff made the data more manageable, and 90% of BERT’s training steps were originally on 128-length sequences (Devlin et al., 2019).
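A minimal numpy sketch of the pooling and z-scoring steps described above; the helper names are ours, and layer_states stands in for BERT’s per-layer hidden states:

```python
import numpy as np

def word_embedding(layer_states, piece_idx):
    """Concatenate the last four hidden layers, then mean-pool over the
    wordpiece positions that make up the word.
    layer_states: array of shape (num_layers, seq_len, hidden)
    piece_idx: list of token positions belonging to the word."""
    last4 = np.concatenate([layer_states[i] for i in range(-4, 0)], axis=-1)
    return last4[piece_idx].mean(axis=0)

def z_score(vec, mu, sigma):
    """Standardize an embedding with corpus-level mean and std vectors."""
    return (vec - mu) / sigma
```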
Method | Average C | # of consistent axes
GLOVE | 0.101 (0.006) | 503
BERT-DEFAULT | 0.084 (0.006) | 393
BERT-DEFAULT_z | 0.111 (0.007) | 468
BERT-PROB | 0.101 (0.006) | 436
BERT-PROB_z | 0.133 (0.007) | 512
Table 1: A table of C, averaged across poles, with 95% confidence intervals (CI) in parentheses. The z subscript represents z-scored approaches.
3 Internal validation
We internally validate our axes for self-consistency. For each axis, we remove one adjective’s embeddings from either side, and compute its cosine similarity to the axis constructed from the remaining adjectives. For BERT approaches, we average the adjective’s multiple embeddings to produce only one before computing its similarity to the axis. In a “consistent” axis, a left-out adjective should be closer to the pole it belongs to. That is, if it belongs to S_l, its similarity to the axis should be positive. We average these leave-one-out similarities for each pole, negating the score when the adjective belongs to S_r, to produce a consistency metric, C. Table 1 shows C for different axis-building methods.2 An axis is “consistent” if both of its poles have C ≥ 0.
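The leave-one-out computation of C can be sketched as below, assuming each pole is a list of (averaged) adjective embeddings; the function name is ours:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency(left_pole, right_pole):
    """Average leave-one-out similarity of each adjective to the axis
    built from the remaining adjectives; scores for right-pole words
    are negated, so a consistent pole yields a positive average."""
    left, right = np.array(left_pole), np.array(right_pole)
    scores = []
    for i in range(len(left)):
        rest = np.delete(left, i, axis=0)
        V = rest.mean(axis=0) - right.mean(axis=0)
        scores.append(cosine(left[i], V))
    for j in range(len(right)):
        rest = np.delete(right, j, axis=0)
        V = left.mean(axis=0) - rest.mean(axis=0)
        scores.append(-cosine(right[j], V))
    return float(np.mean(scores))
```

An axis whose poles are well separated yields C near 1, while a pole containing an antonym-like stray word drags C down.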
GLOVE’s most inconsistent axis poles often involve directions, such as east–west, left-handed–right-handed, and right–left. These concepts may be difficult to learn from text without grounding. We find that the various BERT approaches’ most inconsistent axes include direction-related ones as well, but they also struggle to separate concepts such as lower-class–upper-class.
The best method for producing consistent axes is z-scored BERT-PROB, with a significant difference in C from z-scored BERT-DEFAULT and GLOVE (Mann-Whitney U-test, p < 0.001). It also produces the highest number of consistent axes. GLOVE presents itself as a formidable baseline,3 and BERT-DEFAULT struggles in comparison to it.
2 We assign C to 0 if only one unique adjective’s contexts are chosen to create a pole for BERT-PROB, because in that case, we are unable to run the leave-one-out test for that pole.
3 We also tried z-scoring GLOVE embeddings, but this worsened internal consistency (C = 0.098).
4 External validation
Previous work on static semantic axes validates them using sentiment lexicons, exploratory analyses, and human-reported associations (An et al., 2018; Kwak et al., 2021; Kozlowski et al., 2019). We perform external validation of self-consistent axes on a dataset where people appear in a variety of well-defined and known contexts: occupations from Wikipedia. We conduct two main experiments. In the first, we test whether contextualized axes can detect differences across occupation terms, and in the second, we investigate whether they can detect differences across contexts.
Category | Occupation Experiment | Person Experiment
Writing | creative, fanciful, fictive; formal, logical, discursive | + folksy, unceremonious, casual; + ignoble, common, plebeian
Entertainment | transcribed, taped, recorded; structural, constructive, creative | + trademarked, branded, copyrighted; + emotional, soupy, slushy
Art | unostentatious, aesthetic, artistic; creative, fanciful, fictive | + activist, active, hands-on; + practiced, proficient, adept
Health | unhealthy, pathologic, asthmatic; rehabilitative, structural, constructive | + confirmable, empirical, experiential; + teetotal, dry, drug-free
Agriculture | drifting, mobile, unsettled; rustic, agrarian, bucolic | + boneless, deboned, boned; - rehabilitative, structural, constructive
Government | amenable, answerable, responsible; policy-making, political, governmental | + respectful, deferential, honorific; + amenable, answerable, responsible
Sports | spry, gymnastic, sporty; zealous, ardent, enthusiastic | - amenable, answerable, responsible; - subject, subservient, dependent
Engineering | formal, logical, discursive; rehabilitative, structural, constructive | + coeducational, integrated, mixed; + advanced, high, graduate
Science | humanistic, humane, human-centered; zealous, ardent, enthusiastic | + humanistic, humane, human-centered; + stoic, unemotional, chilly
Math & Statistics | enumerable, estimable, calculable; formal, logical, discursive | + enumerable, estimable, calculable; - amenable, answerable, responsible
Social Sciences | humanistic, humane, human-centered; relational, relative, comparative | + significant, portentous, probative; + humanistic, humane, human-centered
Table 2: The top two z-scored BERT-PROB axis poles, ordered from left to right, for each occupation category and experiment. Each pole is represented by three example adjectives drawn from the set used to construct that pole. Since the person experiment compares each occupation category to all others, + or - indicates the direction of the shift in axis similarity. For example, sports occupations are still closer to responsible than irresponsible, just less so (-) than other occupations.
4.1 Data
We collect eleven categories of unigram and bigram occupations from Wikipedia lists: Writing, Entertainment, Art, Health, Agriculture, Government, Sports, Engineering, Science, Math & Statistics, and Social Sciences (Appendix A). The number of occupations per category ranges from 3 in Math & Statistics to 48 in Entertainment, with an average of 27.2. We use the MediaWiki API to find Wikipedia pages for occupations in each list if they exist, and follow redirects when necessary (e.g. Blogger redirects to Blog). For each occupation’s singular form, we extract the sentences in its page that contain it. In total, we have 3,015 sentences for 300 occupations.
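The sentence-extraction step can be sketched with a simple regex split. This is an illustrative approximation under our own assumptions (rough sentence boundaries, case-insensitive whole-word matching), not the paper’s exact preprocessing:

```python
import re

def sentences_with_term(page_text, term):
    """Split page text into rough sentences and keep those containing
    the occupation's singular form as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    sentences = re.split(r"(?<=[.!?])\s+", page_text)
    return [s for s in sentences if pattern.search(s)]
```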
4.2 Term-level experiment (occupations)
Each occupation is represented by a pre-trained GloVe embedding or a BERT embedding averaged over all occurrences on its page. If an axis uses z-scored adjective embeddings, we also z-score the occupation embeddings compared to it. We assign poles to occupations based on which side of the axis they are closer to via cosine similarity. Top poles are highly related to their target occupation category, as seen by the examples for z-scored BERT-PROB in Table 2.
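The pole-assignment step can be sketched as below; `axes` is a hypothetical mapping from axis names to (axis vector, left label, right label) triples, and the function name is ours:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_poles(occ_vec, axes, k=2):
    """Rank axes by |cosine similarity| and report, for each of the
    top k, the pole the occupation embedding falls closer to."""
    scored = []
    for name, (V, left_label, right_label) in axes.items():
        sim = cosine(occ_vec, V)
        pole = left_label if sim > 0 else right_label
        scored.append((abs(sim), name, pole))
    scored.sort(reverse=True)
    return [(name, pole) for _, name, pole in scored[:k]]
```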
Method | Occupation Experiment | Person Experiment
GLOVE | 3.485 (±0.491) | -
BERT-DEFAULT | 3.576 (±0.429) | 2.697 (±0.361)
BERT-DEFAULT_z | 2.636 (±0.459) | 2.485 (±0.367)
BERT-PROB | 3.333 (±0.473) | 2.667 (±0.363)
BERT-PROB_z | 1.970 (±0.297) | 2.152 (±0.404)
Table 3: Average rank of each axis-building method for each experiment, across human evaluators and occupation categories. 95% CI in parentheses.
One limitation for interpretability is that word embeddings’ proximity can reflect any type of semantic association, not just that a person actually has the attributes of an adjective. For example, adjectives related to unhealthy are highly associated with Health occupations, which can be explained by doctors working in environments where unhealthiness is prominent. Therefore, embedding distances only provide a foggy window into the nature of words, and this ambiguity should be considered when interpreting word similarities and their implications. This limitation applies to both static embeddings and their contextualized counterparts.
We conduct human evaluation on this task of using semantic axes to differentiate and characterize occupations. Three student annotators examined the top three poles retrieved by each axis-building approach and ranked these outputs based on semantic relatedness to occupation categories (Appendix B). These annotators had fair agreement, with an average Kendall’s W of 0.629 across categories and experiments. Though GLOVE is a competitive baseline, z-scored BERT-PROB is the highest-ranked approach overall (Table 3). This suggests that more self-consistent axes also produce measurements that better reflect human judgements of occupations’ general meaning.