
Discovering Differences in the Representation of
People using Contextualized Semantic Axes
Li Lucy, Divya Tadimeti, and David Bamman
University of California, Berkeley
{lucy3_li, dtadimeti, dbamman}@berkeley.edu
Abstract
A common paradigm for identifying semantic differences across social and temporal contexts is the use of static word embeddings and their distances. In particular, past work has compared embeddings against “semantic axes” that represent two opposing concepts. We extend this paradigm to BERT embeddings, and construct contextualized axes that mitigate the pitfall where antonyms have neighboring representations. We validate and demonstrate these axes on two people-centric datasets: occupations from Wikipedia, and multi-platform discussions in extremist, men’s communities over fourteen years. In both studies, contextualized semantic axes can characterize differences among instances of the same word type. In the latter study, we show that references to women and the contexts around them have become more detestable over time.
Warning: This paper contains content that may be offensive or upsetting.
1 Introduction
Quantifying and describing the nature of language differences is key to measuring the impact of social and cultural factors on text. Past work has compared English embeddings for people to adjectives or concepts (Garg et al., 2018; Mendelsohn et al., 2020; Charlesworth et al., 2022), or projected embeddings against axes representing contrasting attributes (Turney and Littman, 2003; An et al., 2018; Kozlowski et al., 2019; Field and Tsvetkov, 2019; Mathew et al., 2020; Kwak et al., 2021; Lucy and Bamman, 2021b; Fraser et al., 2021; Grand et al., 2022). Static representations for the same word can also be juxtaposed across corpora that reflect different time periods (Gonen et al., 2020; Hamilton et al., 2016). This paradigm of using embedding distances to uncover socially meaningful patterns has also transferred over to studies that measure biases in contextualized embeddings, such as Wolfe and Caliskan (2021)’s finding that BERT embeddings of less frequent minority names are closer to words related to unpleasantness.
Figure 1: An axis (e.g. beautiful–ugly) is constructed using embeddings of adjectives in selected contexts. These contexts are predictive of synonyms, but not antonyms, of the target adjective during masked language modeling. Token-level embeddings for people are then projected onto this axis.
The use of “semantic axes” is enticing in that it offers an interpretable measurement of word differences beyond a single similarity value (Turney and Littman, 2003; An et al., 2018; Kozlowski et al., 2019; Kwak et al., 2021). Words are projected onto axes where the poles represent antonymous concepts (such as beautiful–ugly), and the projected embedding’s location along the axis indicates how similar it is to either concept. Semantic axes constructed using static, type-based embeddings have been used to analyze socially meaningful differences, such as words’ associations with class (Kozlowski et al., 2019), or gender stereotypes in narratives (Huang et al., 2021; Lucy and Bamman, 2021b).
Our work investigates the extension and application of semantic axes to contextualized embeddings. We present a novel approach for constructing semantic axes with English BERT embeddings (Figure 1). These axes are built to encourage self-consistency, where antonymous poles are less conflated with each other. They are able to capture semantic differences across word types as well as variation in a single word across contexts. Their ability to differentiate contexts makes them suitable for studying how a word changes across domains or across individual sentences. These axes are also more self-consistent and coherent than ones created using GloVe and other baseline approaches.
arXiv:2210.12170v1 [cs.CL] 21 Oct 2022
We demonstrate the use of contextualized axes on two datasets: occupations from Wikipedia, and people discussed in misogynistic online communities. We use the former as a case where terms appear in definitional contexts, and characteristics of people are well-known. In the latter longitudinal, cross-platform case study, we examine lexical choices made by communities whose attitudes towards women tend to be salient and extreme. We chose this set of online communities as a substantive use case of our method, in light of recent attention in web science on analyzing online extremism and hate at scale (e.g. Ribeiro et al., 2021b,a; Aliapoulios et al., 2021). There, we analyze language change and variation along axes through a sociolinguistic lens, emphasizing that speakers use language that reflects their social identities and beliefs (CH-Wang and Jurgens, 2021; Huffaker and Calvert, 2017; Card et al., 2016; Lakoff and Ferguson, 2006).
Our code, vocabularies, and other resources can be found in our GitHub repo: https://github.com/lucy3/context_semantic_axes.
2 Constructing semantic axes
Static embeddings. Several formulae for calculating the similarity of a target word to two sets of pole words have been proposed in prior work on static semantic axes. These differ in whether they take the difference between a target word’s similarities to each pole (Turney and Littman, 2003), calculate a target word’s similarity to the difference between pole averages (An et al., 2018; Kwak et al., 2021), or calculate a target word’s similarity to the average of several word pair differences that represent the same antonymous relationship (Kozlowski et al., 2019). We build on the approach of An et al. (2018) and Kwak et al. (2021), because it does not require us to curate multiple paired antonyms for each axis, and it draws out the difference between two concepts before a target word is compared to them, rather than after. We define an axis V containing antonymous sets of adjective vectors, S_l = {l_1, l_2, l_3, ..., l_n} and S_r = {r_1, r_2, r_3, ..., r_m}, as the following:

V = (1/n) Σ_{i=1}^{n} l_i - (1/m) Σ_{j=1}^{m} r_j
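This construction can be sketched in a few lines of numpy. The helper names below (`axis_vector`, `axis_similarity`) are ours, and the sketch assumes the pole embeddings have already been collected into arrays:

```python
import numpy as np

def axis_vector(left_pole, right_pole):
    """V = mean of left-pole vectors minus mean of right-pole vectors."""
    return np.mean(left_pole, axis=0) - np.mean(right_pole, axis=0)

def axis_similarity(word_vec, V):
    """Cosine similarity to the axis; positive values lean toward the left pole."""
    return float(np.dot(word_vec, V) / (np.linalg.norm(word_vec) * np.linalg.norm(V)))
```

A word vector near the left pole’s centroid then projects positively onto V, and one near the right pole projects negatively.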
Relying on single-word poles for axes can be unstable to the choice of each word (An et al., 2018; Antoniak and Mimno, 2021). An et al. (2018) creates a pole’s set of words using the nearest neighbors of a seed word, which may risk conflating unintended meanings or antonymous neighbors (Mrkšić et al., 2016; Sedoc et al., 2017). For example, one axis uses the opposite seed words green and experienced, but green’s nearest neighbors include red rather than inexperienced. Instead of using this nearest neighbors approach, we construct poles using WordNet antonym relations. Each end of an axis aggregates synonymous and similar lemmas in WordNet synsets, which are expanded using the similar to relation (Miller, 1992).
Our type-based embedding baseline, GLOVE, uses 300-dimensional GloVe vectors pretrained on Wikipedia and Gigaword (Pennington et al., 2014). We only keep poles where both sides have at least three adjectives that appear in the GloVe vocabulary, and we also exclude acronyms, which are often more ambiguous in meaning. We start with 723 axes, where poles have on average 9.63 adjectives each.
Contextualized embeddings. Static embeddings, however, present a number of limitations. Such embeddings cannot easily handle polysemy or homonymy (Wiedemann et al., 2019), and even when they are trained on different social or temporal contexts, they require additional steps to be aligned (Gonen et al., 2020). Context-specific embeddings also need enough training examples of target words to create usable representations. These limitations prevent the analysis of token-based semantic variation, such as measuring how one mention of a word is more or less beautiful than another. Our main contribution of contextualized axes uses the same WordNet-based formulation as our GloVe baseline. Rather than each word in S_l or S_r being represented by a single GloVe embedding, we obtain BERT embeddings over multiple occurrences of each adjective. We use BERT-base, as this model is small enough for efficient application on large datasets and is popular in previous work on semantic change and differences (e.g. Hu et al., 2019; Lucy and Bamman, 2021a; Giulianelli et al., 2020; Zhou et al., 2022; Coll Ardanuy et al., 2020; Martinc et al., 2020). It is also used in tutorials for researchers outside of NLP, which means it has high potential use in computational social science and cultural analytics (Mimno et al., 2022).
For contextualized axes, we obtain a potential pool of contexts for adjectives sampled over all of Wikipedia from December 21, 2021, preprocessed using Attardi (2015)’s text extractor. This sample contains up to 1000 sentences, or contexts, that contain each adjective, and we avoid contexts that are too short (under 10 tokens) or too long (over 150 tokens).1
We experiment with two methods of obtaining contextualized BERT embeddings for each adjective: a random “default” (BERT-DEFAULT) and one where contexts are picked based on word probabilities (BERT-PROB). For BERT-DEFAULT, we take a random sample of 100 contextualized embeddings across the adjectives in each pole. Since words can be nearest neighbors with their antonyms in semantic space (Mrkšić et al., 2016; Sedoc et al., 2017), our main approach, BERT-PROB, aggregates word embeddings over contexts that highlight contrasting meanings of axes’ poles.
To select contexts, we mask out the target adjective in each of its 1000 sentences, and have BERT-base predict the probabilities of synonyms and antonyms for that masked token. We remove contexts where the average probability of antonyms is greater than that of synonyms, sort by average synonym probability, and take the top 100 contexts. One limitation of our approach is that predictions are restricted to adjectives that can be represented by one wordpiece token. If none of the words on a pole of an axis appear in BERT’s vocabulary, we back off to BERT-DEFAULT to represent that axis.
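The filtering step above can be sketched as follows. This sketch assumes the masked-token probabilities for each candidate adjective have already been computed with BERT’s masked language modeling head; the function and argument names are ours:

```python
def select_contexts(contexts, synonyms, antonyms, k=100):
    """contexts: list of (sentence, probs) pairs, where probs maps each
    candidate adjective to its predicted probability at the masked slot.
    Keeps contexts that favor synonyms over antonyms, sorted by average
    synonym probability, truncated to the top k."""
    kept = []
    for sentence, probs in contexts:
        syn = sum(probs.get(w, 0.0) for w in synonyms) / len(synonyms)
        ant = sum(probs.get(w, 0.0) for w in antonyms) / len(antonyms)
        if ant <= syn:  # drop contexts where antonyms are more probable
            kept.append((syn, sentence))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in kept[:k]]
```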
For each axis type, we also have versions where words’ embeddings are z-scored, which has been shown to improve BERT’s alignment with humans’ word similarity judgements (Timkey and van Schijndel, 2021). For z-scoring, we calculate mean and standard deviation BERT embeddings from a sample of around 370k whole words from Wikipedia. As recommended by Bommasani et al. (2020), we use mean pooling over wordpieces to produce word representations when necessary, and we extend this approach to create bigram representations as well. These embeddings are a concatenation of the last four layers of BERT, as these tend to capture more context-specific information (Ethayarajh, 2019).
1 This length cutoff made the data more manageable, and 90% of BERT’s training steps were originally on 128-length sequences (Devlin et al., 2019).
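A minimal numpy sketch of the pooling and z-scoring steps described above; the helper names are ours, and layer_states stands in for BERT’s per-layer hidden states:

```python
import numpy as np

def word_embedding(layer_states, piece_idx):
    """Concatenate the last four hidden layers, then mean-pool over the
    wordpiece positions that make up the word.
    layer_states: array of shape (num_layers, seq_len, hidden)
    piece_idx: list of token positions belonging to the word."""
    last4 = np.concatenate([layer_states[i] for i in range(-4, 0)], axis=-1)
    return last4[piece_idx].mean(axis=0)

def z_score(vec, mu, sigma):
    """Standardize an embedding with corpus-level mean and std vectors."""
    return (vec - mu) / sigma
```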
Method | Average C | # of consistent axes
GLOVE | 0.101 (0.006) | 503
BERT-DEFAULT | 0.084 (0.006) | 393
BERT-DEFAULT_z | 0.111 (0.007) | 468
BERT-PROB | 0.101 (0.006) | 436
BERT-PROB_z | 0.133 (0.007) | 512
Table 1: A table of C, averaged across poles, with 95% confidence intervals (CI) in parentheses. The z subscript represents z-scored approaches.
3 Internal validation
We internally validate our axes for self-consistency. For each axis, we remove one adjective’s embeddings from either side, and compute its cosine similarity to the axis constructed from the remaining adjectives. For BERT approaches, we average the adjective’s multiple embeddings to produce only one before computing its similarity to the axis. In a “consistent” axis, a left-out adjective should be closer to the pole it belongs to. That is, if it belongs to S_l, its similarity to the axis should be positive. We average these leave-one-out similarities for each pole, negating the score when the adjective belongs to S_r, to produce a consistency metric, C. Table 1 shows C for different axis-building methods.2 An axis is “consistent” if both of its poles have C ≥ 0.
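The leave-one-out computation of C can be sketched as below, assuming each pole is a list of (averaged) adjective embeddings; the function name is ours:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency(left_pole, right_pole):
    """Average leave-one-out similarity of each adjective to the axis
    built from the remaining adjectives; scores for right-pole words
    are negated, so a consistent pole yields a positive average."""
    left, right = np.array(left_pole), np.array(right_pole)
    scores = []
    for i in range(len(left)):
        rest = np.delete(left, i, axis=0)
        V = rest.mean(axis=0) - right.mean(axis=0)
        scores.append(cosine(left[i], V))
    for j in range(len(right)):
        rest = np.delete(right, j, axis=0)
        V = left.mean(axis=0) - rest.mean(axis=0)
        scores.append(-cosine(right[j], V))
    return float(np.mean(scores))
```

An axis whose poles are well separated yields C near 1, while a pole containing an antonym-like stray word drags C down.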
GLOVE’s most inconsistent axis poles often involve directions, such as east–west, left-handed–right-handed, and right–left. These concepts may be difficult to learn from text without grounding. We find that the various BERT approaches’ most inconsistent axes include direction-related ones as well, but they also struggle to separate concepts such as lower-class–upper-class.
The best method for producing consistent axes is z-scored BERT-PROB, with a significant difference in C from z-scored BERT-DEFAULT and GLOVE (Mann-Whitney U-test, p < 0.001). It also produces the highest number of consistent axes. GLOVE presents itself as a formidable baseline,3 and BERT-DEFAULT struggles in comparison to it.
2 We assign C to 0 if only one unique adjective’s contexts are chosen to create a pole for BERT-PROB, because in that case, we are unable to run the leave-one-out test for that pole.
3 We also tried z-scoring GLOVE embeddings, but this worsened internal consistency (C = 0.098).
4 External validation
Previous work on static semantic axes validates them using sentiment lexicons, exploratory analyses, and human-reported associations (An et al., 2018; Kwak et al., 2021; Kozlowski et al., 2019). We perform external validation of self-consistent axes on a dataset where people appear in a variety of well-defined and known contexts: occupations from Wikipedia. We conduct two main experiments. In the first, we test whether contextualized axes can detect differences across occupation terms, and in the second, we investigate whether they can detect differences across contexts.
Category | Occupation Experiment | Person Experiment
Writing | creative, fanciful, fictive; formal, logical, discursive | + folksy, unceremonious, casual; + ignoble, common, plebeian
Entertainment | transcribed, taped, recorded; structural, constructive, creative | + trademarked, branded, copyrighted; + emotional, soupy, slushy
Art | unostentatious, aesthetic, artistic; creative, fanciful, fictive | + activist, active, hands-on; + practiced, proficient, adept
Health | unhealthy, pathologic, asthmatic; rehabilitative, structural, constructive | + confirmable, empirical, experiential; + teetotal, dry, drug-free
Agriculture | drifting, mobile, unsettled; rustic, agrarian, bucolic | + boneless, deboned, boned; - rehabilitative, structural, constructive
Government | amenable, answerable, responsible; policy-making, political, governmental | + respectful, deferential, honorific; + amenable, answerable, responsible
Sports | spry, gymnastic, sporty; zealous, ardent, enthusiastic | - amenable, answerable, responsible; - subject, subservient, dependent
Engineering | formal, logical, discursive; rehabilitative, structural, constructive | + coeducational, integrated, mixed; + advanced, high, graduate
Science | humanistic, humane, human-centered; zealous, ardent, enthusiastic | + humanistic, humane, human-centered; + stoic, unemotional, chilly
Math & Statistics | enumerable, estimable, calculable; formal, logical, discursive | + enumerable, estimable, calculable; - amenable, answerable, responsible
Social Sciences | humanistic, humane, human-centered; relational, relative, comparative | + significant, portentous, probative; + humanistic, humane, human-centered
Table 2: The top two z-scored BERT-PROB axis poles, ordered from left to right, for each occupation category and experiment. Each pole is represented by three example adjectives drawn from the set used to construct that pole. Since the person experiment compares each occupation category to all others, + or - indicates the direction of the shift in axis similarity. For example, sports occupations are still closer to responsible than irresponsible, just less so (-) than other occupations.
4.1 Data
We collect eleven categories of unigram and bigram occupations from Wikipedia lists: Writing, Entertainment, Art, Health, Agriculture, Government, Sports, Engineering, Science, Math & Statistics, and Social Sciences (Appendix A). The number of occupations per category ranges from 3 in Math & Statistics to 48 in Entertainment, with an average of 27.2. We use the MediaWiki API to find Wikipedia pages for occupations in each list if they exist, and follow redirects when necessary (e.g. Blogger redirects to Blog). For each occupation’s singular form, we extract the sentences in its page that contain it. In total, we have 3,015 sentences for 300 occupations.
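The sentence-extraction step can be sketched with a simple regex split. This is an illustrative approximation under our own assumptions (rough sentence boundaries, case-insensitive whole-word matching), not the paper’s exact preprocessing:

```python
import re

def sentences_with_term(page_text, term):
    """Split page text into rough sentences and keep those containing
    the occupation's singular form as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    sentences = re.split(r"(?<=[.!?])\s+", page_text)
    return [s for s in sentences if pattern.search(s)]
```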
4.2 Term-level experiment (occupations)
Each occupation is represented by a pre-trained GloVe embedding or a BERT embedding averaged over all occurrences on its page. If an axis uses z-scored adjective embeddings, we also z-score the occupation embeddings compared to it. We assign poles to occupations based on which side of the axis they are closer to via cosine similarity. Top poles are highly related to their target occupation category, as seen by the examples for z-scored BERT-PROB in Table 2.
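The pole-assignment step can be sketched as below; `axes` is a hypothetical mapping from axis names to (axis vector, left label, right label) triples, and the function name is ours:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_poles(occ_vec, axes, k=2):
    """Rank axes by |cosine similarity| and report, for each of the
    top k, the pole the occupation embedding falls closer to."""
    scored = []
    for name, (V, left_label, right_label) in axes.items():
        sim = cosine(occ_vec, V)
        pole = left_label if sim > 0 else right_label
        scored.append((abs(sim), name, pole))
    scored.sort(reverse=True)
    return [(name, pole) for _, name, pole in scored[:k]]
```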
Method | Occupation Experiment | Person Experiment
GLOVE | 3.485 (±0.491) | -
BERT-DEFAULT | 3.576 (±0.429) | 2.697 (±0.361)
BERT-DEFAULT_z | 2.636 (±0.459) | 2.485 (±0.367)
BERT-PROB | 3.333 (±0.473) | 2.667 (±0.363)
BERT-PROB_z | 1.970 (±0.297) | 2.152 (±0.404)
Table 3: Average rank of each axis-building method for each experiment, across human evaluators and occupation categories. 95% CI in parentheses.
One limitation for interpretability is that word embeddings’ proximity can reflect any type of semantic association, not just that a person actually has the attributes of an adjective. For example, adjectives related to unhealthy are highly associated with Health occupations, which can be explained by doctors working in environments where unhealthiness is prominent. Therefore, embedding distances only provide a foggy window into the nature of words, and this ambiguity should be considered when interpreting word similarities and their implications. This limitation applies to both static embeddings and their contextualized counterparts.
We conduct human evaluation on this task of using semantic axes to differentiate and characterize occupations. Three student annotators examined the top three poles retrieved by each axis-building approach and ranked these outputs based on semantic relatedness to occupation categories (Appendix B). These annotators had fair agreement, with an average Kendall’s W of 0.629 across categories and experiments. Though GLOVE is a competitive baseline, z-scored BERT-PROB is the highest-ranked approach overall (Table 3). This suggests that more self-consistent axes also produce measurements that better reflect human judgements of occupations’ general meaning.