
Category          | Occupation Experiment               | Person Experiment
Writing           | creative, fanciful, fictive         | formal, logical, discursive + folksy, unceremonious, casual + ignoble, common, plebeian
Entertainment     | transcribed, taped, recorded        | structural, constructive, creative + trademarked, branded, copyrighted + emotional, soupy, slushy
Art               | unostentatious, aesthetic, artistic | creative, fanciful, fictive + activist, active, hands-on + practiced, proficient, adept
Health            | unhealthy, pathologic, asthmatic    | rehabilitative, structural, constructive + confirmable, empirical, experiential + teetotal, dry, drug-free
Agriculture       | drifting, mobile, unsettled         | rustic, agrarian, bucolic + boneless, deboned, boned - rehabilitative, structural, constructive
Government        | amenable, answerable, responsible   | policy-making, political, governmental + respectful, deferential, honorific + amenable, answerable, responsible
Sports            | spry, gymnastic, sporty             | zealous, ardent, enthusiastic - amenable, answerable, responsible - subject, subservient, dependent
Engineering       | formal, logical, discursive         | rehabilitative, structural, constructive + coeducational, integrated, mixed + advanced, high, graduate
Science           | humanistic, humane, human-centered  | zealous, ardent, enthusiastic + humanistic, humane, human-centered + stoic, unemotional, chilly
Math & Statistics | enumerable, estimable, calculable   | formal, logical, discursive + enumerable, estimable, calculable - amenable, answerable, responsible
Social Sciences   | humanistic, humane, human-centered  | relational, relative, comparative + significant, portentous, probative + humanistic, humane, human-centered
Table 2: The top two z-scored BERT-PROB axis poles, ordered from left to right, for each occupation category and
experiment. Each pole is represented by three example adjectives drawn from the set used to construct that pole.
Since the person experiment compares each occupation category to all others, + or - indicates the direction of the
shift in axis similarity. For example, sports occupations are still closer to responsible than irresponsible, just less
so (-) than other occupations.
yses, and human-reported associations (An et al.,
2018; Kwak et al., 2021; Kozlowski et al., 2019).
We perform external validation of self-consistent
axes on a dataset where people appear in a variety
of well-defined and known contexts: occupations
from Wikipedia. We conduct two main experi-
ments. In the first, we test whether contextualized
axes can detect differences across occupation terms,
and in the second, we investigate whether they can
detect differences across contexts.
4.1 Data
We collect eleven categories of unigram and bigram
occupations from Wikipedia lists: Writing, Enter-
tainment, Art, Health, Agriculture, Government,
Sports, Engineering, Science, Math & Statistics,
and Social sciences (Appendix A). The number of
occupations per category ranges from 3 in Math &
Statistics to 48 in Entertainment, with an average of
27.2. We use the MediaWiki API to find Wikipedia
pages for occupations in each list if they exist and
follow redirects when necessary (e.g. Blogger redirects to Blog). For each occupation's singular form, we extract the sentences on its page that contain it. In total, we have 3,015 sentences for 300 occupations.
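The page lookup can be sketched against the standard MediaWiki action=query endpoint; the helper below builds the query parameters and resolves the final title from a decoded JSON response. The response literal in the test is illustrative of the API's shape, not a recorded reply, and the function names are ours.

```python
# Sketch of the Wikipedia page lookup, assuming the standard MediaWiki
# action=query endpoint with redirect resolution enabled.

API_URL = "https://en.wikipedia.org/w/api.php"

def build_query(title):
    # Parameters for a title lookup that also follows redirects
    # (e.g. Blogger -> Blog).
    return {
        "action": "query",
        "titles": title,
        "redirects": 1,
        "format": "json",
    }

def resolve_title(response):
    """Return the resolved page title from a query response, or None.

    MediaWiki lists resolved pages under query.pages, keyed by page id;
    a page that does not exist carries a "missing" key.
    """
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" not in page:
            return page.get("title")
    return None
```

Sending `build_query("Blogger")` to `API_URL` and passing the parsed JSON to `resolve_title` would yield the redirect target page.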
4.2 Term-level experiment (occupations)
Each occupation is represented by a pre-trained
GloVe embedding or a BERT embedding averaged
over all occurrences on its page. If an axis uses z-scored adjective embeddings, we also z-score the occupation embeddings before comparing them to it. We assign each occupation to whichever pole of the axis it is closer to under cosine similarity. Top poles are highly related to their target occupation category, as the examples for z-scored BERT-PROB in Table 2 show.
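The pole assignment described above can be sketched as follows, assuming each pole vector is the mean embedding of the adjectives on that side of the axis (the function and variable names are ours):

```python
import numpy as np

def zscore(vecs, mean, std):
    # Standardize embeddings with the same statistics used for the
    # axis's adjective embeddings, so both sides are comparable.
    return (vecs - mean) / std

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def closer_pole(occupation_vec, left_pole, right_pole):
    """Return 'left' or 'right' for whichever pole of the axis the
    occupation embedding is closer to under cosine similarity."""
    if cosine(occupation_vec, left_pole) >= cosine(occupation_vec, right_pole):
        return "left"
    return "right"
```

For instance, an occupation embedding that points toward the "creative, fanciful, fictive" side of a creative-uncreative axis would be assigned that pole.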
Method         | Occupation Experiment | Person Experiment
GLOVE          | 3.485 (±0.491)        | -
BERT-DEFAULT   | 3.576 (±0.429)        | 2.697 (±0.361)
BERT-DEFAULT_z | 2.636 (±0.459)        | 2.485 (±0.367)
BERT-PROB      | 3.333 (±0.473)        | 2.667 (±0.363)
BERT-PROB_z    | 1.970 (±0.297)        | 2.152 (±0.404)
Table 3: Average rank of each axis-building method for
each experiment, across human evaluators and occupa-
tion categories. 95% CI in parentheses.
One limitation for interpretability is that word
embeddings’ proximity can reflect any type of se-
mantic association, not just that a person actually
has the attributes of an adjective. For example,
adjectives related to unhealthy are highly associ-
ated with Health occupations, which can be ex-
plained by doctors working in environments where
unhealthiness is prominent. Therefore, embedding
distances only provide a foggy window into the na-
ture of words, and this ambiguity should be consid-
ered when interpreting word similarities and their
implications. This limitation applies to both static
embeddings and their contextualized counterparts.
We conduct human evaluation on this task of
using semantic axes to differentiate and charac-
terize occupations. Three student annotators ex-
amined the top three poles retrieved by each axis-
building approach and ranked these outputs based
on semantic relatedness to occupation categories
(Appendix B). These annotators had fair agreement, with an average Kendall's W of 0.629 across categories and experiments. Though GLOVE is a competitive baseline, z-scored BERT-PROB is the highest-ranked approach overall (Table 3). This suggests that more self-consistent axes also produce measurements that better reflect human judgements of occupations' general meaning.