
these questions on electoral programs from the German 2021 elections, comparing party similarities against a ground truth built from structured data. We find that our hypothesis is borne out: we can achieve competitive results in modelling party proximity with textual data, provided that the text representations are optimized to capture the differences across parties and normalized to follow a distribution that is appropriate for computing text similarity. More surprisingly, we find that completely unstructured data reach higher correlations than more informed settings that consider exclusively claims and/or their policy domain. We make our code and data available for replicability.¹

¹https://github.com/tceron/capture_similarity_between_political_parties.git
Paper structure. The paper is structured as follows. Section 2 provides an overview of related work. Section 3 describes the data we work with and our ground truth. Section 4 presents our modeling approach. Sections 5 and 6 discuss the experimental setup and our results. Section 7 concludes.
2 Related Work
2.1 Party Characterization
The characterization of parties is an important topic in political science, and it has previously been attempted with NLP models. Most studies, however, have focused on methods to place parties along the left-to-right ideological dimension. An early example is Laver et al. (2003), who investigate the scaling of political texts associated with parties (such as manifestos or legislative speeches) with a supervised bag-of-words approach, using position scores provided by human domain experts. Others have instead implemented unsupervised word-frequency methods for party positioning, in order to avoid picking up on biases in the annotated data and to scale up to large amounts of text from different political contexts (Slapin and Proksch, 2008). More recent studies have sought to overcome the drawbacks of word-frequency models, such as their reliance on topics and their failure to recognize synonymous words as similar: Glavaš et al. (2017) and Nanni et al. (2022), for example, combine distributional semantics methods with a graph-based score propagation algorithm to capture party positions on the left-right dimension.
Our study differs from previous ones in two main aspects. First, our aim is not to place parties along a left-to-right political dimension but to assess party similarity in a latent multidimensional space of policy positions and ideologies. Second, our focus is not on the use of specific vocabulary, but on representations of whole sentences. In other words, our proposed models work well if they manage to learn how political viewpoints are expressed at the sentence level in party manifestos.
2.2 Optimizing Text Representations for Similarity
Fine-tuning. Recent years have seen rapid advances in the area of neural language models, including models such as BERT, RoBERTa, or GPT-3 (Devlin et al., 2019; Liu et al., 2020; Brown et al., 2020). The sentence-encoding capabilities of these models make them generally applicable to text classification and similarity tasks (Cer et al., 2018). For both classification and similarity, it has been found that pre-trained models already show respectable performance, but that fine-tuning them on task-related data is crucial to optimize the models' predictions – essentially telling the model which aspects of the input matter for the task at hand.
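For similarity, this usage can be sketched as follows: a minimal illustration assuming the sentence-transformers library, where the checkpoint name and example sentences are purely illustrative. A pre-trained encoder maps sentences to vectors, and similarity is read off as the cosine between them.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative checkpoint; any pre-trained sentence encoder would do.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

sentences = [
    "We will raise the minimum wage.",       # invented example sentences
    "The minimum wage must be increased.",
    "We propose cutting corporate taxes.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Pairwise cosine similarities; the first two sentences should score highest.
print(util.cos_sim(embeddings, embeddings))
```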
On the similarity side, a well-known language model is Sentence-BERT (SBERT; Reimers and Gurevych, 2019), a siamese and triplet network based on BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2020) which aims at better encoding the similarities between sequences of text. SBERT comes with its own fine-tuning schema, which is informed by ranked pairs or triplets and tunes the text representations to respect the preferences expressed by the fine-tuning data. Of course, this raises the question of how to obtain such fine-tuning data: the study experiments both with manually annotated datasets (for entailment and paraphrasing tasks) and with heuristic document structure information, assuming that sentences from the same Wikipedia section are semantically closer and sentences from different sections are further away.
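The triplet variant of this schema might look as follows: a sketch using the sentence-transformers library, in which the `triplets` iterable is a hypothetical stand-in for whatever (anchor, positive, negative) tuples the heuristic produces.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a pre-trained encoder; mean pooling is added automatically.
model = SentenceTransformer("bert-base-uncased")

# With the Wikipedia heuristic, the positive comes from the same section
# as the anchor and the negative from a different section.
train_examples = [
    InputExample(texts=[anchor, positive, negative])
    for anchor, positive, negative in triplets  # hypothetical iterable
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# Fine-tune so that anchors end up closer to positives than to negatives.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```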
Parallel results are reported by Gao et al. (2021) for their SimCSE model, which reaches even better results when fine-tuned with contrastive learning: they likewise compare a setting based on manually annotated data from an inference dataset with a heuristic setting that pairs each sentence with its own dropout-perturbed encoding as a positive example and treats the other sentences in the batch as negative examples.
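This dropout heuristic amounts to a standard contrastive (InfoNCE) objective. A minimal PyTorch sketch, where `encode` is a hypothetical sentence encoder kept in training mode so that each call applies a different dropout mask:

```python
import torch
import torch.nn.functional as F

def simcse_loss(encode, sentences, temperature=0.05):
    # Two passes over the same batch; dropout makes the views differ.
    z1 = encode(sentences)  # (batch_size, dim), first dropout view
    z2 = encode(sentences)  # (batch_size, dim), second dropout view
    # Cosine similarity between every view-1 / view-2 pair in the batch.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1)
    # Each sentence's own second view (the diagonal) is its positive;
    # all other sentences in the batch serve as in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim / temperature, labels)
```

The temperature of 0.05 follows the value used by Gao et al. (2021).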