
(2020) used graph-based methods to build text networks with
words, documents and labels and propagate labeling informa-
tion along the graph via embedding learning. Han and Shen
(2016) encoded weakly supervised information in positive
unlabelled learning tasks into pairwise constraints between training instances imposed on the graph embedding. Recently, Islam and Goldwasser (2022a) proposed a weakly supervised graph embedding-based EM-style framework to characterize user types on social media. Our embedding model is similar
to contrastive learning-based embedding (Wu et al. 2020;
Giorgi et al. 2020). However, contrastive learning is self-
supervised, where labels are generated from the data without
any manual or weak label sources. In our case, we generate
the label using weak supervision. Our work is also closely
related to the entity-targeted sentiment analysis (Mohammad
et al. 2016; Field and Tsvetkov 2019; Mitchell et al. 2013;
Meng et al. 2012). In our work, we use weak supervision to identify the stance and issue of political ads and to analyze political
campaigns. To the best of our knowledge, this is the first
work to utilize a weakly supervised graph embedding based
framework to analyze political campaigns on social media.
Data
We collect around 0.8 million political ads from January–October 2020 using the Facebook Ad Library API with the search terms 'biden', 'harris', 'trump', and 'pence'. All advertisements are written in English. For each ad, the API provides the ad ID, title, ad body and URL, ad creation time, the time span of the campaign, the Facebook page authoring the ad, the funding entity, and the cost of the ad (given as a range). The API also provides information on the users who have seen the ad (called 'impressions'): the total number of impressions (given as a range; we take the average of the endpoints of the range), and the distribution of impressions broken down by gender (male, female, unknown), age (7 groups), and location down to states in the USA. The collected ads contain duplicate content because the same ad is targeted to different regions and demographics under distinct ad IDs. We have 35,327 ads with distinct content and 5,431 unique funding entities, of which 537 explicitly mention candidate names and/or party affiliations, e.g., BIDEN FOR PRESIDENT, DONALD J. TRUMP FOR PRESIDENT, INC.
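Collapsing a reported impressions range to the average of its endpoints, as described above, can be sketched as follows (the function name and the example range are our own illustration, not values from the API):

```python
def impressions_midpoint(lower: int, upper: int) -> float:
    """Collapse an impressions range reported by the API into a single
    value by averaging the range's endpoints."""
    return (lower + upper) / 2

# A hypothetical ad whose impressions are reported as 1,000-4,999.
print(impressions_midpoint(1000, 4999))  # → 2999.5
```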
Holdout Data
For validation purposes, we manually annotate 667 ads for stances and issues. We consider 4 stances, 'pro-biden', 'pro-trump', 'anti-biden', and 'anti-trump', and 13 issues§: 'abortion', 'covid', 'climate', 'criminal justice reform, race, law & order', 'economy and taxes', 'education', 'foreign policy', 'guns', 'healthcare', 'immigration', 'supreme court', 'terrorism', and 'lgbtq'. We also mark 'non-stance' and 'non-issue' ads. Two annotators from the Computer Science department manually annotate a subset of ads to calculate inter-annotator agreement using Cohen's Kappa coefficient (Cohen 1960). This subset has inter-annotator agreement of 77.50% for stance and 69.60% for issue, which constitutes substantial agreement. Disagreements are resolved by discussion.
§https://ballotpedia.org/
ISSUE (UNI, BI, TRI)
Abortion (56, 20, 1)
Covid (52, 23, 5)
Climate (66, 22, 3)
Criminal justice reform, race, law & order (93, 26, 5)
Economy & taxes (41, 16, 2)
Education (62, 22, 2)
Foreign policy (95, 31, 6)
Guns (92, 20, 6)
Healthcare (62, 21, 4)
Immigration (78, 25, 3)
Supreme court (80, 25, 4)
Terrorism (73, 19, 3)
LGBTQ (55, 12, 1)

Table 1: Number of unigrams, bigrams, and trigrams in each issue.
The rest of the data was annotated by one graduate student
from the Computer Science department.
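The agreement figures above are Cohen's kappa values; a minimal computation over two annotators' label sequences can be sketched as follows (the example labels are hypothetical, not from our holdout set):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical stance labels from two annotators on five ads.
ann1 = ["pro-biden", "anti-trump", "pro-trump", "pro-biden", "anti-biden"]
ann2 = ["pro-biden", "anti-trump", "pro-trump", "anti-trump", "anti-biden"]
print(round(cohens_kappa(ann1, ann2), 4))  # → 0.7368
```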
Methodology
We represent political advertising activity on social media as
a graph, connecting funding entities to their ads. We repre-
sent the outcome of our analysis, stance and issue predictions,
as separate label nodes in the graph connected via edges to
ads and funding entities. Each issue label-node is associated
with an n-gram lexicon, a set of nodes representing lexical indicators for the issue. Based on known associations between funding entities and stances, we associate 10% of the funding entities and their ads with stance labels. The lexicon and observed stance relations act as a weak form of supervision for graph embedding. Our model learns to generalize the stance predictor to new ads, and by contextualizing the lexicon n-grams based on their occurrences in ads, it learns to associate other ads with the relevant issue even when the lexicon items are not present. This setting is illustrated in Fig. 1. Note that each ad can be associated with multiple issues and stances (e.g., pro-biden and anti-trump).
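A minimal sketch of this graph construction, using only the standard library; the node names, ad IDs, and lexicon entries are illustrative, not our implementation or data:

```python
from collections import defaultdict

# Undirected adjacency sets for the heterogeneous ad graph.
graph = defaultdict(set)

def connect(u, v):
    graph[u].add(v)
    graph[v].add(u)

# Funding entities are connected to the ads they paid for.
connect("funder:Biden Victory Fund", "ad:101")
connect("funder:Keep Trump in office", "ad:202")

# A small fraction (~10%) of funders and their ads carry observed
# stance labels, which act as weak supervision for the embedding.
connect("stance:pro-biden", "funder:Biden Victory Fund")
connect("stance:pro-biden", "ad:101")

# Each issue label node links to its n-gram lexicon nodes, and each
# lexicon n-gram links to the ads in which it occurs.
connect("issue:healthcare", "ngram:medicare")
connect("ngram:medicare", "ad:101")

print(len(graph))  # total number of nodes
```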
Issue Lexicon
To create the issue lexicon, we collect 30 news articles covering each issue from left-leaning, right-leaning, and neutral news media. We know the news source bias from https://mediabiasfactcheck.com/. We calculate Pointwise Mutual Information (PMI) (Church and Hanks 1990) to identify issue-specific lexicons. We calculate the PMI of an n-gram w with issue i as PMI(w, i) = log(P(w|i) / P(w)). To compute P(w|i), we take all news articles related to issue i and compute count(w) / count(all n-grams). We have 30 news articles per issue. P(w) is computed by counting the n-gram w over the whole corpus (390 news articles). We assign each n-gram to the issue with the highest PMI and build an n-gram lexicon for each issue. Table 1 shows the number of unigrams, bigrams, and trigrams with PMI ≥ 0.5 per issue. In this paper, we use only unigrams, resulting in 905 issue-indicating words.
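The lexicon construction described above can be sketched as follows; the toy documents and tokenization are our own illustration, not the news-article corpus:

```python
import math
from collections import Counter

def build_lexicon(articles_by_issue, min_pmi=0.5):
    """Assign each unigram w to the issue i maximizing
    PMI(w, i) = log(P(w|i) / P(w)), keeping it if PMI >= min_pmi."""
    per_issue = {i: Counter(w for doc in docs for w in doc)
                 for i, docs in articles_by_issue.items()}
    issue_totals = {i: sum(c.values()) for i, c in per_issue.items()}
    corpus = Counter()
    for c in per_issue.values():
        corpus.update(c)
    total = sum(corpus.values())

    lexicon = {i: set() for i in articles_by_issue}
    for w, freq in corpus.items():
        p_w = freq / total                       # P(w) over the whole corpus
        best_issue, best_pmi = None, float("-inf")
        for i, c in per_issue.items():
            if c[w] == 0:
                continue
            pmi = math.log((c[w] / issue_totals[i]) / p_w)
            if pmi > best_pmi:
                best_issue, best_pmi = i, pmi
        if best_pmi >= min_pmi:                  # PMI threshold from Table 1
            lexicon[best_issue].add(w)
    return lexicon

# Toy corpus: one short tokenized "article" per issue.
docs = {
    "guns": [["firearm", "background", "checks", "the", "the"]],
    "healthcare": [["medicare", "premiums", "coverage", "the", "the"]],
}
lexicon = build_lexicon(docs)
```

Note that a word such as 'the', which is evenly spread across issues, gets PMI near zero everywhere and is filtered out by the threshold.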
Model
To identify stances and issues, we do the following:
Inferring Stance Labels Using Knowledge.
In some cases,
the names of funding entities capture their bias. For example:
‘Biden Victory Fund’, ‘Keep Trump in office’ clearly state