
a Chinese Mandarin-language version of Twitter (McCarthy
and Xiong 2022). We collect the posts of the accounts of
the seven different Chinese state media organizations from
our news article dataset (for the CGTN news organization,
we collect the Weibo posts of its CGTN and CGTN journalist group/@CGTN记者团 accounts). To help quantify the con-
nection of each of these media organizations to the Russian
government and Russian state media, we further scrape the
accounts of the Russian Embassy/@俄罗斯驻华大使馆,
Russia Today/@今日俄罗斯RT, and Sputnik News/@俄
罗斯卫星通讯社. Lastly, in addition to our Chinese news
organizations’ Weibo accounts and Russian state-sponsored
Weibo accounts, we collect the Weibo posts of the 200 users
who most prominently discussed the Russo-Ukrainian con-
flict at the end of February, as identified by Fung and Ji (2022). This list was manually created from users who
“actively posted about and ranked among the top posts of
trending hashtags related to the Russo-Ukrainian war.” Af-
ter combining our lists of Weibo users and removing in-
active and duplicate accounts, we had a total of 191 dis-
tinct accounts. For each account in our dataset, we scraped
the account on four occasions (March 14, March 28, April
06, and April 16) to ensure our dataset was comprehen-
sive. To scrape each Weibo account, we utilize the Python weibo-scraper tool (https://github.com/Xarrow/weibo-scraper). Ultimately, our dataset consists
of 191 different accounts and 343,435 distinct Weibo posts
between January 1 and April 15, 2022.
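While the collection script itself is not part of the paper, this step can be sketched as follows, assuming weibo-scraper's get_weibo_tweets_by_name generator; the account list is an illustrative subset, and the returned post field layout is an assumption that may differ by package version:

```python
# Illustrative sketch of the Weibo collection step (not the authors' exact
# script). Assumes weibo-scraper exposes get_weibo_tweets_by_name, which
# yields raw post dicts; the "mblog"/"id" field layout is an assumption.
from weibo_scraper import get_weibo_tweets_by_name

ACCOUNTS = ["CGTN记者团", "今日俄罗斯RT", "俄罗斯卫星通讯社"]  # illustrative subset

posts = {}  # keyed by post id so repeated scrape passes deduplicate
for name in ACCOUNTS:
    for card in get_weibo_tweets_by_name(name=name, pages=50):
        mblog = card.get("mblog", {})  # assumed envelope around the post
        if "id" in mblog:
            posts[mblog["id"]] = mblog

print(f"collected {len(posts)} distinct posts from {len(ACCOUNTS)} accounts")
```

Re-running this collection on each scrape date and keying on post id reproduces the deduplication across the four passes described above.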
Twitter Dataset. In addition to our Weibo dataset, we fur-
ther collect the tweets of the seven Chinese news outlets within our news article dataset (China Daily, CGTN, Global Times, China News Service, Xinhua, Peo-
ple’s Daily, and CCTV). Unlike with our Weibo dataset, we do
not collect the set of Chinese users who most prominently
discussed the Russo-Ukrainian conflict on Twitter (Twitter
has been banned in China since 2009 (Barry 2022)), limit-
ing our Twitter analysis to these seven major state-sponsored
Chinese outlets, which also regularly tweet. To investigate
these accounts’ connection to the Russian government and
Russian news media, we again collect the tweets of the Rus-
sian Embassy/@RussianEmbassy, Russia Today/@RT_com,
and Sputnik News/@SputnikInt. We collect the tweets of
each account using the Tweepy API (Roesslein 2009) on
four different occasions (March 06, March 13, April 02, and
April 16). Ultimately, our Twitter dataset consists of 62,717
unique tweets from 10 different accounts between January 1 and April 15, 2022.
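As a rough sketch of this collection (Tweepy v3-style calls; the credential strings are placeholders and the handle list is an illustrative subset):

```python
# Minimal sketch of the timeline collection with Tweepy (v3-style API).
# Credentials are placeholders; handles are an illustrative subset.
import tweepy

auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

HANDLES = ["ChinaDaily", "globaltimesnews", "RT_com", "SputnikInt"]

tweets = {}  # keyed by tweet id so the four scrape passes deduplicate
for handle in HANDLES:
    # user_timeline pages back through an account's recent tweets
    cursor = tweepy.Cursor(api.user_timeline, screen_name=handle,
                           tweet_mode="extended")
    for status in cursor.items():
        tweets[status.id] = status.full_text
```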
Pointwise Mutual Information. To determine different
news ecosystems’ associations with distinct words, we uti-
lize the normalized pointwise mutual information metric.
Pointwise mutual information (PMI) is an information-
theoretic measure for discovering associations amongst
words (Bouma 2009). However, as in Kessler (2017), rather than computing the pointwise mutual information between different words, we utilize this measure to quantify words’ association with different categories. In this
way, we seek to identify the characteristic words of each
ecosystem’s coverage of the Russo-Ukrainian War (i.e.,
Western, Chinese, and Russian media). We utilize the nor-
malized and scaled version of PMI to prevent our metric
from being biased towards rarely occurring words and to in-
crease interpretability. Scaled normalized PMI (NPMI) for a word $\text{word}_i$ and each category $C_j$ is calculated as follows:

$$\text{PMI}(\text{word}_i, C_j) = \log_2 \frac{P(\text{word}_i, C_j)}{P(\text{word}_i)\,P(C_j)}$$

$$\text{NPMI}(\text{word}_i, C_j) = \frac{\text{PMI}(\text{word}_i, C_j)}{-\log_2 P(\text{word}_i, C_j)}$$
where $P$ is the probability of occurrence and a scaling parameter $\alpha$ is added to the counts of each word. NPMI ranges between $(-1, 1)$. We choose $\alpha = 50$ given the size of our dataset (Turney 2001). An NPMI value of $-1$ indicates that the word and the category never occur together (given that we utilize the scaled version, this never occurs), $0$ indicates independence, and $+1$ indicates perfect co-occurrence (Bouma 2009). Finally, before computing NPMI
on our dataset, we first lemmatize and remove stop words as
in prior work (Zannettou et al. 2020).
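For concreteness, a minimal sketch of this scaled NPMI computation, assuming tokens have already been lemmatized and stop-word filtered (the interface is illustrative, not the authors' code):

```python
# Sketch of scaled NPMI between words and categories, following the
# formulas above. alpha is the additive scaling parameter (50 in the text).
import math
from collections import Counter

def scaled_npmi(docs_by_category, alpha=50):
    """docs_by_category: {category: [token_list, ...]} -> {(word, cat): score}."""
    joint = Counter()  # raw co-occurrence counts n(word, category)
    for cat, docs in docs_by_category.items():
        for doc in docs:
            joint.update((w, cat) for w in doc)
    vocab = {w for w, _ in joint}
    cats = list(docs_by_category)
    # Add alpha to every (word, category) cell before normalizing, which
    # keeps rarely occurring words from dominating the ranking.
    total = sum(joint.values()) + alpha * len(vocab) * len(cats)
    p_cat = {c: sum(joint[(w, c)] + alpha for w in vocab) / total
             for c in cats}
    scores = {}
    for w in vocab:
        p_w = sum(joint[(w, c)] + alpha for c in cats) / total
        for c in cats:
            p_wc = (joint[(w, c)] + alpha) / total
            pmi = math.log2(p_wc / (p_w * p_cat[c]))
            scores[(w, c)] = pmi / -math.log2(p_wc)  # normalize into (-1, 1)
    return scores
```

Ranking words by this score within each category then surfaces the vocabulary most characteristic of that ecosystem's coverage.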
Partially Labelled Dirichlet Allocation. In addition
to identifying words characteristic of each news ecosystem,
we also extract the set of topics that are distinctive to each
ecosystem. To do this, we utilize Partially Labelled Dirichlet
Allocation (PLDA). PLDA is an extension of the widely-
used topic analysis algorithm Latent Dirichlet Allocation
(LDA) (Ramage, Manning, and Dumais 2011). PLDA, like
LDA, assumes that each document is composed of a distribution of different topics (which are themselves distributions over words). However, unlike in LDA, each document draws its topics from a pool associated with
one or more of its specific labels. For example, a newspa-
per article from nytimes.com, which is labeled as “Western”,
can draw from a set of labeled topics associated with “West-
ern” (as opposed to an article from chinadaily.com.cn which
can draw from a set of labeled topics associated with “Chi-
nese”). In addition to drawing from the distribution of top-
ics associated with its labels, documents also further draw
from a pool of latent topics that are associated with every
document in the dataset. PLDA can thus model the topics
that are common to every document while also identifying
discriminating topics for each label (i.e., topics specific to
“Western”, “Russian”, “Chinese”).
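The text does not name a PLDA implementation; one way to realize this setup is tomotopy's PLDAModel (an implementation of Ramage, Manning, and Dumais's model), sketched here with toy documents and the topic counts reported below:

```python
# Hypothetical sketch of the PLDA setup with tomotopy; the paper does not
# specify its implementation, and the documents here are toy examples.
# latent_topics are shared across all documents; topics_per_label are
# reserved for each ecosystem label ("Western", "Chinese", "Russian").
import tomotopy as tp

mdl = tp.PLDAModel(latent_topics=300, topics_per_label=15)

corpus = [
    (["ukraine", "sanction", "nato", "invasion"], ["Western"]),
    (["negotiation", "security", "dialogue"], ["Chinese"]),
    (["donbass", "military", "operation"], ["Russian"]),
]
for tokens, labels in corpus:
    mdl.add_doc(tokens, labels=labels)

for _ in range(10):
    mdl.train(100)  # 1,000 Gibbs sampling iterations in total
```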
Again, when fitting our PLDA model, we first lemmatize
and remove stop words. When computing topics, we further weight words using term frequency-inverse document frequency (TF-IDF). Previous work has shown that this weight-
ing leads to more accurate topics (Zannettou et al. 2020).
To find the appropriate number of topics, we optimize the word2vec-based topic coherence score $C_v$, which measures the se-
mantic similarity among extracted topic words (Zannettou
et al. 2020). We utilize a baseline number of 300 latent top-
ics, varying the number of topics per label from 1 to 20. We
achieve the best coherence score of 0.46 with 15 topics as-
sociated with each label (345 total topics).
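A sketch of this selection step, using gensim's CoherenceModel with the $C_v$ measure as a stand-in for the coherence computation (the fit_plda helper is hypothetical, and the authors' exact coherence implementation is not specified):

```python
# Hypothetical sketch of the topic-number search: refit PLDA with k topics
# per label and score the extracted topic words with the C_v coherence.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def coherence_for(topic_words, texts):
    """topic_words: top-word lists, one per topic; texts: tokenized docs."""
    dictionary = Dictionary(texts)
    cm = CoherenceModel(topics=topic_words, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()

# best_k = max(range(1, 21),
#              key=lambda k: coherence_for(fit_plda(k), texts))  # fit_plda assumed
```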