The COVID That Wasn’t: Counterfactual Journalism Using GPT
Sil Hamilton
McGill University
sil.hamilton@mcgill.ca
Andrew Piper
McGill University
andrew.piper@mcgill.ca
Abstract
In this paper, we explore the use of large language models to assess human interpretations of real-world events. To do so, we use a language model trained prior to 2020 to artificially generate news articles concerning COVID-19 given the headlines of actual articles written during the pandemic. We then compare stylistic qualities of our artificially generated corpus with a real news corpus, in this case 5,082 articles produced by CBC News between January 23 and May 5, 2020. We find our artificially generated articles exhibit a considerably more negative attitude towards COVID and a significantly lower reliance on geopolitical framing. Our methods and results hold importance for researchers seeking to simulate large-scale cultural processes via recent breakthroughs in text generation.
1 Introduction
The rush to cover new COVID-19 developments as the virus spread across the world over the first half of 2020 induced a variety of editorial mandates from public broadcasters. Chief among these was a desire to mitigate shock from a public unaccustomed to large-scale public health emergencies of the calibre COVID-19 presented. This desire translated into systematic underreporting together with a reluctance to portray COVID-19 as the danger it was (Quandt et al., 2021; Boberg et al., 2020). Broadcasters in the United States (Zhao et al., 2020), the United Kingdom (Garland and Lilleker, 2021), and Italy (Solomon et al., 2021) all exhibited this phenomenon.

Although many studies have verified the above effects, few if any studies to date have considered alternative approaches the media could have taken in their portrayal of COVID-19. Evaluating these alternatives is critical given the close relationship between media framing, public opinion, and government policy (Ogbodo et al., 2020; Lopes et al., 2020).

In this paper we present a novel method of simulating media coverage of real-world events using Large Language Models (LLMs) as a means of interpreting news industry biases. LLMs have been used in a variety of settings to generate text for real-world applications (Meng et al., 2022; Drori et al., 2021). To our knowledge, they have not yet been used as a tool for critically understanding the interpretation of events through media coverage or other forms of cultural framing.

To do so, we use Generative Pre-trained Transformer 2 (GPT-2), which was trained on text produced prior to the onset of COVID-19, to explore how the Canadian Broadcasting Corporation (CBC) covered COVID and how else they might have reported on these breaking events. By generating thousands of simulated articles, we show how such “counterfactual journalism” can be used as a tool for evaluating real-world texts.
2 Background
The COVID-19 pandemic has given researchers a variety of opportunities to study human behavior in response to a major public health crisis. One core dimension of this experience is reflected in the changing role that the media has played in communicating information to the public in a quickly changing health environment (Van Aelst, 2021; Lilleker et al., 2021). Times of crisis enshrine the media as a valuable mediator between the public and government.

This changing role registers itself in the editorial policies at news corporations across the world. Research published over the past two years has confirmed that public news broadcasters in Australia, Sweden, and the United Kingdom all significantly altered their editorial style during COVID-19 in response to both societal and governmental pressures (Holland and Lewis, 2021; Shehata et al., 2021; Birks, 2021).
What this research has so far lacked is the ability to infer what could have been communicated, i.e. what losses were entailed in these editorial shifts. While a great deal of recent work has studied the biases intrinsic to large language models¹, no work to date has used LLMs to study the biases of human-generated text. Reporting on real-world events inevitably requires complex choices of selection and evaluation, i.e. which events and which actors to focus on along with modes of valuation surrounding those choices. Simulating textual production given similar prompts such as headlines can provide a means of better understanding the editorial choices made by news agencies.

In using a language model as a simulative mechanism, we draw on a long research tradition of using simulation to understand real-world processes. Simulation has proven a boon for those working in the sciences, including climate science and physics (Winsberg, 2010), and for those working in the social sciences, where agent-based social modelling has led to advances in understanding complex social phenomena (Squazzoni et al., 2014). We seek to bring these techniques to the study of cultural behavior, where simulation has historically seen less of an uptake (Manovich, 2016).

¹ See Garrido-Muñoz et al. (2021) for a recent survey of works investigating latent biases present within large language models.
3 Method
Our project consists of the following principal steps:

1. Create a news corpus drawn from our target time-frame (15 January to 5 May 2020) whose content is COVID-19 related.
2. Fine-tune a language model whose generative output is statistically similar to a random sampling of our news source published before our target time-frame, i.e. prior to COVID.
3. Using this model, generate full-length text articles using various prompts, including headlines and associated metadata.
4. Compare generated text articles with the original news corpus across key stylistic metrics.²

² We make our code available here.
3.1 Corpus
We first obtain a comprehensive collection of CBC News’ online articles concerning COVID-19 published between January and May 2020 from Kaggle (Han, 2021). Our corpus contains 5,114 articles, all in the form of a headline, subheadline, byline, date published, URL, and article text. Deduplicating and cleaning the corpus with a series of regex filters leaves 5,082 articles spread across the first four months of COVID-19.
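As an illustration of this cleaning step, consider the following minimal sketch in Python; the filename, column names, and regex patterns here are hypothetical stand-ins, and the actual filters are in our released code.

import re
import pandas as pd

# Hypothetical filename and columns for the Kaggle dump (Han, 2021).
df = pd.read_csv("cbc_news_covid.csv")

# Drop exact duplicates on headline and body text.
df = df.drop_duplicates(subset=["title", "text"])

# Strip recurring boilerplate with regex filters (patterns illustrative only).
boilerplate = re.compile(r"(?m)^(?:WATCH \|.*|Sign up for .*? newsletter\.?)$")
df["text"] = df["text"].str.replace(boilerplate, "", regex=True).str.strip()

print(len(df), "articles after cleaning")  # 5,082 for the corpus used here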
3.2 Language Model
We use a Transformer-based large language model (LLM) as our CBC simulacrum. We formalize our model as follows: we define an article as a chain of k tokens. Let X(d, θ) be a probability distribution representing the drawing of a token from the language model, where d is the article metadata and θ are the prior weights. The probability of drawing k tokens is then

\Pr(\bar{x}_k) = \prod_{i=1}^{k} \Pr\left( X(d, \theta) = x_i \mid \bar{x}_{i-1} \right) \tag{1}

where x_i is the ith element of the vector x̄, and x̄_{i-1} is the vector consisting of the first i−1 elements of x̄.
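For concreteness, Eq. (1) is the standard autoregressive factorization. Below is a minimal sketch of scoring a token chain under GPT-2 with the HuggingFace transformers library; the choice of library is an assumption about tooling, and the example headline is invented.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()

ids = tok("Ottawa confirms new cases of the novel virus",
          return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape (1, k, vocab_size)

# Pr(X(d, θ) = x_i | x̄_{i-1}) at each position i; summing the
# log-probabilities gives the log of the product in Eq. (1).
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print("log Pr(x̄_k) =", token_lp.sum().item())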
Selecting a pretrained language model suitable for use as a base with which to further train with specific writing samples is a non-trivial task given the plurality of large language models released in the past four years (HuggingFace, 2022). We surveyed models for candidates possessing the following qualities:

- the model must be neither excessive nor lacking in parameter count;
- domain-relevant samples must have been present in the pretraining corpus;
- and most importantly, the model must not be aware of COVID.
Keeping with the above requirements, we select the medium-sized Generative Pre-trained Transformer 2 (GPT-2) as distributed by OpenAI as our candidate model. We found the medium-sized GPT-2 model desirable because it is light enough to be fine-tuned with a single consumer-level GPU; CBC News was the 21st most frequent data source OpenAI used in producing its training set (Clark, 2022); and the model was trained in 2018, two years before the beginning of COVID-19.
3.2.1 Fine-tuning
Provided with sufficient context in the prompt, a freshly obtained GPT-2 model produces qualitatively convincing news article text. It will, however, periodically lose track of exactly which publication it is imitating, e.g. it can switch from sounding like CNN to CBC to BBC within a single text. For the purposes of comparison with a single news source, it is thus necessary to fine-tune the model with example texts encapsulating the desired editorial and writing style.

Fine-tuning is a two-step process: we first gather a sequence of texts best representing our target writing mode, and then fine-tune a stock GPT-2 model with this training dataset.
Training Dataset
We use a web scraper to extract a random selection of news articles published between 2007 and 2020 from CBC News’ website. We configure our scraper to pull the same metadata as our COVID-19 dataset: headline, subheadline, date, URL, and article text. We again deduplicate to reduce the possibility of overfitting our model. With this method we collect 1,368 articles with an average length of 660 words per article.
We next construct a dictionary structure both to formalize our generation targets and to provide GPT-2 with a consistent interface for logically linking together pieces of metadata. Previous research has indicated that fine-tuning LLMs with structured data aids the model in both understanding and reacting to meaningful keywords (Ueda et al., 2021). We therefore structure our fine-tuning data in a dictionary. We provide a template of our structure below.

{
    'title': 'Lorem ipsum...',
    'description': 'Lorem ipsum...',
    'text': 'Lorem ipsum...'
}

We produce one dictionary per article in our training set. We convert each dictionary to a string before appending it to a final dataset text file with which we train GPT-2.
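A minimal sketch of this serialization step follows; the loader helper is hypothetical, and the field names follow the template above.

# Write one stringified dictionary per scraped article to the training file.
articles = load_scraped_articles()  # hypothetical helper returning dicts

with open("finetune_dataset.txt", "w", encoding="utf-8") as f:
    for art in articles:
        entry = {
            "title": art["title"],
            "description": art["description"],
            "text": art["text"],
        }
        f.write(str(entry) + "\n")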
Training
With our training dataset in hand, we proceed to configure our training environment. We use an Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2e-4 and run the training process. Training the model for 20,000 steps over six hours results in a final model achieving an average training loss of 0.10.
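A compact sketch of such a fine-tuning loop under these settings, assuming a PyTorch and transformers stack (we do not prescribe a particular training harness, and the single-example batching here is deliberately simplified):

import torch
from torch.optim import Adam
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").train()
optimizer = Adam(model.parameters(), lr=2e-4)  # learning rate from above

entries = open("finetune_dataset.txt", encoding="utf-8").read().splitlines()

for step in range(20_000):  # 20,000 steps as reported above
    text = entries[step % len(entries)]
    ids = tok(text, return_tensors="pt",
              truncation=True, max_length=1024).input_ids
    loss = model(ids, labels=ids).loss  # causal LM loss over the whole entry
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()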
3.2.2 Model Hyperparameters
In addition to fine-tuning our model, we experiment with different hyperparameters and prompting strategies. Numerous prior studies have described the effects hyperparameter tuning has on the token generation process (van Stegeren and Myśliwiec, 2021; Xu et al., 2022). For our purposes, we use three prompting strategies when generating our synthetic news articles, along with one further parameter (temperature):
Standard Context
Only title and description metadata are used as the context d for the model.
Static Context
In addition to the standard context, we supply the model with an additional framework key containing a brief description of the COVID-19 pandemic found on the website of the Centers for Disease Control and Prevention (CDC) in May 2020. All generation iterations use the same description.
Rolling Context
We again supply the model with an additional framework key, but keep the description of COVID-19 contemporaneous with the date of the real article in question. We again use the CDC as a source, but instead use the Internet Archive’s Wayback Machine API to scrape dated descriptions.³
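A sketch of such a dated lookup against the Wayback Machine availability API (see footnote 3); the CDC URL and timestamp are illustrative, and a production version would need to handle missing snapshots.

import requests

def closest_snapshot(url: str, timestamp: str) -> str:
    """Return the archived snapshot of `url` nearest `timestamp` (YYYYMMDD)."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
    )
    return resp.json()["archived_snapshots"]["closest"]["url"]

# e.g. the CDC's COVID-19 page as archived near a given article's date
snap = closest_snapshot("https://www.cdc.gov/coronavirus/2019-ncov/index.html",
                        "20200315")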
Temperature
We manipulate the temperature hyperparameter during generation, shifting the temperature in half-percentage steps across the range 0.1 to 1. The temperature acts as a divisor applied to the logits before the softmax operation that produces the probability distribution, the effect of which is to control the overall likelihood of the most probable words. A high temperature results in a more dynamic and random word choice, while a lower temperature encourages those words which are most likely according to the model’s priors.
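A minimal sketch of this scaling (standalone, not tied to any particular generation loop):

import torch

def sample_next_token(logits: torch.Tensor, temperature: float) -> int:
    """Sample one token id from temperature-scaled next-token logits."""
    # Dividing the logits by the temperature before the softmax sharpens
    # the distribution for T < 1 and flattens it for T > 1.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()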
Models
Manipulating the above hyperparameters gives us the following model framework (a prompt-assembly sketch follows below):

- Model 1: headline-only, temperature between 0.1 and 1
- Model 2: static context, temperature between 0.1 and 1
- Model 3: rolling context, temperature between 0.1 and 1
³ https://archive.org/help/wayback_api.php
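To make the three conditions concrete, here is a hedged sketch of per-condition prompt assembly; the key names follow the dictionary template above, while the exact prompt layout is an assumption for illustration.

from typing import Optional

def build_prompt(article: dict, strategy: str,
                 covid_desc: Optional[str] = None) -> str:
    """Assemble a generation prompt for one of the three model conditions."""
    prompt = {"title": article["title"],
              "description": article["description"]}
    if strategy in ("static", "rolling"):
        # Static: one fixed CDC description; rolling: a description dated to
        # the article via the Wayback Machine lookup sketched earlier.
        prompt["framework"] = covid_desc
    # Drop the closing brace so the model continues by filling in 'text'.
    return str(prompt)[:-1] + ", 'text': '"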