
Input    P@5      P@3      nDCG@5   nDCG@3
Tweets   0.3922   0.2745   0.2733   0.2280
Spans    0.4407   0.3390   0.3038   0.2521

Table 1: nDCG@k and P@k scores for tweets and spans using the BM25 retrieval system and the CORD-19 dataset.
normalized Discounted Cumulative Gain (nDCG)
scores and report them in Table 1. For comparison,
we consider two different top-k settings (k=3 and
k=5). We begin by examining the retrieval
performance using P@k, which measures the fraction
of relevant documents retrieved in the top-k set.
Span-based document retrieval consistently improves
precision scores compared to tweet-based retrieval.
For nDCG@5, we find that span-based retrieval
outperforms tweet-based retrieval by more than 3%.
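For reference, a sketch of the standard definitions behind these two metrics (our formulation, not reproduced from the paper); here rel_i denotes the relevance of the document at rank i, and IDCG@k is the DCG of the ideal ranking:
\[
\mathrm{P}@k = \frac{1}{k}\sum_{i=1}^{k} \mathbb{1}\!\left[\mathrm{rel}_i > 0\right],
\qquad
\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k},
\quad \text{where} \quad
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}.
\]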
When we limit the retrieval depth to 3, we observe a
similar pattern. This demonstrates that entire posts
contain considerable extraneous information, which
frequently impedes the performance of evidence
retrieval systems, a prerequisite for both automated
and manual fact-checking. In summary, our hypothesis
holds: span-based document retrieval yields better
precision as well as nDCG scores. This attests to the
feasibility and importance of the claim span
identification task.
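To make the setup concrete, the following is a minimal sketch of the retrieval-and-evaluation loop discussed above, using the open-source rank_bm25 package. The toy corpus, queries, and relevance labels are illustrative assumptions, not the actual CORD-19 data or the authors' pipeline.

```python
# Minimal sketch of comparing tweet-based vs. span-based BM25 retrieval.
# Assumes the rank_bm25 package (pip install rank-bm25); corpus, queries,
# and relevance labels below are illustrative placeholders only.
import math
from rank_bm25 import BM25Okapi

corpus = [
    "masks reduce transmission of respiratory viruses",
    "vitamin c does not cure the common cold",
    "vaccines underwent large randomized controlled trials",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def precision_at_k(ranked_rels, k):
    """Fraction of relevant documents among the top-k results."""
    return sum(ranked_rels[:k]) / k

def ndcg_at_k(ranked_rels, k):
    """nDCG@k with the standard log2 position discount."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Query once with the full tweet and once with only its claim span.
tweet = "lol so apparently masks totally reduce virus transmission??"
span = "masks reduce virus transmission"
relevance = {0: 1, 1: 0, 2: 0}  # doc index -> binary relevance (assumed)

for name, query in [("tweet", tweet), ("span", span)]:
    scores = bm25.get_scores(query.split())
    ranking = sorted(range(len(corpus)), key=lambda i: -scores[i])
    rels = [relevance[i] for i in ranking]
    print(name, "P@3 =", precision_at_k(rels, 3),
          "nDCG@3 =", round(ndcg_at_k(rels, 3), 4))
```

The intuition the sketch captures is the same as in the experiment: the noisy tokens in the full tweet dilute the BM25 term-matching signal, whereas the extracted span concentrates it on the claim-bearing words.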
3 Related Work
Claims on Social Media.
The prevailing research on claims can be divided
into three categories: claim detection (Levy et al.,
2014; Chakrabarty et al., 2019; Gupta et al., 2021),
claim check-worthiness (Jaradat et al., 2018; Wright
and Augenstein, 2020), and claim verification (Zhi
et al., 2017; Hanselowski et al., 2018; Soleimani
et al., 2020). Bender et al. (2011) pioneered the
efforts in claim detection by introducing the AAWD
corpus. Subsequent studies largely relied on
linguistically motivated features such as sentiment,
syntax, context-free grammars, and parse trees
(Rosenthal and McKeown, 2012; Levy et al., 2014;
Lippi and Torroni, 2015).
Recent work in claim detection has embraced large
language models (LMs). Chakrabarty et al. (2019)
reinforced the power of fine-tuning: their ULMFiT
LM, fine-tuned on a large Reddit corpus of about 5M
opinionated claims, showed notable improvements on
claim detection benchmarks. Gupta et al. (2021)
proposed a generalized claim detection model for
detecting claims independent of their source. They
handled structured and unstructured data in
conjunction by training a blend of linguistic
encoders (POS and dependency trees) and a contextual
encoder (BERT) to exploit the input text's semantics
and syntax. As LMs incur significant computational
overhead, Sundriyal et al. (2021) addressed this
issue and proposed a lighter framework that
constructs discernible feature spaces. The
CheckThat! Lab's CLEF-2020 shared task
(Barrón-Cedeño et al., 2020) has garnered the
attention of several researchers. Williams et al.
(2020) won the task by fine-tuning RoBERTa (Liu
et al., 2019), augmented with mean pooling and
dropout. Nikolov et al. (2020) ranked second with
their out-of-the-box RoBERTa vectors supplemented
with Twitter metadata.
Span Identification.
Zaidan et al. (2007) introduced the concept of
rationales: highlighted text segments that support a
label's judgment. Trautmann et al. (2020) released
the AURC-8 dataset with token-level span annotations
for the argumentative components of stance, along
with their corresponding labels. Mathew et al.
(2021) proposed a quality corpus for explainable
hate speech identification with token-level
annotations. The SemEval community has initiated
fine-grained span identification in other domains of
argument mining, such as toxic comments (Pavlopoulos
et al., 2021) and propaganda techniques (Da San
Martino et al., 2020). These shared tasks amassed
many solutions built on transformers (Chhablani
et al., 2021), convolutional neural networks (Coope
et al., 2020), data augmentation techniques (Rusert,
2021; Pluciński and Klimczak, 2021), and ensemble
frameworks (Zhu et al., 2021a; Nguyen et al., 2021).
Wührl and Klinger (2021) is the closest study to
ours; they compiled a corpus of around 1.2k
biomedical tweets with claim phrases. In summary,
the existing literature on claims concentrates
entirely on sentence-level claim identification and
does not investigate eliciting fine-grained claim
spans. In this work, we endeavor to move from
coarse-grained claim detection to fine-grained claim
span identification. We consolidate a large manually
annotated Twitter dataset for the claim span
identification task and benchmark it with various
baselines and a dedicated description-based model.
4 Dataset
Over the past few years, several claim detection
datasets have been released (Rosenthal and McK-