Others, similar to (but not exactly the same as) those in [11]. We provide the reasoning
behind our choice of these specific types and their justification in Section 2.2.
Text classification has long been an active area of research, since classification
helps users process large amounts of content efficiently. Finding actionable
comments on social media (tweets) was addressed in [13] using novel lexicon
features. A specificity score was explored in [3] in employee satisfaction survey
and product review settings to identify actionable suggestions and grievances
(complaints) for improvement. In yet another work [9], the actionability of code
review comments is investigated using lexical features. These works address only
the actionability aspect of our problem, and among them only [9] makes its
dataset publicly available.
Other binary classification tasks in prior work include question classification [15],
agreement/disagreement classification [1], and suggestion/advice mining [4]. How-
ever, such binary classifications only provide information along a single dimension
in isolation and fall short of the more extensive categorization done in [10],
where the authors investigated comments on product reviews in an e-commerce
setting. Again, the datasets are not publicly available, and the proposed
categories are not comprehensive.
OpenReview is a popular online forum for reviewing research papers, and our
choice to gather data from this forum is motivated by a comprehensive study
analyzing the review process [14]. The PeerRead dataset [6] also consolidates
reviews from several conferences. Our dataset provides finer-grained annotation,
with two labels per review comment sentence, and thereby opens up a new
research direction.
Our key contributions in this paper are:
– A review comment dataset consisting of 1,250 comments labeled for action-
ability and type. The dataset also includes ∼52k unlabeled (but otherwise
preprocessed) comments for future extensions and/or semi-supervised
approaches.
– A taxonomy of review comment types.
– Strong baselines for the proposed dataset.
2 Dataset: ReAct
While the prior art focuses on feature engineering and model architectures, we
note a lack of publicly available datasets for this problem. This section de-
scribes how we arrive at the proposed annotated dataset, ReAct.
In this paper, we use Fleiss' kappa κ [5] as the measure of inter-annotator
agreement. It quantifies the level of agreement between two or more annotators
when the response variable is measured on a categorical scale.
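To make the agreement measure concrete, the following is a minimal sketch (not the authors' code) of how Fleiss' kappa can be computed from an annotation count matrix; the toy matrix and binary label setup are purely illustrative:

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (n_items x n_categories) count matrix,
    where counts[i, j] is the number of annotators who assigned
    item i to category j. Assumes every item is rated by the same
    number of annotators."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item observed agreement P_i.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement P_e from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 5 sentences, 3 annotators, binary
# actionable/non-actionable labels (illustrative only).
toy = np.array([[3, 0], [2, 1], [3, 0], [0, 3], [1, 2]])
print(round(fleiss_kappa(toy), 3))  # 0.444

κ = 1 indicates perfect agreement, while κ ≤ 0 indicates agreement no better than chance.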
2.1 Raw Data Collection and Preprocessing
The proposed dataset is gathered from OpenReview, an online public forum
where research papers are reviewed and discussed. Multiple anonymous reviewers