2 Related Work
2.1 Predicting Q&A Quality
Shah and Pomerantz (2010) propose classification techniques to identify metrics that
predict answer quality on CQA sites. The study uses data from Yahoo Answers and builds classifiers
based on crowd-sourced reviews of the answers as well as text-based and context-based features, such
as user profile, question length, and number of answers, extracted from the CQA website. The latter
approach was more promising for predicting the asker’s satisfaction with a given answer
than using human reviews of novelty, relevance, etc. (which were found to be highly correlated). In
the context of detecting duplicate questions, robust techniques to gauge answer quality are useful for
ranking possible duplicate answers. The results of Shah and Pomerantz also motivated us to explore
features derived from question metadata and text, rather than subjective human judgments.
Ravi et al. (2014) predict question quality using latent topic models. Their work goes
beyond the traditional bag-of-words model by leveraging latent structural clarity in questions. They
use Latent Dirichlet Allocation (LDA) to generate latent topics, which are used to build a global topic
model (GTM). Better latent features can be extracted using LSTMs by leveraging sequential dependencies
in text, which we discuss in more detail in our approach. Rajagopal et al. (2019) leverage
semantic role labels as features in a transfer learning setting for low-resource domains.
In AmazonQA, Gupta et al. (2019) develop a new dataset from e-commerce product
reviews and set up the premise for a new reading comprehension task. They experiment with several
models to benchmark performance on the new dataset, including language models and span-based
QA models. Genetic-algorithm-based feature selection (Anirudha et al., 2014b,a,c,d) is another useful
approach for identifying the important features among a large set of available features.
2.2 Graphical Analysis of CQA Networks
Wang et al. (2013) study the popular CQA platform Quora and attempt to identify
factors behind its sustained growth and popularity. The authors compare Quora with Stack Overflow
in terms of the distribution of content and activity, concluding that the platforms are quite similar. This
observation suggests that the paper’s findings apply to the Stack Overflow CQA ecosystem as well.
The paper analyses three networks (graphs) from the Quora platform: the user-topic graph, the social
graph, and the question graph. User-topic interaction is identified as an important aspect in determining user and
question popularity, as are the quality and amount of content a user contributes to the site. The paper
also explores graph clustering techniques to detect similar questions, an approach easily extensible
to the detection of duplicate questions. The authors conclude that CQA websites face the challenge
of pushing relevant content to their user base in the face of increased activity and, with it, increased
spam. This observation adds motivation to our proposed research on detecting duplicate
questions, as question duplication is seen to be the most common reason behind question deletion
on Stack Overflow.
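To make the idea of clustering a question graph concrete, the following is a minimal sketch and not the method of Wang et al. (2013), whose clustering technique is not detailed here. It assumes a hypothetical mapping from question ids to topic labels, links two questions when they share at least `min_shared` topics (an illustrative threshold), and treats each connected component of the resulting graph as one cluster of similar questions.

```python
from itertools import combinations


def cluster_questions(question_topics, min_shared=2):
    """Group questions into connected components of a shared-topic graph.

    question_topics: dict mapping a question id to an iterable of topic labels
    (a hypothetical input format chosen for illustration).
    """
    # Union-find forest: every question starts in its own component.
    parent = {q: q for q in question_topics}

    def find(q):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    # Add an edge (merge components) when two questions share enough topics.
    for a, b in combinations(question_topics, 2):
        shared = set(question_topics[a]) & set(question_topics[b])
        if len(shared) >= min_shared:
            parent[find(a)] = find(b)

    # Collect the members of each component.
    clusters = {}
    for q in question_topics:
        clusters.setdefault(find(q), set()).add(q)
    return sorted(clusters.values(), key=lambda c: sorted(c))


if __name__ == "__main__":
    toy = {
        "q1": ["python", "list", "sorting"],
        "q2": ["python", "sorting", "lambda"],
        "q3": ["java", "generics"],
        "q4": ["lambda", "sorting", "java"],
    }
    print(cluster_questions(toy))
```

Connected components are the simplest possible clustering; they merge questions transitively, so a stricter threshold or a community detection method would be needed on a dense graph.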
2.3 Duplicate Question Detection
There is a growing body of work on detecting duplicate questions in social media QA. Ahasanuzzaman
et al. (2016) specifically look at detecting duplicate questions on the Stack
Overflow platform. Their preliminary analysis establishes sufficient motivation for detecting and
removing duplicate questions, both to improve the user experience on CQA sites and
to ease maintenance of the website. The paper proposes to detect duplicates by training a discriminative
classifier on duplicate and non-duplicate question pairs, using text-based similarity features extracted
from the question title, body, tags, etc. Source code is explicitly mentioned as a feature with strong
discriminative power, but is modeled only in terms of text similarity. This aspect of the paper
motivated us to further explore the use of source code snippets in Stack Overflow to detect duplicate
questions, especially when the underlying question similarity cannot be discerned simply by looking at
other features such as the question title and question tags.
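As an illustration of the kind of pairwise text-similarity features described above, the sketch below computes one similarity score per question field. It is a minimal example under stated assumptions, not the feature set of Ahasanuzzaman et al. (2016): the field names (`title`, `body`, `tags`, `code`), the tokenizer, and the choice of cosine and Jaccard similarity are all hypothetical. Note that the code field is compared purely as text, which is the limitation discussed above.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (term-frequency vectors)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def jaccard(items_a, items_b):
    """Jaccard overlap between two collections, treated as sets."""
    sa, sb = set(items_a), set(items_b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pair_features(q1, q2):
    """Build a feature dict for one question pair (hypothetical field names)."""
    return {
        "title_cosine": cosine(tokenize(q1["title"]), tokenize(q2["title"])),
        "body_cosine": cosine(tokenize(q1["body"]), tokenize(q2["body"])),
        "tag_jaccard": jaccard(q1["tags"], q2["tags"]),
        # Source code is treated as plain text here, nothing more.
        "code_cosine": cosine(tokenize(q1["code"]), tokenize(q2["code"])),
    }


if __name__ == "__main__":
    a = {"title": "Sort a list of dicts by key", "body": "How do I sort?",
         "tags": ["python", "sorting"], "code": "sorted(rows, key=f)"}
    b = {"title": "Sorting a list of dictionaries", "body": "I need to sort rows.",
         "tags": ["python", "list"], "code": "rows.sort(key=f)"}
    print(pair_features(a, b))
```

Feature dicts of this form, computed for labeled duplicate and non-duplicate pairs, could then be passed to any standard discriminative classifier.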