Mining Duplicate Questions of Stack Overflow
Mihir Sanjay Kale
mihirsak@andrew.cmu.edu
Anirudha Rayasam
arayasam@andrew.cmu.edu
Radhika Parik
rparik@andrew.cmu.edu
Pranav Dheram
pdheram@andrew.cmu.edu
Abstract
There has been a significant rise in the use of Community Question Answering
sites (CQAs) over the last decade owing primarily to their ability to leverage the
wisdom of the crowd. Duplicate questions have a crippling effect on the quality
of these sites. Tackling duplicate questions is therefore an important step towards
improving the quality of CQAs. To this end, we propose two neural-network-based
architectures for duplicate question detection on Stack Overflow. We also propose
explicitly modeling the code present in questions to achieve results that surpass the
state of the art.
1 Introduction
There has been a significant rise in the use of Community Question Answering sites (CQAs) over
the last decade, owing primarily to their ability to leverage a crowd’s collective intelligence. As CQAs
serve a growing number of people every year, the expanding audience also broadens the range of
topics they cover. CQAs provide a platform where
users can ask questions about a wide range of topics and receive replies from peers who are
better versed in these topics. Unfortunately, the growing user base has also made it harder
for these sites to maintain the quality of their content, and the deterioration in the quality of questions and
answers has become more noticeable. Duplicate questions have a crippling effect on the quality of
these sites. They increase the number of irrelevant search results, forcing users to search longer. They
also deter users from answering questions. Tackling duplicate questions is therefore an important
step towards improving the quality of CQAs.
Consequently, there has been much research in the ‘duplicate question detection’ domain (Muthmann and
Petrova, 2014; Ahasanuzzaman et al., 2016; Bogdanova et al., 2015). Much of the work so far
has used the text content of questions to build a predictive model for detecting duplicates. However,
to the best of our knowledge, there has been very little work analysing the code snippets in these
questions to identify duplicates. We believe that, in addition to using text content, we can leverage
the large number of code snippets available on sites like Stack Overflow to detect duplicates. We also
present two architectures that couple information from text with information from code to predict duplicate
questions.
The remainder of the paper is organized as follows. Section 2 briefly discusses previous work serving
as a motivation for our own work. Section 3 motivates the use of code as a feature. Section 4 presents
two baselines we compare our model with. Section 5 details our proposed approach and architecture.
Section 6 discusses metrics to evaluate the proposed approach, Section 7 details the proposed
timeline for carrying out this work, and Section 8 presents the conclusion.
arXiv:2210.01637v1 [cs.CL] 4 Oct 2022
2 Related Work
2.1 Predicting Q&A Quality
Shah and Pomerantz (2010) propose classification techniques to identify metrics that
predict answer quality on CQA sites. The study uses data from Yahoo Answers and builds classifiers
based on crowd-sourced reviews of the answers as well as on text-based and context-based features, such
as user profile, question length, number of answers, etc., extracted from the CQA website. The latter
approach was more promising in terms of predicting the asker’s satisfaction with a given answer
than using human reviews of novelty, relevance, etc. (which were found to be highly correlated). In
the context of detecting duplicate questions, robust techniques to gauge answer quality are useful in
ranking possible duplicate answers. The results of Shah and Pomerantz also served as a motivation to explore
features derived from question metadata and text, rather than the realm of human subjectivity.
Ravi et al. (2014) predict question quality using latent topic models. Their work goes
beyond the traditional bag-of-words model by leveraging latent structural clarity in questions. They
use Latent Dirichlet Allocation (LDA) to generate latent topics, which are used to build a global topic
model (GTM). Better latent features can be extracted using LSTMs by leveraging time dependencies
in text, which we discuss in more detail in our approach. Rajagopal et al. (2019) leverage
semantic role labels as features in a transfer learning setting for low-resource domains.
In AmazonQA, Gupta et al. (2019) develop a new dataset from e-commerce product
reviews and set up the premise for a new reading comprehension task. They experiment with several
models to benchmark performance on the new dataset, including language models and span-based
QA models. Genetic-algorithm-based feature selection, Anirudha et al. (2014b,a,c,d), is another useful
approach to identify the important features among a large set of available features.
2.2 Graphical Analysis of CQA Networks
Wang et al. (2013) study the popular CQA platform Quora and attempt to identify
factors behind its sustained growth and popularity. The authors compare Quora with Stack Overflow
in terms of the distribution of content and activity, concluding that the platforms are quite similar. This
observation renders the paper’s findings applicable to the Stack Overflow CQA ecosystem as well.
The paper analyses three networks (graphs) from the Quora platform: the user-topic graph, the social graph
and the question graph. User-topic interaction is identified as an important factor in determining user and
question popularity, as are the quality and amount of content a user contributes to the site. The paper
also explores graph clustering techniques to detect similar questions, an approach easily extensible
to the detection of duplicate questions. The authors conclude that CQA websites face the challenge
of pushing relevant content to their user base in the face of increased activity and, with it, increased
spam. This observation serves as an added motivation for our proposed research on detecting duplicate
questions, since question duplication is the most common reason behind question deletion
on Stack Overflow.
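To make the idea of clustering a question graph concrete, the following is a minimal sketch of one very simple form of graph clustering: connected components over a graph whose edges link questions judged similar. This is an illustrative assumption, not the technique used by Wang et al. (2013); the function name `cluster_questions` and the input format (integer question indices and a list of similar pairs) are hypothetical.

```python
def cluster_questions(num_questions, similar_pairs):
    """Group question indices into clusters.

    Assumption for illustration: two questions belong to the same cluster
    if they are connected by a chain of "similar" edges, i.e. clusters are
    the connected components of the similarity graph.
    """
    # Union-find: parent[i] points towards the representative of i's cluster.
    parent = list(range(num_questions))

    def find(x):
        # Walk up to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every pair of questions marked as similar.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster under its representative.
    clusters = {}
    for q in range(num_questions):
        clusters.setdefault(find(q), []).append(q)
    return sorted(clusters.values())
```

Each resulting cluster is a set of candidate duplicates; how the similarity edges are obtained (for example, by thresholding a text similarity score) is left open here.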
2.3 Duplicate Question Detection
There is a growing body of work on detecting duplicate questions in social media QA. Ahasanuzzaman
et al. (2016) specifically look at detecting duplicate questions on the Stack
Overflow platform. Their preliminary analysis establishes sufficient motivation for detecting and
removing duplicate questions, both to improve user experience on CQA sites and
to ease maintenance of the website. The paper proposes to detect duplicates by training a discriminative
classifier on duplicate and non-duplicate question pairs, using text-based similarity features extracted
from the question title, body, tags, etc. Source code is explicitly mentioned as a feature with strong
discriminative power, but is modeled only in terms of text similarity. This aspect of the paper
motivated us to further explore the use of source code snippets in Stack Overflow to detect duplicate
questions, especially when the underlying question similarity is indiscernible simply by looking at
other features such as question title, question tags, etc.
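As an illustration of what pairwise text-similarity features of this kind might look like, the sketch below computes a small feature vector for a pair of questions. It is a minimal example under stated assumptions, not the feature set of Ahasanuzzaman et al. (2016): the dictionary keys (`title`, `body`, `code`, `tags`), the tokenizer, and the choice of cosine and Jaccard similarity are all hypothetical. Note that code is compared here purely as text, which is the limitation discussed above.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pair_features(q1, q2):
    """Similarity features for one question pair (hypothetical field names)."""
    return {
        "title_cosine": cosine(tokenize(q1["title"]), tokenize(q2["title"])),
        "body_cosine": cosine(tokenize(q1["body"]), tokenize(q2["body"])),
        "code_cosine": cosine(tokenize(q1["code"]), tokenize(q2["code"])),
        "tag_jaccard": jaccard(q1["tags"], q2["tags"]),
    }
```

Feature dictionaries like these, computed for labelled duplicate and non-duplicate pairs, could then be fed to any discriminative classifier.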