Mining Duplicate Questions of Stack Overflow

Mihir Sanjay Kale

mihirsak@andrew.cmu.edu

Anirudha Rayasam

arayasam@andrew.cmu.edu

Radhika Parik

rparik@andrew.cmu.edu

Pranav Dheram

pdheram@andrew.cmu.edu

Abstract

There has been a significant rise in the use of Community Question Answering

sites (CQAs) over the last decade owing primarily to their ability to leverage the

wisdom of the crowd. Duplicate questions have a crippling effect on the quality

of these sites. Tackling duplicate questions is therefore an important step towards

improving the quality of CQAs. In this regard, we propose two neural-network-based

architectures for duplicate question detection on Stack Overflow. We also propose

explicitly modeling the code present in questions to achieve results that surpass the

state of the art.

1 Introduction

There has been a significant rise in the use of Community Question Answering sites (CQAs) over

the last decade owing primarily to their ability to leverage a crowd’s collective intelligence. As CQAs

continue to serve an increasing number of people every year, they also benefit from an expanding audience

and grow stronger in the number of topics they cover. CQAs provide a platform where

users can ask questions about a wide range of topics and receive replies from their peers who are

better versed in these topics. Unfortunately, the growing audience has also made it hard

for these sites to keep track of the quality of their content. The deterioration in the quality of questions and

answers has become increasingly noticeable. Duplicate questions have a crippling effect on the quality of

these sites. They increase the number of irrelevant search results, forcing users to search longer. They

also deter users from answering questions. Tackling duplicate questions is therefore an important

step towards improving the quality of CQAs.

Consequently, there has been much research on duplicate question detection (Muthmann and

Petrova, 2014; Ahasanuzzaman et al., 2016; Bogdanova et al., 2015). Much of the work so far

has used the text content of questions to build a predictive model for detecting duplicates. However,

to the best of our knowledge, there has been very little work analysing the code snippets in these

questions to identify duplicates. We believe that, in addition to using text content, we can leverage

the large number of code snippets available on sites like Stack Overflow to detect duplicates. We also

present two architectures that effectively couple data from text and data from code to predict duplicate

questions.

The remainder of the paper is organized as follows. Section 2 briefly discusses previous work serving

as a motivation for our own work. Section 3 motivates the use of code as a feature. Section 4 presents

two baselines we compare our model with. Section 5 details our proposed approach and architecture.

Section 6 discusses metrics to evaluate the proposed approach, Section 7 details the proposed

timeline for carrying out this work, and Section 8 presents the conclusion.

arXiv:2210.01637v1 [cs.CL] 4 Oct 2022

2 Related Work

2.1 Predicting Q&A Quality

Shah and Pomerantz (2010) propose classification techniques to identify metrics that

predict answer quality on CQA sites. The study uses data from Yahoo Answers and builds classifiers

based on crowd-sourced reviews of the answers as well as text-based and context-based features, such

as user profile, question length, and number of answers, extracted from the CQA website. The latter

approach was more promising for predicting the asker’s satisfaction with a given answer

than using human reviews of novelty, relevance, etc. (which were found to be highly correlated). In

the context of detecting duplicate questions, robust techniques to gauge answer quality are useful in

ranking possible duplicate answers. The results of Shah and Pomerantz also motivated us to explore

features derived from question metadata and text, rather than relying on subjective human judgments.

Ravi et al. (2014) predict question quality using latent topic models. Their work goes

beyond the traditional bag-of-words model by leveraging latent structural clarity in questions. They

use Latent Dirichlet Allocation (LDA) to generate latent topics, which are used to build a global topic

model (GTM). Better latent features can be extracted using LSTMs by leveraging time dependencies

in text, which we discuss in more detail in our approach. Rajagopal et al. (2019) leverage

semantic role labels as features in a transfer learning setting for low-resource domains.

In AmazonQA, Gupta et al. (2019) develop a new dataset from e-commerce product

reviews and set up the premise for a new reading comprehension task. They experiment with several

models to benchmark performance on the new dataset, including language models and span-based

QA models. Genetic-algorithm-based feature selection (Anirudha et al., 2014b,a,c,d) is another useful

approach for identifying the important features among a large set of available features.

2.2 Graphical Analysis of CQA Networks

Wang et al. (2013) study the popular CQA platform Quora and attempt to identify

factors behind its sustained growth and popularity. The authors compare Quora with Stack Overflow

in terms of the distribution of content and activity, concluding that the platforms are quite similar. This

observation renders the paper’s findings applicable to the Stack Overflow CQA ecosystem as well.

The paper analyses three networks (graphs) from the Quora platform: the user-topic graph, the social graph,

and the question graph. User-topic interaction is identified as an important aspect in determining user and

question popularity, as are the quality and amount of content a user has contributed to the site. The paper

also explores graph clustering techniques to detect similar questions, an approach easily extensible

to the detection of duplicate questions. The authors conclude that CQA websites face the challenge

of pushing relevant content to their user-base in the face of increased activity, and with it, increased

spam. This observation adds motivation for our proposed research on detecting duplicate

questions, since question duplication is seen to be the most common reason behind question deletion

on Stack Overflow.

2.3 Duplicate Question Detection

There is a growing body of work on detecting duplicate questions in social media QA. Ahasanuzzaman

et al. (2016) specifically look at detecting duplicate questions on the Stack

Overflow platform. Their preliminary analysis establishes sufficient motivation for detecting and

removing duplicate questions, both to improve user experience on CQA sites and

to ease maintenance of the website. The paper proposes to detect duplicates by training a discriminative

classifier on duplicate and non-duplicate question pairs, using text-based similarity features extracted

from the question title, body, tags, etc. Source code is explicitly mentioned as a feature with strong

discriminative power, but is modeled only in terms of text similarity. This aspect of the paper

motivated us to further explore the use of source code snippets in Stack Overflow to detect duplicate

questions, especially when the underlying question similarity is indiscernible simply by looking at

other features such as the question title and question tags.
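
To make the pairwise setup above concrete, the following is a minimal sketch of how per-field similarity features for a question pair might be computed before being passed to a discriminative classifier. It is an illustration under stated assumptions, not the feature set of Ahasanuzzaman et al. (2016): the function names (tokens, jaccard, pair_features), the field names, and the choice of Jaccard token overlap as the similarity measure are all hypothetical. Note that code is treated here as plain text, which is exactly the limitation discussed above.

```python
import re

# Hypothetical field names; real Stack Overflow dumps use a different schema.
FIELDS = ("title", "body", "tags", "code")


def tokens(text):
    """Lowercase word-level tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(q1, q2):
    """One similarity score per field for a pair of question dicts.

    Missing fields are treated as empty strings. Code is compared as
    text only, with no modeling of its structure.
    """
    return [
        jaccard(tokens(q1.get(field, "")), tokens(q2.get(field, "")))
        for field in FIELDS
    ]


if __name__ == "__main__":
    q1 = {"title": "How to reverse a list in Python",
          "tags": "python list",
          "code": "xs[::-1]"}
    q2 = {"title": "Reverse a list in Python",
          "tags": "python",
          "code": "list(reversed(xs))"}
    print(pair_features(q1, q2))
```

Each resulting vector, together with a duplicate or non-duplicate label, could then serve as one training example for any standard binary classifier.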
