An Empirical Study on How the Developers
Discussed about Pandas Topics
Sajib Kumar Saha Joy*, Farzad Ahmed*, Al Hasib Mahamud*, and Nibir
Chandra Mandal
Ahsanullah University of Science and Technology, Bangladesh
{joyjft, farzadahmed6}@gmail.com, {hasib, nibir}.cse@aust.edu
Abstract. Pandas is a software library for data analysis in the Python
programming language. Because pandas is a fast, easy-to-use, open source
data analysis tool, it is widely used in software engineering projects
involving software development, machine learning, computer vision,
natural language processing, robotics, and others. Software developers
therefore show great interest in pandas, and discussions about it have
become prominent in online developer forums such as Stack Overflow (SO).
Such discussions can help to understand the popularity of the pandas
library as well as the importance, prevalence, and difficulty of pandas
topics. The main aim of this paper is to determine the popularity and
difficulty of pandas topics. To this end, we collected SO posts related
to pandas and applied topic modeling to their textual content. We found
26 topics, which we further grouped into 5 broad categories. We observed
that developers discuss a variety of pandas topics on SO, related to
error and exception handling, visualization, external support,
dataframes, and optimization. In addition, a trend chart is generated
showing how discussion of the topics evolved over a predefined time
period. The findings of this paper can guide developers, educators, and
learners. For example, beginner developers can learn the most important
pandas topics, which are essential for developing any model. Educators
can identify the topics that learners find hard and build tutorials that
make those topics understandable. This empirical study shows that
developers' preferences among pandas topics can be understood by
processing their SO posts.
Keywords: Pandas · Stack Overflow · Natural Language Processing ·
Empirical Software Engineering.
1 Introduction
Pandas is an open source Python package that offers data manipulation and
analysis in the Python programming language. Because pandas offers high
performance and productivity for users, its data analysis capability is
utilized in different areas of computer science such as data visualization,
machine learning, data driven software engineering, computer vision, natural
language processing, etc. Since 2008, the development of pandas has helped
close the gap in the availability of data analysis tools [4]. Pandas is
considered the most suitable option for a data analysis tool, as it is written
in the Python programming language and is easy for beginners to understand [5].
* The first three authors contributed equally to this research.
arXiv:2210.03519v2 [cs.SE] 10 May 2023
In recent years, the use of the pandas library has increased rapidly, as it
has reduced the gap between scientific programming languages and database
languages [17]. Pandas is used in most areas, such as machine learning,
statistics, natural language processing, computer vision, and others.
Moreover, the library is easy for a beginner to understand and is an open
source tool. For these reasons, most developers are now interested in using
pandas in their projects. As the library has developed and its use has grown,
discussions about pandas in online developer forums such as Stack Overflow
(SO) have increased. Analysing these posts can yield several findings about
the pandas library, such as its popularity, difficulties, and future scope.
To date, around 22.44 million questions have been posted on SO [6]. Several
research works have been conducted based on SO posts in the fields of IoT [8],
blockchain [9], microservices [10], and software engineering [8]. Some
research works have also studied the functionality, popularity, and scope for
development of the pandas library [11–13]. To the best of our knowledge, no
research work has used SO posts about the pandas library to find its topics,
popularity, and scope.
In this paper, a total of 236,711 SO posts whose user-defined tags are related
to pandas are analyzed to find the topics of the pandas library. For topic
modeling, Latent Dirichlet Allocation (LDA) is performed. Finally, a trend
chart is generated to show the popularity of the topics according to the
discussions in the developer forum. This empirical study reports some major
findings. First, we identified the pandas topics that are discussed most in
SO posts and grouped them into categories. According to our findings, there
are twenty-six topics in total, and these can be grouped into six categories.
Among the topics, optimization is the most popular, whereas SQL queries and
Matplotlib support are the most difficult: SQL queries has the lowest score
and Matplotlib support has the lowest accepted answer rate. Second, to take a
closer look at the topics and categories, a trend chart is generated for the
period from July 2011 to February 2022. The trend chart shows some declines
and rises for individual topics, but the total number of posts increased
gradually as the number of pandas developers grew over time.
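The measures named in this paragraph (score, accepted answer rate, and posts
per month) can be illustrated with a minimal sketch. It is not the authors'
pipeline: the record fields (`tags`, `topic`, `score`, `has_accepted_answer`,
`created`), the sample rows, and the exact-match tag filter are all
assumptions made for illustration, and only the Python standard library is
used.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import mean

# Hypothetical sample records; field names are assumptions, not the SO schema.
posts = [
    {"tags": ["python", "pandas"], "topic": "optimization", "score": 5,
     "has_accepted_answer": True, "created": date(2021, 3, 14)},
    {"tags": ["pandas", "matplotlib"], "topic": "matplotlib support", "score": 1,
     "has_accepted_answer": False, "created": date(2021, 3, 20)},
    {"tags": ["pandas", "sql"], "topic": "sql queries", "score": 0,
     "has_accepted_answer": True, "created": date(2021, 4, 2)},
    {"tags": ["java"], "topic": None, "score": 9,
     "has_accepted_answer": True, "created": date(2021, 4, 5)},
    {"tags": ["pandas"], "topic": "optimization", "score": 3,
     "has_accepted_answer": False, "created": date(2021, 4, 9)},
]


def select_pandas_posts(records):
    """Keep only posts whose user-defined tags include 'pandas' (assumed filter)."""
    return [p for p in records if "pandas" in p["tags"]]


def topic_metrics(records):
    """Per topic: mean score and share of posts with an accepted answer."""
    by_topic = defaultdict(list)
    for p in records:
        by_topic[p["topic"]].append(p)
    return {
        topic: {
            "mean_score": mean(p["score"] for p in group),
            "accepted_rate": sum(p["has_accepted_answer"] for p in group) / len(group),
        }
        for topic, group in by_topic.items()
    }


def monthly_trend(records):
    """Count posts per (year, month, topic); these counts feed a trend chart."""
    return Counter((p["created"].year, p["created"].month, p["topic"]) for p in records)


pandas_posts = select_pandas_posts(posts)
metrics = topic_metrics(pandas_posts)
trend = monthly_trend(pandas_posts)
```

Under this reading, a topic with a low mean score or a low accepted answer
rate would be flagged as difficult, and the monthly counts give the series
plotted in a trend chart.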
The rest of the paper is organized as follows. Section 2 discusses the
background studies of this paper. Section 3 describes the methodology,
covering data collection, topic modeling, and the topic naming process.
Section 4 discusses the implications of the study, where several important
observations are described. Section 5 describes threats to the validity of
our results. Section 6 describes the results of our study, and Section 7
discusses future scope for work based on these results. Section 8 concludes
the paper.
2 Background Studies
2.1 Stack Overflow
Stack Overflow (SO) is a question-and-answer site that has recently become
popular as a forum for software developers. It is one of several sites where
programmers can ask questions, answer others' questions, and share ideas [14].
2.2 Topic Modeling
Topic modeling can be defined as an unsupervised classification method which
is similar to clustering and is usually used for finding groups of items
[16]. Though topic modeling is mostly used for textual data, it can also be
used for bioinformatics data, social science data, and other sources of data
[10].
To apply topic modeling to the SO posts related to pandas, Latent Dirichlet
Allocation (LDA) is used in this paper. In LDA, each document is considered
a mixture of topics and each topic a mixture of words [16]. Topic modeling is
used in papers from various fields that utilize SO posts, especially in
software engineering [8].
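The view of a document as a mixture of topics, and of a topic as a
distribution over words, can be made concrete with a small sketch of LDA's
generative assumptions. This only simulates how LDA assumes documents arise;
it does not fit a model to data, which is the inverse problem solved when
topics are inferred from posts. The vocabulary, the two topics, and the
Dirichlet parameter below are hypothetical toy values.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document under LDA's generative assumptions.

    topic_word[k] is topic k's probability distribution over vocab.
    """
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to the mixture.
        k = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return theta, words


# Toy vocabulary and two hand-written topics (hypothetical values).
vocab = ["dataframe", "merge", "plot", "axis"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a dataframe-like topic
    [0.0, 0.0, 0.5, 0.5],  # a visualization-like topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
```

Here `theta` plays the role of the per-document topic proportions, and each
row of `topic_word` plays the role of a topic's word distribution; fitting
LDA means estimating both from observed documents.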
2.3 Pandas
Pandas is an open source data analysis tool built on the Python programming
language, and it is considered a fast, flexible, and easy-to-use tool [16].
Pandas works with structured datasets and can be leveraged in social science,
statistics, finance, and other fields [17]. Since development of pandas began
in 2008, its main aim has been to close the gap between scientific
programming languages and database languages [17].
2.4 Related Works
There are several works related to LDA topic modeling [8,18–23]. Uddin et
al. [8] conducted an empirical study of IoT discussions in SO posts. For this
purpose, the authors gathered IoT posts from SO and leveraged LDA to perform
topic modeling. Four research questions are answered to find the discussions,
evolution, question types, popularity, and difficulties of IoT topics.
Stephen W. Thomas [18] leveraged statistical topic modeling in mining
software repositories to analyze unstructured and unlabeled data and found
structure in the textual repositories. Bavota et al. [19] identified
opportunities for Move Method refactoring and removed Feature Envy bad smells
from source code, using Relational Topic Models for this purpose. Nabil et
al. [20] applied topic modeling to cloud computing, leveraging LDA to
discover efficient cloud services. To find the research trends, methodology
and fields of further research in blockchain technology, Shahid et al. [21] have