2 Joy et al.
driven software engineering, computer vision, natural language processing, etc.
Since 2008, the development of pandas has narrowed the gap in the availability
of data analysis tools [4]. Pandas is considered the most suitable option for a
data analysis tool, as it is written in the Python programming language and is
easy for beginners to understand [5].
In recent years, the use of the pandas library has increased rapidly, as it has
reduced the gap between scientific programming languages and database
languages [17]. Pandas is used in most sectors, such as machine learning,
statistics, natural language processing, computer vision, and others. Moreover,
pandas is an open-source tool and is easy for beginners to understand. For
these reasons, most developers are now interested in using pandas in their
projects. As the library has developed and its use has grown, discussions about
pandas in online developer forums such as Stack Overflow (SO) have also
increased. Analysing these posts can yield several findings about the pandas
library, such as its popularity, difficulties, and future scope. To date, there are around
22.44 million questions posted on SO [6]. Several research works have been
conducted on SO posts in the fields of IoT [8], blockchain [9], microservices
[10], and software engineering [8]. Some research has also examined the
functionality, popularity, and development scope of the pandas library [11–13].
To the best of our knowledge, no research has used SO posts about the pandas
library to identify its topics, popularity, and scope.
In this paper, a total of 236711 SO posts whose user-defined tags are related
to pandas are analyzed to find the topics of the pandas library. For topic
modeling, Latent Dirichlet Allocation (LDA) is performed. Finally, a trend chart
is generated to show the popularity of the topics according to the discussions
on the software development forum. This empirical study presents some major
findings. First, we identify the topics and then categorize the pandas topics
that are discussed most in SO posts. According to our findings, there are
twenty-six topics in total, which can be grouped into six categories. Among the
topics, optimization is the most popular, whereas SQL queries and Matplotlib
support are the most difficult: SQL queries has the lowest score and Matplotlib
support has the lowest accepted answer rate. Second, to take a closer look at
the topics and categories, a trend chart is generated for the period from July
2011 to February 2022. Some rises and declines are seen in the trends of
individual topics, but the total number of posts has increased gradually as the
number of pandas developers has grown over time.
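The per-topic indicators mentioned above (score, accepted answer rate, and post counts over time) can be illustrated with a minimal sketch. It assumes that "score" means the mean post score per topic and that each post has already been assigned a single topic; the records, field names, and values below are hypothetical and are not data from this study.

```python
from collections import defaultdict

# Hypothetical post records: field names and values are illustrative
# assumptions, not data from the study.
posts = [
    {"topic": "optimization", "month": "2021-01", "score": 5, "accepted": True},
    {"topic": "optimization", "month": "2021-02", "score": 3, "accepted": True},
    {"topic": "sql queries", "month": "2021-01", "score": 0, "accepted": True},
    {"topic": "sql queries", "month": "2021-02", "score": 1, "accepted": False},
    {"topic": "matplotlib support", "month": "2021-02", "score": 2, "accepted": False},
    {"topic": "matplotlib support", "month": "2021-02", "score": 4, "accepted": False},
]


def topic_metrics(posts):
    """Return post count, mean score, and accepted answer rate per topic."""
    grouped = defaultdict(list)
    for post in posts:
        grouped[post["topic"]].append(post)
    metrics = {}
    for topic, items in grouped.items():
        n = len(items)
        metrics[topic] = {
            "posts": n,
            "mean_score": sum(p["score"] for p in items) / n,
            # True counts as 1, so this is the fraction of accepted posts.
            "accepted_rate": sum(p["accepted"] for p in items) / n,
        }
    return metrics


def monthly_counts(posts):
    """Count posts per (month, topic): the raw series behind a trend chart."""
    counts = defaultdict(int)
    for post in posts:
        counts[(post["month"], post["topic"])] += 1
    return dict(counts)


metrics = topic_metrics(posts)
lowest_score = min(metrics, key=lambda t: metrics[t]["mean_score"])
lowest_accept = min(metrics, key=lambda t: metrics[t]["accepted_rate"])
trend = monthly_counts(posts)
print(lowest_score, lowest_accept)
```

In this toy data, the topic with the lowest mean score and the topic with the lowest accepted answer rate differ, which is why two separate difficulty indicators can point to two different topics.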
The rest of the paper is organized as follows. Section 2 discusses the
background studies of this paper. The methodology is described in Section 3,
which covers the data collection, topic modeling, and topic naming processes.
Section 4 discusses the implications of the study, where several important
points are described. Section 5 describes threats to the validity of our
results. Section 6 describes the results of our study, and Section 7 outlines
the future scope of work based on these results. Section 8 concludes the paper.