Using Full-Text Content to Characterize and Identify Best Seller
Books
Giovana D. da Silva1, Filipi N. Silva2, Henrique F. de Arruda3,
Barbara C. e Souza1, Luciano da F. Costa4 and Diego R. Amancio1
1Institute of Mathematics and Computer Science – USP,
Avenida Trabalhador São-carlense,
nº 400, CEP 13566-590,
São Carlos, SP, Brazil.
2Indiana University Network Science Institute,
Bloomington, Indiana, 47408, USA
3CENTAI, Corso Inghilterra 3,
10138, Turin, Italy
4São Carlos Institute of Physics – USP,
Avenida Trabalhador São-carlense,
nº 400, CEP 13566-590,
São Carlos, SP, Brazil.
(Dated: May 12, 2023)
arXiv:2210.02334v2 [cs.CL] 11 May 2023
Abstract
Artistic pieces can be studied from several perspectives, one example being their reception among
readers over time. In the present work, we approach this interesting topic from the standpoint of
literary works, particularly assessing the task of predicting whether a book will become a best seller.
Unlike previous approaches, we focused on the full content of books and considered
visualization and classification tasks. We employed visualization for the preliminary exploration of
the data structure and properties, involving SemAxis and linear discriminant analyses. Then, to
obtain quantitative and more objective results, we employed various classifiers. Such approaches
were used along with a dataset containing (i) books published from 1895 to 1924 and recognized
as best sellers by the Publishers Weekly Bestseller Lists and (ii) literary works published in the
same period but not mentioned in those lists. Our comparison of methods revealed that the
best result, obtained by combining a bag-of-words representation with a logistic regression classifier,
was an average accuracy of 0.75 for both the leave-one-out and 10-fold cross-validations. Such
an outcome suggests that it is infeasible to predict the success of books with high accuracy using
only the full content of the texts. Nevertheless, our findings provide insights into the factors leading
to the relative success of a literary work.
I. INTRODUCTION
Understanding the factors and reasons determining the effectiveness and acceptance of
given pieces of artistic or scientific work represents a continuing challenge in artificial intel-
ligence (e.g., [4, 7, 16, 28, 30]). As is often the case with complex systems, not only a large
number of possible factors is potentially involved, but their individual and combined effects
also tend to be highly non-linear. As a result, small effects can lead to considerable
impacts, which are also likely to vary over time and space in ways that are hard to predict.
Among the aspects most likely to influence the visibility and accomplishment
of an artistic piece are its intrinsic quality, innovation, and affinity with the main
trends, interests, and expectations predominating in a given period and place. All
three aspects are challenging not only to define but even more so to predict, which
has motivated growing interest from the scientific community (e.g., [32]).
A better understanding of why an artistic piece becomes successful constitutes
a particularly interesting objective for several reasons: (i) this type of study
can motivate the development of new concepts and methods capable of quantifying the three
main aspects identified above, namely the quality, innovation, and affinity of an artistic piece;
(ii) this kind of research has great potential for revealing important aspects of the mechanisms
underlying human preferences for specific subjects and styles over time and space;
(iii) such developments can lead to strategies for predicting the acceptance of certain types
of works, which may provide support and motivation for developing new and more effective
artistic pieces.
The present work aims at studying whether it is feasible to characterize and identify
stories and narratives listed as best sellers by combining full-text content information and
machine learning models. In this regard, the textual content of a set of books was modeled,
and a series of experiments assessed the possibility of automatically differentiating a best
seller from an ordinary book. In particular, we employed a dataset encompassing the full-text
content of literary works collected from the Project Gutenberg platform. The dataset was
split into two categories: success (books that appear at least once in the Publishers Weekly
Bestseller Lists) and others. After applying a preprocessing step (removal of stopwords,
lemmatization, and tokenization), the content of each book was represented as a
vector (embedding) by using the bag-of-words [17] and doc2vec [15] approaches.
Finally, we employed different strategies to assess the prediction of the success of books in
terms of their embedding representations, including: (i) visualization approaches, namely
the linear discriminant analysis (LDA) [12] and SemAxis [2] techniques; (ii) classification
approaches, encompassing different models and cross-validation strategies.
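The preprocessing, bag-of-words, and SemAxis steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword list, the toy texts, and all function names are hypothetical, lemmatization is omitted because it needs an external lemmatizer, and using the two book classes as the poles of the semantic axis is an assumption about one way SemAxis could be applied to document vectors.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a full list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "is", "her", "his"}


def preprocess(text):
    """Lowercase, tokenize on letter runs, and drop stopwords (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocab = sorted({t for doc in docs for t in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semaxis_scores(vectors, positive, negative):
    """Score each vector by its cosine with the axis (positive pole minus negative pole)."""
    pos = mean_vector([vectors[i] for i in positive])
    neg = mean_vector([vectors[i] for i in negative])
    axis = [p - q for p, q in zip(pos, neg)]
    return [cosine(v, axis) for v in vectors]


# Invented toy texts, not taken from the paper's dataset.
texts = [
    "The love of her life was lost in the war",    # toy "success" book
    "Love and war and the heart of a hero",        # toy "success" book
    "A treatise on the anatomy of the frog",       # toy "other" book
    "The anatomy and habits of the common frog",   # toy "other" book
]
docs = [preprocess(t) for t in texts]
vocab, vectors = bag_of_words(docs)
scores = semaxis_scores(vectors, positive=[0, 1], negative=[2, 3])
for text, score in zip(texts, scores):
    print(f"{score:+.2f}  {text}")
```

In this sketch, books closer to the "success" pole receive positive scores and books closer to the other pole receive negative ones, which is the kind of one-dimensional ordering that a visualization can then display.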
In contrast to previous studies, here we rely on one of the prime published sources of
best-seller book lists, namely the Publishers Weekly Bestseller Lists, which comprise the
best-selling books every year since 1885. Although its criteria for defining a book as an absolute
success are not entirely specified, it is established that every considered paperbound book sold
at least 2,000,000 copies, and every selected hardbound book sold 750,000 copies or more.
It is also established that Publishers Weekly only regards books distributed through the trade
(that is, bookstores and libraries), not including those sold by mail or book clubs [14].
Besides that, our work adds to the few studies that have examined success factors by
analyzing the full-text content of books, subsequently modeling it through embeddings,
and analyzing it both qualitatively (applying visualization and searching for words that lead
to discrimination) and quantitatively (involving supervised classifiers).
The obtained results suggest that it is infeasible to predict the success of a literary work
with high accuracy by using only its full-text content. The best classification accuracy
obtained was 0.75, combining a bag-of-words representation with a
logistic regression model, which is a moderate outcome. Nonetheless, our experiments
indicate that the subject of the books does not seem to be a core factor for a title becoming
a best seller and that there are words more typically found in this category of books.
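To make the evaluation protocol concrete, the following sketch trains a logistic regression on bag-of-words count vectors and estimates accuracy with leave-one-out cross-validation. It is only an illustration under stated assumptions: the classifier is a from-scratch gradient-descent stand-in rather than the authors' implementation, the hyperparameters (`lr`, `epochs`, `l2`) are arbitrary, and the six count vectors are invented toy data that are far easier to separate than real books.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.1, epochs=300, l2=0.01):
    """Fit weights and bias by batch gradient descent on the L2-regularized log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (best seller) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def leave_one_out_accuracy(X, y):
    """Hold out each sample once, train on the rest, and average the hits."""
    hits = 0
    for i in range(len(X)):
        w, b = train_logreg(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += int(predict(w, b, X[i]) == y[i])
    return hits / len(X)


# Invented toy counts over a hypothetical vocabulary: love, war, frog, anatomy.
X = [
    [3, 2, 0, 0], [2, 3, 0, 0], [3, 3, 0, 0],   # toy "success" books
    [0, 0, 3, 2], [0, 0, 2, 3], [0, 0, 3, 3],   # toy "other" books
]
y = [1, 1, 1, 0, 0, 0]

accuracy = leave_one_out_accuracy(X, y)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

The 10-fold variant follows the same pattern, holding out one tenth of the books at a time instead of a single book.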
This work is organized as follows. Section II presents and discusses the related works.
In Section III, we present the research questions. Section IV describes the datasets used.
Section V describes the methodology adopted to analyze the books, including text preprocessing,
representation, visualization, and classification. The results and discussions are
reported in Section VI. Finally, in Section VII, we present the conclusions and future work.
II. RELATED WORKS
The study conducted in [34] analyzed the success of books using as a reference The New
York Times Best Sellers, which includes a list of best-selling books in the United States. The
authors considered the books appearing on the list between August 2008 and March 2016.
As additional information, the sales patterns of books were also considered by using data
from NPD BookScan [34]. Several interesting results were reported. Fiction books were
found to be more likely to become best sellers, while nonfiction books tended to be sold
with lower intensity. The authors also proposed a model that can accurately measure long-term
impact since it can predict the number of copies sold by best sellers shortly after their
release. The proposed description was found to be consistent with a previous model devised
to describe the attention received by scientific papers [30]. The authors argue, therefore,
that the underlying processes of attention are similar – despite the differences in time scale.
A model to predict book sales was proposed in [32]. The authors used NPD BookScan as
their dataset, focusing on a list of the 10 thousand top-selling books in a given period. A
machine learning approach was proposed using different book features. Authors’ visibility
was taken into account by measuring the public interest in authors via Wikipedia page
views. Previous sales were also considered as a feature to measure the previous success of
authors. Book features included genre (e.g., horror and science fiction) and topic information
(as provided by readers). In addition, publishers’ information was used. All features were
combined in the so-called Learning to Place (L2P) machine learning algorithm [31], which
aims at classifying a new instance (i.e., predicting book sales) within a sequence of previously
published books. This study found that, for both fiction and nonfiction books, publisher quality
tends to play an important role in the prediction. The visibility of authors was also found
to be an important feature, as more visible authors are more likely to sell more
copies. Finally, the other factors related to the text content itself (e.g., genre and topic
information) were found to play a relatively minor role in the prediction model.
Unlike previous works that did not take into account the textual content [32, 34],
the relevance of writing style was analyzed in [3]. The authors analyzed full books from
different genres (e.g., adventure, mystery, fiction). The dataset was collected from the Project
Gutenberg repository. Several linguistic markers of writing style were used to characterize
the texts. Examples include lexical features, the distribution of grammar rules, and sentiment
analysis. The authors used an SVM as the classifier [9], and download counts were used as a
surrogate for the visibility of books. Additional information such as award recipients and
the number of copies sold was also used to quantify success. The authors concluded that
the stylistic metrics used are effective in quantifying the success of novels.
Because only a few works have analyzed the content of books to predict if they will