Using Full-Text Content to Characterize and Identify Best Seller Books

Giovana D. da Silva1, Filipi N. Silva2, Henrique F. de Arruda3, Bárbara C. e Souza1, Luciano da F. Costa4 and Diego R. Amancio1

1 Institute of Mathematics and Computer Science – USP, Avenida Trabalhador São-carlense, no. 400, CEP 13566-590, São Carlos, SP, Brazil.

2 Indiana University Network Science Institute, Bloomington, Indiana, 47408, USA

3 CENTAI, Corso Inghilterra 3, 10138, Turin, Italy

4 São Carlos Institute of Physics – USP, Avenida Trabalhador São-carlense, no. 400, CEP 13566-590, São Carlos, SP, Brazil.

(Dated: May 12, 2023)

arXiv:2210.02334v2 [cs.CL] 11 May 2023

Abstract

Artistic pieces can be studied from several perspectives, one example being their reception among readers over time. In the present work, we approach this topic from the standpoint of literary works, particularly assessing the task of predicting whether a book will become a best seller. Unlike previous approaches, we focused on the full content of books and considered visualization and classification tasks. We employed visualization for the preliminary exploration of the data structure and properties, involving SemAxis and linear discriminant analyses. Then, to obtain quantitative and more objective results, we employed various classifiers. These approaches were applied to a dataset containing (i) books published from 1895 to 1924 and recognized as best sellers by the Publishers Weekly Bestseller Lists and (ii) literary works published in the same period but not mentioned in those lists. Our comparison of methods revealed that the best result, obtained by combining a bag-of-words representation with a logistic regression classifier, was an average accuracy of 0.75 for both the leave-one-out and 10-fold cross-validations. This outcome suggests that it is infeasible to predict the success of books with high accuracy using only the full content of the texts. Nevertheless, our findings provide insights into the factors leading to the relative success of a literary work.

I. INTRODUCTION

Understanding the factors and reasons determining the effectiveness and acceptance of given pieces of artistic or scientific work represents a continuing challenge in artificial intelligence (e.g., [4, 7, 16, 28, 30]). As is often the case with complex systems, not only is a large number of possible factors potentially involved, but their individual and combined effects also tend to be highly non-linear. Small effects can therefore lead to considerable impacts, which are also likely to vary over time and space in ways that are hard to predict. Among the aspects most likely to influence the visibility and accomplishment of an artistic piece are its intrinsic quality, its innovation, and its affinity with the main trends, interests, and expectations predominating in a given period and place. These three aspects are not only challenging to define but even more so to predict, which has motivated growing interest from the scientific community (e.g., [32]).

A better understanding of why an artistic piece becomes successful is a particularly interesting objective for several reasons: (i) this type of study can motivate the development of new concepts and methods capable of quantifying the three main aspects identified above, namely the quality, innovation, and affinity of an artistic piece; (ii) this kind of research has great potential for revealing important aspects of the mechanisms underlying human preferences for specific subjects and styles over time and space; (iii) such developments can lead to strategies for predicting the acceptance of certain types of works, which may provide support and motivation for developing new and more effective artistic pieces.

The present work aims at studying whether it is feasible to characterize and identify stories and narratives listed as best sellers by combining full-text content information and machine learning models. To this end, the textual content of a set of books was modeled, and a series of experiments assessed the possibility of automatically differentiating a best seller from an ordinary book. In particular, we employed a dataset encompassing the full-text content of literary works collected from the Project Gutenberg platform. The dataset was split into two categories: success (books that appear at least once in the Publishers Weekly Bestseller Lists) and others. After a preprocessing step (removal of stopwords, lemmatization, and tokenization), the content of each book was represented as an embedding by using the bag-of-words [17] and doc2vec [15] approaches. Finally, we employed different strategies to assess the prediction of the success of books in terms of their embedding representations, including: (i) visualization approaches, namely the linear discriminant analysis (LDA) [12] and SemAxis [2] techniques; (ii) classification approaches, encompassing different models and cross-validation strategies.
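The pipeline outlined above (preprocessing, bag-of-words representation, logistic regression, and leave-one-out cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the four toy texts, their labels, the stopword list, the use of relative term frequencies, and the plain gradient-descent training loop are all assumptions made for illustration, and lemmatization is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "is"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords (no lemmatization)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def bag_of_words(docs):
    """Return (vocabulary, rows of relative term frequencies), one row per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = []
    for c in counts:
        total = sum(c.values()) or 1
        rows.append([c[w] / total for w in vocab])
    return vocab, rows

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights and bias by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(model, x):
    """Return 1 (best seller) if the linear score is non-negative, else 0."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0

def leave_one_out_accuracy(X, y):
    """Train on all documents but one, test on the held-out one, and average."""
    hits = 0
    for i in range(len(X)):
        model = train_logreg(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)

# Toy corpus (invented): label 1 = "success", label 0 = "others".
docs = [
    "The duke loved the castle",
    "A castle and a duke",
    "The engine of the ship",
    "A ship engine failed",
]
labels = [1, 1, 0, 0]

vocab, X = bag_of_words(docs)
accuracy = leave_one_out_accuracy(X, labels)
print("vocabulary:", vocab)
print("leave-one-out accuracy:", accuracy)
```

In this sketch the vocabulary is built from all documents before splitting, which uses no label information. A 10-fold variant would hold out one tenth of the documents per round instead of a single document.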

In contrast to previous studies, here we rely on one of the prime published sources of best-seller book lists, namely the Publishers Weekly Bestseller Lists, which comprise the best-selling books of every year since 1885. Although the criteria used to define a book as an absolute success are not entirely specified, it is established that every considered paperbound book sold at least 2,000,000 copies, and every selected hardbound book sold 750,000 copies or more. It is also established that Publishers Weekly only regards books distributed through the trade (that is, bookstores and libraries), not including those sold by mail or book clubs [14]. In addition, our work adds to the short list of studies that investigated success by analyzing the full-text content of books, subsequently modeling it through embeddings, and analyzing it both qualitatively (applying visualization and searching for words that lead to discrimination) and quantitatively (involving supervised classifiers).

The obtained results suggest that it is infeasible to predict the success of a literary work with high accuracy by using only its full-text content. The best classification accuracy reached 0.75, obtained by combining a bag-of-words representation with a logistic regression model, which is a moderate outcome. Nonetheless, our experiments indicate that the subject of the books does not seem to be a core factor for a title becoming a best seller and that there are words more typically found in this category of books.

This work is organized as follows. Section II presents and discusses the related works. In Section III, we present the research questions. Section IV describes the datasets used. Section V describes the methodology adopted to analyze the books, including text preprocessing, representation, visualization, and classification. The results and discussions are reported in Section VI. Finally, in Section VII, we present the conclusions and future work.

II. RELATED WORKS

The study conducted in [34] analyzed the success of books using as reference The New York Times Best Sellers, which includes a list of best-selling books in the United States. The authors considered the books appearing on the list between August 2008 and March 2016. As additional information, the sales patterns of books were also considered by using data from NPD BookScan [34]. Several interesting results were reported. Fiction books were found to be more likely to become best sellers, while nonfiction books tended to be sold with lower intensity. The authors also proposed a model that can accurately measure long-term impact since it can predict the number of copies sold by best sellers shortly after their release. The proposed description was found to be consistent with a previous model devised to describe the attention received by scientific papers [30]. The authors argue, therefore, that the underlying processes of attention are similar, despite the differences in time scale.

A model to predict book sales was proposed in [32]. The authors used NPD BookScan as a dataset, focusing on a list of the 10 thousand top-selling books in a given period. A machine learning approach was proposed using different book features. Authors' visibility was taken into account by measuring the public interest in authors via Wikipedia page views. Previous sales were also considered as a feature to measure the previous success of authors. Book features included genre (e.g., horror and science fiction) and topic information (as provided by readers). In addition, publishers' information was used. All features were combined in the so-called Learning to Place (L2P) machine learning algorithm [31], which aims at placing a new instance (i.e., predicting book sales) within a sequence of previously published books. This study found that, for both fiction and nonfiction books, the publisher quality tends to play an important role in the prediction. The visibility of authors was also found to be an important feature, as more visible authors are more likely to sell more copies. Finally, the other factors related to the text content itself (e.g., genre and topic information) were found to play a relatively minor role in the prediction model.

Unlike previous works that did not take into account the textual content [32, 34], the relevance of writing style was analyzed in [3]. The authors analyzed full books from different genres (e.g., adventure, mystery, fiction). The dataset was collected from the Project Gutenberg repository. Several linguistic markers of writing style were used to characterize the texts. Examples include lexical features, the distribution of grammar rules, and sentiment analysis. The authors used an SVM as the classifier [9], and download counts were used as a surrogate for the visibility of books. Additional information such as award recipients and the number of copies sold was also used to quantify success. The authors concluded that the stylistic metrics used are effective for quantifying the success of novels.

Because only a few works have analyzed the content of books to predict if they will
