Doing data science with platforms crumbs: an
investigation into fakes views on YouTube
Maria Castaldo1,*, Paolo Frasca1, Tommaso Venturini2,4, and
Floriana Gargiulo3
1Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, GIPSA-lab, F-38000 Grenoble,
France
2CIS, CNRS, 59 Rue Pouchet, 75017 Paris, France
3Gemass, CNRS, 59 Rue Pouchet, 75017 Paris, France
4University of Geneva, Switzerland
*Corresponding Author: maria.castaldo@grenoble-inp.fr
Abstract
This paper contributes to the ongoing discussion on scholarly access
to social media data, examining a case where this access is barred despite its
value for understanding and countering online disinformation and despite
the absence of privacy or copyright issues. Our study concerns YouTube’s
engagement metrics and, more specifically, the way in which the platform
removes "fake views" (i.e., views considered artificial or illegitimate by
the platform). Working with a year and a half of data extracted from a
thousand French YouTube channels, we show the massive extent of this
phenomenon, which concerns the large majority of the channels and more
than half the videos in our corpus. Our analysis indicates that most fake
views are corrected relatively late in the life of a video and that the final
view counts of videos are not independent of the fake views they
received. We discuss the potential harm that delays in correction could
produce in content diffusion: by inflating view counts, illegitimate views
could make a video appear more popular than it is and unwarrantedly
encourage its human and algorithmic recommendation. Unfortunately, we
cannot offer a definitive assessment of this phenomenon, because YouTube
provides no information on fake views in its API or interface. This paper
is, therefore, also a call for greater transparency by YouTube and other
online platforms about information that can have crucial implications for
the quality of online public debate.
1 Introduction
Fake views, real trends
arXiv:2210.01096v1 [cs.SI] 28 Sep 2022

"We want to make sure that videos are viewed by actual humans and not
computer programs" [25]. As stated on its official web pages, YouTube has a
strong view count policy and has not been afraid to enforce it. In December
2012, the platform deleted 2 billion views from the channels of record companies
such as Universal and Sony [22] [26] [2] [19]. Over the years, countless YouTubers
have suffered sudden and drastic cuts to their views (and many have complained
about it, often through YouTube videos). According to YouTube’s policies [24]
[35] [25], these interventions aim to preserve a “meaningful human interaction
on the platform” and to oppose “anything that artificially increases the number
of views, likes, comments or other metric either through the use of automatic
systems or by serving up videos to unsuspecting viewers” [24].
Despite the media interest in the phenomenon [27] [38], not much research
has been carried out on the implementation of this policy. The general lack of
studies on the subject is partly explained by the fact that, since 2016, YouTube
has largely restricted access to its data through its API, making researchers’
work more difficult. To the best of our knowledge, the only previous work concerning
view count corrections is that of Marciel et al. [32] in 2016. This paper studies
the phenomenon of view corrections in relation to video monetization, in order to identify
possible frauds, drawing on research carried out on ad fraud in other social
media [14] [34]. In their work, Marciel et al. created some sample YouTube
channels and inflated their views through bots. Strikingly, they found that
“YouTube monetizes (almost) all the fake views” generated by the authors, while
it “detects them more accurately when videos are not monetized”.
Although we consider this investigation into the correlation between monetization
and view corrections a useful first step toward understanding YouTube’s
policy, we believe that some other pressing questions should be addressed by
the scientific community. For instance, can fake views have an impact on the
success of a video and be used to manipulate YouTube’s attention cycle? It is
well known that, on social media, future visibility is highly dependent on past
popularity, as trending content tends to be favored by human influencers [42] and
recommendation algorithms [23], both of which are highly sensitive to trendiness
metrics [49]. On YouTube in particular, the recommendation engine represents
the most important source of views [56] and, as admitted by its developers,
“in addition to the first-order effect of simply recommending new videos that
users want to watch, [has] a critical secondary phenomenon of bootstrapping
and propagating viral content” [18]. Quite deliberately, YouTube’s algorithm
creates a positive feedback that skews visibility according to a rich-get-richer
dynamic [5] [36] [46]. As acknowledged by YouTube engineers: “models trained
using data generated from the current system will be biased, causing a feedback
loop effect. How to effectively and efficiently learn to reduce such biases is an
open question” [55].
This is where fake views come into play. Indeed, if the correction of illegitimate
views happens too late, these views have the potential to weigh in
the cycle of trendiness [12] and unfairly propel their targets. If YouTube’s fake
view correction is significantly slower than its recommendation dynamics, then
artificially promoted videos risk being favored by human and algorithmic recommendations,
and thus reaching larger audiences and collecting extra real views. If,
before being deleted, fake views are able to trigger a cascade effect that increases
the visibility of some content, then they may be used to manipulate online debate.
Not unlike social bots [21] and paid commentators [28], fake views could give
the false impression that some content is highly popular and endorsed by many,
thus distorting public debate and ultimately endangering democratic processes
[54] [39] [33] [4] [31] [3] [41] [30].
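The rich-get-richer mechanism behind this concern can be illustrated with a minimal toy simulation (our illustration, not a model from this paper): each new viewer picks a video with probability proportional to its current view count, so a video seeded with fake views keeps outpacing its peers even after the fakes are subtracted.

```python
import random

def simulate_views(n_videos=100, steps=5_000, fake_boost=0, seed=0):
    """Toy rich-get-richer dynamic: each new (real) viewer picks a video
    with probability proportional to its current view count."""
    rng = random.Random(seed)
    views = [1] * n_videos       # every video starts with one view
    views[0] += fake_boost       # inject fake views into video 0
    for _ in range(steps):
        i = rng.choices(range(n_videos), weights=views)[0]
        views[i] += 1
    return views

organic = simulate_views()                 # no manipulation
boosted = simulate_views(fake_boost=500)   # video 0 starts inflated
# Subtracting the 500 fakes afterwards still leaves video 0 with far
# more *real* views than it collects in the organic run.
real_views_of_boosted = boosted[0] - 500
```

In this caricature, a late correction removes the fake views themselves but not the extra real views they attracted, which is precisely the harm discussed above.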
Approach and tentative results
While much research has been dedicated to identifying social bots [40] [7] [44] [48]
and analyzing artificially produced content [50] [20] [52] [8] [51] [16] [17] [45] [29],
no attention has been devoted to the role played by fake views in content diffusion.
This work addresses this open and urgent question by analysing YouTube data
at an unprecedented temporal granularity. As detailed below, we collected two
datasets from more than a thousand French ‘politics and news’ channels. The
first dataset is a 17-month collection of the hourly view counts of over 270,000
videos. The second dataset collects the view evolution of a thousand videos, but
does so every five minutes. These datasets and their combined analysis allow us
to examine, for the first time, the timing of YouTube’s fake view corrections and
to raise concerns about their consequences.
In summary, our analysis has led to three main findings. Our first finding
concerns the remarkable size of the phenomenon: we identify fake view
corrections in almost all the channels of our larger corpus and in more than half of
the videos. Our second finding is that the rhythms of fake view corrections are
inconsistent with those of view production, suggesting that significant
delays can occur between the generation of fake views and their correction.
Our third finding is the existence of a correlation between fake and real views:
videos with more fake views also collect a larger number of legitimate views.
1.1 Disclaimer
Despite our best efforts, our analysis cannot provide definitive proof of the
influence of fake views, because YouTube offers very little temporal information
about fake and real views. At any given moment, we can retrieve the total
number of views collected until then by a video, but not the history of their
evolution – forcing us to collect the view count at regular intervals. Even more
frustrating, through its interface or API, YouTube offers no information about
fake view removals. This information, it is worth pointing out, does not raise
any particular security or privacy issue – if the real view count is revealed,
why not the fake one? The fake view count would provide a crucial cue
for fact-checkers, journalists and scholars to identify possible shady operations
around (although not necessarily by) a video or a channel, and yet YouTube
chooses to hide this information, which it could easily make available. Like many
of its peers, the platform shares its data only to the extent that doing so helps its
business model. Like it or not, however, online platforms have become more than
commercial players, and the influence they have acquired on public debate should
bind them to greater transparency. We wish we could relegate our methodological
troubles to a footnote and focus on the results. Unfortunately, YouTube does
not make this possible. This paper is thus also a way to share with the research
community a few of the workarounds we developed to study attention cycles despite
YouTube’s opacity, and in particular a method to reconstruct fake view counts
through frequent monitoring and machine learning techniques.
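To make the monitoring workaround concrete: since the Data API’s `videos.list` endpoint (with `part=statistics`) returns only the current cumulative `viewCount`, a time series can only be assembled by polling it at regular intervals. The sketch below shows the request construction and response parsing; the helper names and the sample response are ours, not part of any official client.

```python
import json
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/videos"

def view_count_request(video_ids, api_key):
    """Build a videos.list request for the current view counts.
    Polling this URL at a fixed interval is the only way to obtain a
    time series, since the API exposes no view-count history."""
    params = {"part": "statistics", "id": ",".join(video_ids), "key": api_key}
    return API_URL + "?" + urlencode(params)

def parse_view_counts(response_body):
    """Extract {video_id: view_count} from a videos.list JSON response.
    The API returns statistics values as strings, hence the int()."""
    data = json.loads(response_body)
    return {item["id"]: int(item["statistics"]["viewCount"])
            for item in data.get("items", [])}

# a hypothetical (abridged) response body for one video
sample = '{"items": [{"id": "abc123", "statistics": {"viewCount": "1024"}}]}'
print(parse_view_counts(sample))  # {'abc123': 1024}
```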
2 Results
Our results are mainly based on the analysis of a dataset which records, hour by
hour and for 17 months, the number of views collected by 270,133 videos published
by 1,064 French channels dealing with news and politics. The limitations
imposed by the YouTube API made it impossible to collect data more frequently
(or to monitor videos more than a week after their publication). The hourly
frequency, of course, leads to a significant loss of information. As no direct
information is provided about view corrections, we can only estimate them by
observing the hours with a negative delta in the view count, i.e. the hours in
which a video loses more views than it gains. In the hours in which the number
of new views exceeds the number of removed views, the platform’s interventions
become invisible. For this reason, we did not draw our conclusions from the
direct analysis of this dataset: instead, we developed a machine learning
method to discover these hidden corrections. This method is described in detail
in the Data and Methods section.
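The negative-delta heuristic described above can be sketched as follows (an illustrative re-implementation, not the paper’s code):

```python
def detect_visible_corrections(hourly_views):
    """Given cumulative hourly view counts for one video, return the
    hours where the count decreased, i.e. where removed views visibly
    outnumbered new views. Corrections masked by a larger inflow of
    new views within the same hour cannot be detected this way."""
    corrections = []
    for hour in range(1, len(hourly_views)):
        delta = hourly_views[hour] - hourly_views[hour - 1]
        if delta < 0:
            corrections.append((hour, -delta))  # (hour, views removed)
    return corrections

# cumulative counts: +120, +80, then a net loss of 30 views, then +40
series = [0, 120, 200, 170, 210]
print(detect_visible_corrections(series))  # [(3, 30)]
```

Note that the heuristic only yields a lower bound on the removed views, which is why a reconstruction method is needed on top of it.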
In the Data and Methods section, we first give a quantitative estimation
of the amount of information lost due to the hourly frequency of collection,
in comparison with a 5-minute frequency. Then, we propose two methods to
reconstruct the number of YouTube corrections (a Benchmark Method, based
on a heuristic procedure, and the Reconstruction model, a machine learning
algorithm) and we compare their performance. The Reconstruction model,
which better reduces the information loss, is finally applied to all the
original time series to reconstruct the number of views removed by YouTube but
not directly visible in the raw dataset. In the rest of this section, we present
the results obtained from the analysis of the data reconstructed in this way.
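To illustrate why coarser sampling loses information, the toy example below resamples a 5-minute cumulative series to hourly resolution and compares the visible removals in each. The function and the sample series are our illustrative assumptions, not the paper’s estimation procedure.

```python
def visible_removals(series):
    """Total views removed that are visible as decreases in a
    cumulative view-count series."""
    return sum(max(series[i - 1] - series[i], 0)
               for i in range(1, len(series)))

def undercount_from_downsampling(counts_5min, per_bin=12):
    """Compare the removals visible at 5-minute resolution with those
    left visible after keeping only one sample per hour (12 bins).
    Returns (fine_total, coarse_total)."""
    hourly = counts_5min[::per_bin]
    return visible_removals(counts_5min), visible_removals(hourly)

# a correction of 50 views followed by 60 new views within the same hour:
# visible at 5-minute resolution, invisible in the hourly series
fine = [0, 10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 60, 70]
print(undercount_from_downsampling(fine))  # (50, 0)
```

Corrections that are fully offset by new views even within a 5-minute window remain invisible at both resolutions, which is the gap the Reconstruction model is meant to fill.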
2.1 Scale of the phenomenon
The removal of fake views is evident when the series of hourly views has negative
entries: some examples are shown in Figure 1A. In fact, we have found that
fake view corrections are extremely common: we detected corrections for almost
all monitored channels (90% of them) and for 61% of the videos in our corpus.
Such a large scope underscores the importance of better understanding how these
corrections are made. In fact, corrections in our corpus amount to about 22.5
million. Although they represent, on average, a seemingly modest 0.5 percent
of the total views, their number remains impressive and, more importantly, their
distribution is very uneven. If we look at the Lorenz curve (Figure 1D) of the
distribution of corrections among different videos, we can see that most of the