Security and Privacy in Big Data Sharing: State-of-the-Art
and Research Directions
HOUDA FERRADI, The Hong Kong Polytechnic University, Hong Kong
JIANNONG CAO, The Hong Kong Polytechnic University, Hong Kong
SHAN JIANG, The Hong Kong Polytechnic University, Hong Kong
YINFENG CAO, The Hong Kong Polytechnic University, Hong Kong
DIVYA SAXENA, The Hong Kong Polytechnic University, Hong Kong
Big Data Sharing (BDS) refers to the act of data owners sharing data so that users can find, access, and use
the data according to an agreement. In recent years, BDS has been an emerging topic due to its wide applications,
such as big data trading and cross-domain data analytics. However, because multiple parties are involved in
a BDS platform, the issue of security and privacy violation arises. There have been a number of solutions
for enhancing security and preserving privacy at different big data operations (e.g., data operation, data
searching, data sharing, and data outsourcing). To the best of our knowledge, there is no existing survey that
has particularly focused on the broad and systematic developments of these security and privacy solutions. In
this study, we conduct a comprehensive survey of the state-of-the-art solutions introduced to tackle security
and privacy issues in BDS. For a better understanding, we first introduce a general model for BDS and identify
the security and privacy requirements. We discuss and classify the state-of-the-art security and privacy
solutions for BDS according to the identified requirements. Finally, based on the insights gained, we present
and discuss promising new research directions.
ACM Reference Format:
Houda Ferradi, Jiannong Cao, Shan Jiang, Yinfeng Cao, and Divya Saxena. 2022. Security and Privacy in Big
Data Sharing: State-of-the-Art and Research Directions. 1, 1 (October 2022), 33 pages. https://doi.org/10.1145/
nnnnnnn.nnnnnnn
1 INTRODUCTION
The term big data, as the name suggests, refers to information assets characterized by high volume,
fast access speed, and a large ontological variety. Dealing with big data requires specific technologies
and analytical methods for its transformation into value. The term big data sharing (BDS) refers
to the act of the data sharer sharing big data so that the data sharee can find, access, and use it
in the agreed ways. BDS not only improves the speed of getting data insights, but can also help
strengthen cross-domain data analytics and big data trading. Over the last few years, there has been a
huge demand for big data sharing in various industries, which has led to an explosive growth of
information. Over 2.5 quintillion bytes of data are created every single day, and the amount of
data is only going to grow from there. It was estimated that, by 2020, 1.7 MB of data would be created
every second for every person on earth. Due to constraints related to the limitations of data storage
Authors’ addresses: Houda Ferradi, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, Hong Kong; Jiannong
Cao, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, Hong Kong; Shan Jiang, The Hong Kong Polytechnic
University, Hung Hom, Hong Kong, Hong Kong; Yinfeng Cao, The Hong Kong Polytechnic University, Hung Hom, Hong
Kong, Hong Kong; Divya Saxena, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, Hong Kong.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2022 Association for Computing Machinery.
XXXX-XXXX/2022/10-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
, Vol. 1, No. 1, Article . Publication date: October 2022.
arXiv:2210.09230v1 [cs.CR] 17 Oct 2022
2 Houda Ferradi, Jiannong Cao, Shan Jiang, Yinfeng Cao, and Divya Saxena
Table 1. Notations
Notation   Description
DW         Downloaded data
OP         Data operations, which include data searching, data outsourcing, and data computation
ENC        Data is encrypted during the storage and sharing process
RAW        Data is raw during the storage and sharing process
INT        Data Sharer/Internal User
EXT        Data Sharee/External User
resources, centralized servers dedicated to large-scale storage (e.g., cloud) are usually regarded as the best
approach for BDS: on the one hand, the centralized server provides a solution that is both scalable
and accommodating for BDS and business analytics, while on the other hand, BDS provides data
analytics for actionable insights and predictions. However, a centralized server comes at
a price: it adds a level of security and privacy threats, since its essential services
are often outsourced to an untrusted third party, which makes it harder to maintain the basic data
security and privacy requirements, such as confidentiality, integrity, and privacy of the shared data.
Thus, enforcing security and privacy in BDS as a whole is an important concern. Otherwise, data
integrity and confidentiality can easily be compromised.
1.1 General Model of BDS and its Security and Privacy Concerns
In this section, we first discuss the general model of BDS. Then, we describe the general operations
needed based on that model. Accordingly, we broadly categorize the security and
privacy notions needed for BDS.
General Model of BDS. Before discussing security and privacy concerns, it is necessary to
define a general model of BDS and its input/output. The general model that we consider is based
on the centralized server approach. We propose a general model of a BDS system that allows the data
sharer and sharee to create, store, access, download, search, and manipulate databases, and that takes full
account of data access control, user accessibility, and the form of shared data.
The remote centralized service provider (e.g., cloud), which stores and manages the data generated
by the data sharer, is considered an untrusted party by the two other parties. The sharing activity
could be either operating on data (e.g., searching or computation) or downloading data. The shared
data could be either raw or encrypted. Table 1 shows the notations.
[Figure 1: the Data Sharer stores raw data, or encrypted data and an index, on the Centralized Server. The Data Sharer and the Data Sharee each send queries (DW and OP) to the Centralized Server and receive results.]
Fig. 1. General Model for BDS
Our general model for BDS consists of the following two entities (as shown in Fig. 1):
Data Sharer/Internal User. A data sharer is the data owner (or internal user) who shares his
own data through a larger server storage. In such a system, the data sharer can either use and
operate on his own shared data or give the data sharee access to it.
Data Sharee/External User. A data sharee (or external user) accesses and uses data stored by others.
In such a system, the data sharer gives the data sharee access to the data, and the data sharee either
downloads the data from the server or operates directly on it.
Having defined the components of BDS, we now explain its general procedures.
BDS operations can be divided into a few distinct groups, each with its own characteristics.
Herein, we define and discuss the general operations needed for BDS, i.e., data downloading, data
storing, data computation, data searching, and data outsourcing:
Data sharing. A data sharer might query a BDS platform with some constraints to learn
hidden patterns and correlations, compute a function, or gain other insights. Generally speaking,
data querying covers data downloading and three data operations, i.e., data computation, data
searching, and data outsourcing, as follows:
Data downloading. Data downloading is the process through which the data sharer or data
sharee retrieves the result of a data query.
Data operation.
(1) Data computation. Data computation is the process through which the data sharer and the data
sharee jointly compute a function with BDS over their inputs while keeping those inputs
private.
(2) Data searching. Data users might query a BDS platform with some constraints to learn
hidden patterns, correlations, and other insights. Data searching must cover both
unstructured and structured data, which requires handling huge amounts of
data as quickly as possible.
(3) Data outsourcing. Data users might delegate a portion of data to be outsourced to
external providers who offer data management functionalities.
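The model and operations above can be sketched in a few lines of Python. This is only an illustrative toy under our own assumptions, not part of the surveyed systems: the class and method names (CentralizedServer, upload, grant, query) are hypothetical, the OP query is reduced to a plain keyword search, and operating on encrypted data is left unimplemented because it requires one of the dedicated techniques discussed later. The enum members reuse the notations of Table 1.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    INT = "data sharer / internal user"
    EXT = "data sharee / external user"


class Form(Enum):
    RAW = "raw"
    ENC = "encrypted"


class Query(Enum):
    DW = "download"
    OP = "operate"


@dataclass
class CentralizedServer:
    """Hypothetical untrusted server that stores data on behalf of the sharer."""
    store: dict = field(default_factory=dict)  # dataset name -> (Form, bytes)
    users: dict = field(default_factory=dict)  # dataset name -> {user: Role}

    def upload(self, sharer: str, name: str, form: Form, blob: bytes) -> None:
        # The sharer stores raw or encrypted data and becomes its internal user.
        self.store[name] = (form, blob)
        self.users[name] = {sharer: Role.INT}

    def grant(self, name: str, sharee: str) -> None:
        # The sharer gives an external user access to the dataset.
        self.users[name][sharee] = Role.EXT

    def query(self, user: str, name: str, kind: Query, keyword: bytes = b""):
        if user not in self.users.get(name, {}):
            raise PermissionError(f"{user} has no access to {name}")
        form, blob = self.store[name]
        if kind is Query.DW:
            return blob  # data downloading
        if form is Form.ENC:
            # Searching or computing on ciphertext needs a dedicated scheme.
            raise NotImplementedError("OP on encrypted data is out of scope here")
        return keyword in blob  # toy data operation: keyword search on raw data
```

For example, after `server.upload("alice", "log", Form.RAW, b"temp=21;temp=22")` and `server.grant("log", "bob")`, the sharee can issue both `Query.DW` and `Query.OP` requests, while any other user is rejected.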
Although BDS has numerous benefits, it is non-trivial to design a solution because
of these requirements. Security and privacy are necessary for BDS; otherwise, its value will be
lost: if a BDS platform is not secure or cannot protect privacy, then users
can hardly trust such a technology and may not use it at all. Below, we categorize and explain the
basic concerns, i.e., data security, data privacy, and user privacy, as follows:
Security and Privacy Concerns in BDS. Considering these data sharing operations, we
categorize the data security and privacy notions needed in BDS, i.e., Data Security, Data Privacy, and
User Privacy. Data security refers to how data is protected from an attacker, namely, preventing
malicious access, usage, modification, or unavailability of the big data by anyone other than the
sharing parties. Data privacy is about protecting an individual's information from being disclosed
to others, as data may contain sensitive information, such as Personally Identifiable
Information (PII), personal healthcare information, and financial information, which should be
protected whenever the data is collected, stored, and shared (e.g., by applying a governing regulation
or law such as the General Data Protection Regulation (GDPR)). There are two parties in BDS,
the data sharer and the data sharee. User privacy is about protecting the identity of the data sharer from
exposure to other parties, and the identities of the two parties from each other. It requires that the two parties involved in BDS
focus on the data itself without knowing each other.
1.2 Motivation and Contributions
For the past few years, the topic of big data security and privacy has been explored in many
surveys. Most of these survey papers [29, 54, 111, 116, 117] give a short overview of security and
privacy techniques in BDS. This work aims to contribute a comprehensive survey of security and
privacy in BDS in terms of formal definitions, security and privacy requirements, security and
privacy techniques used to fulfill the requirements, classification of techniques, and future challenges.
Compared to other surveys that can be found in the literature, our contributions are as follows:
New taxonomy. After providing an in-depth understanding and up-to-date discussion
of BDS and its operations, we identify security and privacy requirements within
BDS and present a novel taxonomy to structure solutions by fulfilled requirements.
Comprehensive survey. In accordance with the taxonomy, we discuss the benefits and
limitations of the state-of-the-art solutions that fulfill the identified security and privacy
requirements.
Future directions. Finally, based on our survey, we provide a list of lessons learned, open
issues, and directions for future work.
1.3 Security & Privacy Concerns in Big Data Sharing Applications
BDS has many application fields, such as healthcare [102], supply chain management [105][121],
and open government [120]. In this section, we introduce three attractive applications that have caught
the attention of both industry and academia in recent years.
Privacy and Pandemic.
Global leaders are increasingly relying on information about individuals
and communities to control the spread of COVID-19 and respond to its economic, political, social,
and health impacts. Time is of the essence, and leaders must quickly decide essential questions
about what personal information they will collect or disclose, to whom, and under what conditions.
It is important that privacy concerns do not become an obstacle to effective health and safety
measures, but also that we do not open a door to privacy violation or limitless surveillance.
Federated Learning (FL). FL, a subfield of AI, enables multiple decentralized
edge devices or servers holding local data samples to collaboratively learn a shared prediction model
while keeping all the training data private. In recent years, FL has received extensive attention from
both academia and industry because it can address privacy problems in machine learning. However,
there are many challenges in FL, and although there are solutions to these challenges, most existing
solutions need a trusted, centralized authority that is difficult to find.
Medical Research and Healthcare. In recent years, more and more health data are being
generated. All these big data put together can be used to predict the onset of diseases so that
preventive steps can be taken. However, health data contains personal health information
(PHI), and because of the risk of violating privacy, accessing the data raises legal concerns [75].
Health data can be anonymized using masking and de-identification techniques and be
disclosed to researchers based on a legal data sharing agreement [50].
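As a rough illustration of what masking and de-identification can look like in practice, the following Python sketch replaces a direct identifier with a keyed pseudonym, masks a quasi-identifier, and generalizes an exact age into a ten-year band. The record fields (patient_id, postcode, age, diagnosis) and the function names are assumptions made for this example only, and such transformations alone do not guarantee anonymity against linkage attacks.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    # Keyed hash: the same patient always maps to the same pseudonym,
    # but the mapping cannot be recomputed without the key.
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()[:16]


def mask(value: str, visible: int = 2) -> str:
    # Keep only the first `visible` characters and hide the rest.
    return value[:visible] + "*" * max(len(value) - visible, 0)


def deidentify(record: dict, key: bytes) -> dict:
    decade = record["age"] // 10 * 10
    return {
        "patient": pseudonymize(record["patient_id"], key),
        "postcode": mask(record["postcode"]),
        "age_band": f"{decade}-{decade + 9}",  # generalize the exact age
        "diagnosis": record["diagnosis"],      # attribute kept for research
    }
```

The key would stay with the data sharer, so that the researcher receiving the records cannot reverse the pseudonyms.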
1.4 Organisation
This paper is structured as follows. In Section 1, we start by defining BDS, which allows us to discuss
the differences between the security and privacy notions in BDS. Next, we provide a comprehensive
topical overview of BDS by introducing its general model and general procedures using a centralized
architecture. Based on this, we describe the different assumptions and scope for that model. In
Section 2, we describe the basic security requirements, as well as additional ones that
are needed in BDS, and then describe their corresponding techniques. In Section 3, we describe
the privacy requirements in terms of data and user privacy and their corresponding techniques.
In Section 4, we review, summarize, and compare the security and privacy techniques that fulfill the
needed security and privacy requirements. In Section 5, we discuss the challenging issues as well as
new future research directions for BDS. Finally, in Section 6, we conclude this article.
2 SECURITY IN BDS
In this section, we first define the required security properties and set the security
assumptions. Based on this, we give an overview of the existing cryptographic techniques, which allows us to
describe how these techniques can be incorporated into the BDS system. Finally, we provide a
classification that compares the various cryptographic techniques.
2.1 Security Requirements in BDS
In this section, we first recall the four most fundamental security requirements coming
from information systems, i.e., the CIA triad (confidentiality, integrity, and availability) together
with non-repudiation, which are defined as follows:
Data Confidentiality during Outsourcing. Confidentiality is the cornerstone of BDS
security. It refers to the protection of data against unauthorized
access during the sharing process. Otherwise, the data could lose its value.
Data Integrity. We distinguish two types of integrity in the data sharing context. Usage Integrity
(or Data Integrity) ensures that any unauthorized modification of sensitive data in use
is detectable; otherwise, the veracity of the data cannot be guaranteed. Data Source Authenticity means
that the data should be consistent over the whole BDS process. The distinction between data
integrity and authentication is frequently blurred because integrity can also provide authentication.
In essence, an integrity primitive takes as a parameter a message 𝑚 and proves that the sender
actually mixed his secret with 𝑚 to attest 𝑚's origin. An authentication primitive does not involve
any message (no "payload") and is only meant to check that the authenticated party actually knows
a given secret. It follows that, to achieve authentication, the secret owner can simply be challenged to
attest the integrity of a random challenge 𝑚 chosen by the verifier. In practice, this is indeed the
way in which numerous commercial products implement authentication using integrity primitives.
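The relation between the two primitives can be made concrete with a short Python sketch. It assumes a shared secret between the two parties and uses HMAC-SHA256 from the standard library as the integrity primitive, which is our choice for illustration and not a scheme prescribed by the survey. The function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def tag(secret: bytes, m: bytes) -> bytes:
    # Integrity primitive: mixes the sender's secret with the message m.
    return hmac.new(secret, m, hashlib.sha256).digest()


def verify(secret: bytes, m: bytes, t: bytes) -> bool:
    # Constant-time comparison of the expected and the received tag.
    return hmac.compare_digest(tag(secret, m), t)


def authenticate(prover_secret: bytes, verifier_secret: bytes) -> bool:
    # Authentication built from the integrity primitive:
    # the verifier picks a random challenge m ...
    challenge = secrets.token_bytes(32)
    # ... the prover attests the integrity of m with its secret ...
    response = tag(prover_secret, challenge)
    # ... and the verifier checks the tag; no payload is involved.
    return verify(verifier_secret, challenge, response)
```

A fresh random challenge per run is what prevents a recorded response from being replayed later.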
Non-repudiation. While integrity ensures that data has not been tampered with, non-repudiation
provides evidence that prevents an individual or entity from denying having performed a particular action.
In other words, non-repudiation provides proof of the origin of data and the integrity of the data.
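Non-repudiation is typically obtained with digital signatures, where only the signer holds the signing key, so a valid signature cannot be attributed to anyone else. The sketch below uses a Lamport one-time signature built only from SHA-256, chosen here because it fits in the Python standard library; it is an illustration under that assumption, each key pair must sign a single message only, and the function names are hypothetical.

```python
import hashlib
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _bits(message: bytes) -> list:
    # The 256 bits of SHA-256(message), most significant bit first.
    digest = _h(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def keygen():
    # Secret key: 256 pairs of random values. Public key: their hashes.
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    pk = [(_h(a), _h(b)) for a, b in sk]
    return sk, pk


def sign(sk, message: bytes) -> list:
    # Reveal one secret value per digest bit; only the key owner can do this.
    return [sk[i][bit] for i, bit in enumerate(_bits(message))]


def ots_verify(pk, message: bytes, signature: list) -> bool:
    # Anyone holding the public key can check the revealed values.
    bits = _bits(message)
    return len(signature) == 256 and all(
        _h(s) == pk[i][b] for i, (b, s) in enumerate(zip(bits, signature))
    )
```

Unlike a tag computed with a shared secret, which either party could have produced, a signature that verifies under the sharer's public key can be shown to a third party as evidence of origin.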
Availability. Data availability ensures that data is available for use whenever authorized
users want it. However, the introduction of cloud computing has narrowed down the issues of data availability
for big data. Denial of service (DoS) attacks, DDoS
attacks, and SYN flood attacks are the most common attacks that threaten data availability.
Besides the basic security requirements of BDS, we specify the additional security requirements
that we have identified for the BDS context, which are defined as follows:
Data Confidentiality during Computation. The data sharee and data sharer want to jointly
compute a function over their inputs while keeping those inputs private. For example, the data
collected from different sensors in an IoT system may be aggregated to generate the targeted
result; the cloud and the clients may cooperate to provide appropriate services. At the same time,
the private information and secret data should be protected. The computation procedures and
results on BDS should only be known by the data sharer and sharee during and after computation.
Unlike traditional cryptographic scenarios, where cryptography ensures security and integrity of
communication or storage and the adversary is assumed to be an outsider to the system of participants
(an eavesdropper on the sender and receiver), the cryptographic techniques in this model should
protect participants' privacy from each other.
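One simple way to picture such a joint computation is additive secret sharing, sketched below in Python for the sensor aggregation example. This is a toy under stated assumptions and not a technique attributed to any specific surveyed work: parties are honest but curious, inputs are non-negative integers whose sum stays below the modulus, and the names (MODULUS, share, secure_sum) are hypothetical. The whole exchange is simulated in one process.

```python
import secrets

MODULUS = 2**61 - 1  # assumed large enough to hold the true sum


def share(value: int, n_parties: int) -> list:
    # Split a value into n random shares that add up to it modulo MODULUS.
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(inputs: list) -> int:
    n = len(inputs)
    # Party i splits its private input and sends share j to party j.
    distributed = [share(x, n) for x in inputs]
    # Party j adds the shares it received and publishes only this partial sum.
    partial = [sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # The published partial sums reveal the total, not the individual inputs.
    return sum(partial) % MODULUS
```

Each individual share is uniformly random, so a single party learns nothing about another party's reading beyond what the final sum implies.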
Data Confidentiality during Searching. Data sharers want to store data in ciphertext
form while keeping the functionality to search keywords in the data, i.e., to protect the privacy of