Methods To Ensure Privacy Regarding Medical Data Including an examination of the differential privacy algorithm RAPPOR and its implementation in CrypTool 2



Christina Wölk

University of Siegen

Student (Bachelor’s Degree in Computer Science)

christina.woelk@student.uni-siegen.de

Abstract: This document examines several applicable methods to ensure the privacy of data gathered in the health care sector. To ensure a common understanding of the topic, the introduction explains the need for anonymization methods based on an example. Next, reasons for data collection are introduced together with the motivation for protecting this data, as well as the privacy laws currently in force to enforce this protection. The question of what kind of privacy is meant and which conditions have to be fulfilled is dealt with in the subsequent chapter “Differential Privacy”. With this established, common anonymization methods are explained and reviewed for their use in the health care sector.

The RAPPOR algorithm and its differential privacy are dealt with in more detail before coming to a conclusion.

I. INTRODUCTION

Privacy is valued by the majority of German citizens. With the growing number of ways to use technology and collect data, safety measures have to be increased as well. So-called “anonymity” (in this case meaning the removal of one's name from a dataset) alone is no longer sufficient to ensure one's privacy. A person can often be identified by a very limited amount of data (e.g., date of birth, hometown and gender). Since personal data is often collected in several contexts and from several sources, records (e.g., medical records) can be linked to a specific person and exposed. An example from the 1990s:

The “Massachusetts Group Insurance Commission” released anonymized data on state employees, revealing hospital visits, to establish a bigger database for researchers. During the anonymization, name, address and social security number were removed. Latanya Sweeney (at that time a graduate student in computer science) was able to identify the then Governor of Massachusetts just by buying the voter rolls (for approximately 20 US dollars [2]) from the city of Cambridge (where she knew the governor resided), which included the name, address, ZIP code, birth date and sex of every voter. Knowing his ZIP code, together with his birth date and sex, she was able to identify his medical records and thus learned every diagnosis and every prescription. [2]

This example illustrates the limitations of anonymization. In this case, the student left it at sending the medical records to the governor's office [2]. However, so-called “linkage attacks” (where sensitive information is linked back to the person it belongs to) remain a risk that should not be ignored. Therefore, this paper examines methods to ensure privacy regarding medical data, focusing especially on the RAPPOR algorithm provided by Google, as described in the abstract.
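To make the mechanics of a linkage attack concrete, the following minimal sketch joins an “anonymized” medical dataset with a public list on the attributes both share. All records, names and field names are invented for illustration and do not come from the case described above; the sketch only assumes that the two datasets share ZIP code, birth date and sex.

```python
# Hypothetical toy data: every name and value below is invented.
medical_records = [  # "anonymized": name, address and SSN already removed
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02138", "birth_date": "1962-03-12", "sex": "F", "diagnosis": "migraine"},
    {"zip": "02139", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "asthma"},
]
voter_roll = [  # public list that still contains names
    {"name": "Alice Example", "zip": "02138", "birth_date": "1962-03-12", "sex": "F"},
    {"name": "Bob Example", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
]

# Attributes present in both datasets (assumed for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(record):
    """Project a record onto the shared attributes."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(voters, records):
    """Return {name: diagnosis} for voters matching exactly one medical record."""
    by_key = {}
    for r in records:
        by_key.setdefault(key(r), []).append(r)
    linked = {}
    for v in voters:
        matches = by_key.get(key(v), [])
        if len(matches) == 1:  # a unique match re-identifies the record
            linked[v["name"]] = matches[0]["diagnosis"]
    return linked


reidentified = link(voter_roll, medical_records)
print(reidentified)
```

In this toy example both voters are re-identified, although the medical dataset contains no names at all. The attack only fails for a person whose combination of attributes matches several records or none.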

II. DATA PROTECTION IN THE HEALTHCARE SECTOR

Collecting information for research purposes is not new. However, in medical research certain problems arise to a greater extent than elsewhere. The most frequent one is the size of the study. When researching how to improve health, participants who suffer from certain medical conditions are almost always required. Even something as small as a new medicine that alleviates headaches can only be tested on persons who have a headache. For rare diseases, this can become a major problem, since results obtained from a small sample may not be statistically significant. The group of potential participants is therefore often very limited and cannot be enlarged (for obvious ethical reasons). This complicates anonymization, because an individual is easier to identify in a small group than in a large one.
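The effect of group size can be illustrated with a small sketch that counts how many participants share the same combination of attributes. The cohorts, attribute names and counts below are hypothetical; a smallest group of size one means that at least one participant is unique within the study.

```python
from collections import Counter


def smallest_group_size(records, attributes):
    """Size of the smallest group of records sharing the same attribute values."""
    groups = Counter(tuple(r[a] for a in attributes) for r in records)
    return min(groups.values())


# Hypothetical cohorts; attribute names and counts are invented.
common_condition = (
    [{"age_band": "30-39", "sex": "F"}] * 40
    + [{"age_band": "30-39", "sex": "M"}] * 35
)
rare_disease = [
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "30-39", "sex": "M"},
    {"age_band": "40-49", "sex": "F"},
]

attrs = ("age_band", "sex")
print(smallest_group_size(common_condition, attrs))  # 35
print(smallest_group_size(rare_disease, attrs))      # 1
```

In the large cohort every participant hides among at least 34 others, whereas in the rare-disease cohort age band and sex alone already single out each participant.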

A. What kind of data is gathered? (And for what reasons?)

Not all data is alike. Especially in the health care sector, patients' data is often very sensitive. The following sections review the types of data that are frequently dealt with.

1) Data concerning age and sex: Effects and side effects of medical treatments can differ depending on the sex of the patient. This is related to height and weight, which influence the appropriate dose, as well as to hormonal differences that interact with specific medications [19].

Therefore, this information is needed to ensure proper medication. It is relevant not only for treatment, but also for the prevention of statistically probable illnesses or diseases. A good example is the invitation from the general practitioner to a breast cancer checkup when a woman reaches the age of 30, or to a prostate cancer checkup when a man reaches the age of 45 [20]. Even if the causes of the increased occurrence are unknown, taking precautions is recommended because pure statistics do, in fact, save lives.

arXiv:2210.09963v1 [cs.CR] 18 Oct 2022

All data regarding the treatment, as well as the diagnosis, is protected by medical confidentiality and by data protection provisions to which the patient has to agree.

2) Socioeconomic factors: Socioeconomic factors such as income, spoken languages, place of residence, job and family status are rarely relevant for ordinary treatment. Nevertheless, they can be of utmost importance for research, especially long-term studies. The development and emergence of many diseases depend not on one factor, but on a large number of factors. For instance, it is known that many factors increase the risk of getting cancer at a younger age (although a lot of research remains to be done). Therefore, this kind of data mining is necessary to gain insight into these matters.

3) Data regarding lifestyle: A balanced diet, regular exercise, etc. influence health in a positive way; conversely, a lack of exercise and a lot of unhealthy food increase the risk of diseases like type 2 diabetes [21]. This data is important for long-term studies as well as for almost any other medical question, since lifestyle affects our everyday life. It is hard to measure and is observed almost solely by the patient rather than the doctor. This leads to inaccuracy, which is why a lot of data has to be gathered before it can be deemed useful.

B. What are the consequences of insufficient protection of medical data?

1) Patient's point of view: As can be seen in the paragraphs above, medical data consists of a variety of sensitive data concerning our private life. Many of these factors (e.g., income) indicate or influence a certain social status. Revealing sensitive data can lead to social stigmatization and therefore to mental stress. A doctor or researcher will not judge you for having a mental illness or consuming drugs, but your social environment or your employer might. The data does not even have to be something with a lot of prejudice attached to it. The concern that you, as a migraine patient, might be on sick leave more often can be enough for you not to get the job.

Other parties potentially interested in your medical data are insurance companies. They want to know the risk of having to pay for you as their customer, or to find a reason to make the insurance more expensive for you. Additionally, it has to be considered that information about one person's health can also affect other people, as some diseases are genetic.

2) Company's point of view - The General Data Protection Regulation: Since 2018, the General Data Protection Regulation, a regulation of the European Union, has been in force in Germany [4]. Its main goal is to ensure informational self-determination as well as other fundamental freedoms. In Germany, the person to whom the data relates is the owner of the data. Therefore, it is prohibited to process personal data unless explicitly permitted. Data protection aims for data integrity, data confidentiality and (especially important for this topic) data resilience, which means resilience towards, e.g., hackers. [4]

Violations of the General Data Protection Regulation can lead to fines of up to 20 million euros or 4% of the worldwide annual turnover of the company (whichever amount is higher) [4].
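As a small worked example of the “whichever is higher” rule, the following sketch computes the upper bound of the fine for two hypothetical companies. The function name and the turnover figures are assumptions made for illustration; actual fines are set case by case and may be far below this bound.

```python
def max_gdpr_fine_eur(worldwide_annual_turnover_eur: int) -> int:
    """Upper bound of the fine: 20 million euros or 4% of turnover, whichever is higher."""
    return max(20_000_000, worldwide_annual_turnover_eur * 4 // 100)


# Hypothetical companies (turnover in whole euros).
print(max_gdpr_fine_eur(100_000_000))    # 20000000: 4% would only be 4 million
print(max_gdpr_fine_eur(1_000_000_000))  # 40000000: 4% exceeds 20 million
```

The fixed amount dominates for any company with a turnover below 500 million euros; above that, the percentage determines the bound.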

III. DIFFERENTIAL PRIVACY

Differential privacy pursues the goal of obtaining responses that are as accurate as possible (e.g., from surveys or user behaviour) while making it as difficult as possible to identify a person by his or her answers. The parameter ε is used to “measure” the extent of the privacy provided. A small ε represents a strong privacy guarantee, as a consequence of the definition of differential privacy:

“A randomized function κ gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element, and all S ⊆ Range(κ):

\[ \Pr[\kappa(D_1) \in S] \le e^{\varepsilon} \times \Pr[\kappa(D_2) \in S] \tag{1} \]

” [5]. Range(κ) is the set of every possible outcome of the function κ, which could, for example, be the set of all integers. The left side of the inequality is the probability (Pr) that the output of the randomized function κ, applied to the full database D1, lies in the subset S. The right side is the same probability for a database D2 from which one entry has been removed, multiplied (×) by e^ε. Note that for ε = 0 the factor is one, giving the highest possible privacy: the output is then equally likely whether or not a given person is included in the database. For a small ε, the two probabilities are nearly the same, so it makes almost no difference to a person's privacy whether he or she is included in the database or not. When using differential privacy methods, the real responses are not necessarily sent to the server. Instead, with a certain probability, the given answer will be a random one. This protects the user's data, even if the user's response is intercepted multiple times, because the real response is harder to reconstruct. Differential privacy is not focused on the method, but on the result: how well is the privacy of the user protected?
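The idea of sending a random answer with a certain probability can be sketched for a single yes/no question. The following is a minimal illustration of classic randomized response, not the RAPPOR algorithm itself; the function names, the probability f and the simulated population are assumptions made for this example. For this mechanism, the smallest ε satisfying inequality (1) follows directly from the two possible report probabilities.

```python
import math
import random


def randomized_response(true_bit: int, f: float, rng: random.Random) -> int:
    """With probability f report a fair coin flip, otherwise report the true bit."""
    if rng.random() < f:
        return rng.randint(0, 1)
    return true_bit


def epsilon_for(f: float) -> float:
    """Smallest epsilon for which inequality (1) holds here (requires 0 < f <= 1)."""
    p_same = 1 - f / 2  # Pr[report == true bit]
    p_flip = f / 2      # Pr[report != true bit]
    return math.log(p_same / p_flip)


def estimate_true_fraction(reports, f: float) -> float:
    """Estimate the fraction of true 1s from the noisy reports (requires f < 1)."""
    observed = sum(reports) / len(reports)
    # E[observed] = (1 - f) * true_fraction + f / 2, solved for true_fraction.
    return (observed - f / 2) / (1 - f)


rng = random.Random(0)                        # fixed seed for reproducibility
f = 0.5                                       # assumed randomization probability
true_bits = [1] * 30_000 + [0] * 70_000       # simulated population, 30% "yes"
reports = [randomized_response(b, f, rng) for b in true_bits]
estimate = estimate_true_fraction(reports, f)

print(round(epsilon_for(f), 4))  # 1.0986, i.e. ln(3)
print(estimate)                  # close to 0.3
```

Each individual report is deniable, since it may have been a coin flip, yet the aggregate fraction can still be estimated from many reports. Choosing f = 1 yields ε = 0 (every report is pure noise), which matches the observation above that ε = 0 gives the highest possible privacy, at the cost of all accuracy.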

IV. OVERVIEW OF PRIVACY PRESERVING METHODS

Data mining in the health care sector can improve, e.g.,

detection of diseases, but requires to ensure the patients
