
to other privacy quantifications. Further, we do not focus on
aggregating multiple methods into new DP frameworks.
We only present the performance of loss functions on EM-processed
data. Unlike the examples in [4], which purely
pursue better performance with unsupervised or ensemble-learning
pre-processing, our experiments still demonstrate the
insufficiency of depending on loss functions alone, especially
when the privacy budget is small. However, as loss functions
are fundamental elements of ML, testing them can provide
a reference for modifying a series of ML-DP hybrid
frameworks.
C. Paper Organization
The rest of the paper is organized as follows. In section
II, we introduce symmetric losses and their properties, as well
as EM and its implementation. In section III, we discuss some
theoretical properties of EM. In section IV, we present the
experimental results on the CIFAR-10 dataset, which demonstrate
the robustness of the barrier hinge loss under EM.
Following the PDAA regular paper format, the proofs of the
properties in section III are moved to the Appendix, which is not
included in the publication. The full version can be found on
arXiv.
II. PRELIMINARIES
A. Symmetric Loss in Machine Learning
Any given data point can be represented by a series of real values,
a (d+k)-tuple (x ∈ R^d, y ∈ R^k). In ML, x is called the features,
and y the labels. Our dataset D is a collection of such
tuples; i.e., D := ⟨x_i, y_i⟩_{i=1}^{n} is a dataset of size n. It is
not necessary that every entry of every tuple in a dataset D
holds a meaningful value. For example, we may have 0's or wrong
values in some entries. Normally, features are treated as fully
and correctly known; otherwise, we can move the
unclear entry or entries to the label side.
Machine learning is meant to find the internal relationship
between features and labels in a set, and to make predictions for
prospective data. Traditionally, if all y's are correctly valued,
the process is called supervised learning; if they are partially
unclear, it is called semi-supervised learning; if all labels are
unknown, it is called unsupervised learning. Formally, the learning
goal is a function f : R^d → R^k that describes this relation as
well as possible over the whole dataset (i.e., |f(x) − y| = 0 for
all (x, y) is desired).
In this work, we consider binary classification problems,
i.e., tasks where y ∈ {1, −1}. We call (x, +1) and
(x, −1) positive and negative data, respectively. We also
define D_P = {(x, y) ∈ D : y = 1} and D_N = {(x, y) ∈ D : y = −1}.
The function f is controlled by parameters and hyperparameters.
The learning process optimizes these parameters according to given
objectives. An intuitive objective, for example, is the Bayes risk
R(f) = E[l(f(x), y)], where l, which satisfies the definition of a
distance function, is called the loss function. Since it is
impossible to know and calculate all possible or potential data of
the same problem, we normally restrict the expectation to datasets
of finite size; the resulting quantities are named empirical risks,
of the form (1/|D|) Σ_{(x,y)∈D} l(f(x), y).
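As a concrete illustration, the empirical risk can be computed directly from a dataset; the following is a minimal numpy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def empirical_risk(f, loss, X, y):
    """(1/|D|) * sum of loss(f(x), y) over all (x, y) in D."""
    preds = np.array([f(x) for x in X])
    return np.mean(loss(preds, y))

# Toy example: squared loss with a predictor that fits perfectly.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
squared = lambda p, t: (p - t) ** 2
risk = empirical_risk(lambda x: x[0], squared, X, y)  # → 0.0
```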
The trivial empirical risk weighs the losses from all data points
equally, which yields functions that suffer from disparate
treatment of the data. Manipulating it is therefore necessary,
and different predictors can be derived accordingly. In binary
classification problems, to fairly evaluate the errors from
positively and negatively labeled data, one can leverage the
area under the receiver operating characteristic curve (AUC) [13]
or the balanced error rate (BER) [14], [15]. The AUC- and BER-risks
are defined as follows.
Definition 1. The empirical AUC-risk is defined as:

R_AUC = E_{D_P}[E_{D_N}[l(f(x_P) − f(x_N))]]
      = ( Σ_{(x_P,1)∈D_P} Σ_{(x_N,−1)∈D_N} l(f(x_P) − f(x_N)) ) / (|D_P| |D_N|)
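The empirical AUC-risk averages the loss over all positive/negative pairs. A minimal numpy sketch (identifiers are illustrative, not from the paper):

```python
import numpy as np

def auc_risk(f, loss, X_pos, X_neg):
    """Empirical AUC-risk: mean of loss(f(x_P) - f(x_N))
    over all |D_P| * |D_N| positive/negative pairs (Definition 1)."""
    fp = np.array([f(x) for x in X_pos])
    fn = np.array([f(x) for x in X_neg])
    margins = fp[:, None] - fn[None, :]  # pairwise differences via broadcasting
    return np.mean(loss(margins))

# With the 0-1 loss this counts the fraction of mis-ordered pairs.
zero_one = lambda m: (m <= 0).astype(float)
X_pos, X_neg = [2.0, 3.0], [1.0, 2.5]
r = auc_risk(lambda x: x, zero_one, X_pos, X_neg)  # → 0.25 (1 of 4 pairs)
```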
Definition 2. The empirical BER-risk is defined as:

R_BER = (1/2) [E_{D_P}[l(y f(x))] + E_{D_N}[l(y f(x))]]
      = (1/2) [ (1/|D_P|) Σ_{(x,1)∈D_P} l(f(x)) + (1/|D_N|) Σ_{(x,−1)∈D_N} l(−f(x)) ]
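The empirical BER-risk instead averages the loss within each class first and then weighs the two classes equally, regardless of class sizes. A minimal numpy sketch (identifiers are illustrative):

```python
import numpy as np

def ber_risk(f, loss, X_pos, X_neg):
    """Empirical BER-risk (Definition 2): per-class average losses,
    combined with equal weight 1/2 each."""
    pos_term = np.mean([loss(f(x)) for x in X_pos])
    neg_term = np.mean([loss(-f(x)) for x in X_neg])
    return 0.5 * (pos_term + neg_term)

# With the 0-1 loss this is the balanced error rate itself.
zero_one = lambda m: float(m <= 0)
X_pos, X_neg = [2.0, -1.0], [-1.0, -2.0, 3.0]
r = ber_risk(lambda x: x, zero_one, X_pos, X_neg)
```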
The restrictions to certain datasets still induce distances
between empirical minimizers and Bayes minimizers. Let f* be the
optimal function describing the relation of the problem, and let
F be a collection of functions restricted to D. Such distances are
described in the Bayes sense as R(f*) − inf_{f∈F} R(f), and named
excess risks [16]. Classical excess risks are defined with the 0-1
loss function, i.e., the function l_01 such that l_01(x) = 1 for
x ≤ 0, and l_01(x) = 0 otherwise. However, because of the
non-differentiability and non-convexity of the 0-1 loss at the
crucial point 0, which harm the fluency and cost of learning, we
normally adopt other (surrogate) losses instead, such as the
logistic sigmoid function, hinge functions, and so forth.
The unavoidable issue is how to guarantee that the minimizer of a
surrogate excess risk coincides with the minimizer of the 0-1
excess risk. This fundamental property is called classification-
calibration, and it has been proven to hold for most commonly used
surrogate losses (e.g., exponential, quadratic, hinge, and sigmoid
[16]). [5] has extended this property to symmetric functions,
which are functions f such that there exists a constant C with
f(x) + f(−x) = C for all x. (Hence, such a function is symmetric
about its intersection with the y-axis, not symmetric about the
y-axis.) Despite this merit, ordinary symmetric functions cannot
simultaneously be non-negative and convex [17], [18] (except for
horizontal lines). To cope with this issue, [5] proposes the use
of a restricted symmetric loss, called the barrier hinge function,
which is defined as follows:
Definition 3 ([5]). A barrier hinge function with parameters
b, r is defined as follows:

f_{b,r}(x) = max(−b(r + x) + r, max(b(x − r), r − x)).
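Definition 3 can be implemented directly. The sketch below (with illustrative parameter values, not from the paper) also checks that, on the interval [−r, r] with b large, the function reduces to r − x and therefore satisfies the symmetry condition f(x) + f(−x) = 2r there:

```python
import numpy as np

def barrier_hinge(x, b, r):
    """Barrier hinge loss of Definition 3:
    f_{b,r}(x) = max(-b(r + x) + r, max(b(x - r), r - x))."""
    return np.maximum(-b * (r + x) + r, np.maximum(b * (x - r), r - x))

# Illustrative parameters; on [-r, r] the max is attained by r - x,
# so f(x) + f(-x) = (r - x) + (r + x) = 2r, a symmetric loss there.
b, r = 200.0, 50.0
xs = np.linspace(-r, r, 101)
sym = barrier_hinge(xs, b, r) + barrier_hinge(-xs, b, r)  # → all equal to 2r
```

Outside [−r, r], the steep linear "barrier" branches (slope ±b) dominate, which is what restores non-negativity while keeping symmetry on the central interval.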