
where $m \in \mathbb{R}^q$ is the model vector and $S = \sum_{\ell} \sum_{k} S_{\ell,k}$. $F_{\ell,k}(\cdot)$ is the local loss function for user $k$, where
$$F_{\ell,k}(m) = \frac{1}{S_{\ell,k}} \sum_{(u_{k,j}, v_{k,j}) \in \mathcal{S}_{\ell,k}} f(m; u_{k,j}, v_{k,j}) + \lambda R(m), \tag{2.2}$$
and $f(m; u_{k,j}, v_{k,j})$ is the sample-wise loss function. $R(m)$ is a strongly convex regularization function and $\lambda \ge 0$. The model is trained by minimizing the global loss function:
$$m^{*} = \arg\min_{m} F(m). \tag{2.3}$$
To minimize $F(m)$, we use a distributed iterative gradient descent algorithm. Specifically, in the $t$-th communication round ($t \in \{1, 2, \ldots, T\}$, where $T$ is the total number of communication rounds), each user $k$ computes its own local gradient $\nabla F_{\ell,k}(m_t)$, and the users send the corrupted local gradients (with Gaussian noise added for LDP) to the edge servers. Then, edge server $\ell$ computes its estimate $\widehat{\nabla F_{\ell}}(m_t)$ of the partial gradient
$$\nabla F_{\ell}(m_t) = \frac{1}{S_{\ell}} \sum_{k \in \mathcal{C}_{\ell}} S_{\ell,k} \nabla F_{\ell,k}(m_t),$$
where $S_{\ell} = |\mathcal{S}_{\ell}|$ is the total number of samples in $\mathcal{S}_{\ell}$. The cloud server then forms its estimate $\widehat{\nabla F}(m_t)$ of the global gradient
$$\nabla F(m_t) = \frac{1}{S} \sum_{\ell=1}^{L} S_{\ell} \nabla F_{\ell}(m_t).$$
The global model $m_{t+1}$ updated by the cloud server is given by
$$m_{t+1} = m_t - \mu \widehat{\nabla F}(m_t), \tag{2.4}$$
where $\mu$ is the learning rate. For convenience, in the $t$-th communication round, we denote $W_{t,k} = S_{\ell,k} \nabla F_{\ell,k}(m_t)$.
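The round described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the squared-error sample loss, the regularizer $R(m) = \frac{1}{2}\|m\|^2$, and the names `local_gradient` and `hierarchical_round` are all assumptions made for the example. Noise is added to the scaled gradient $W_{t,k}$, as in the noise model the paper defines for the corrupted gradients.

```python
import random

def local_gradient(m, data, lam):
    """Gradient of an assumed local loss:
    (1/S_k) * sum_j 0.5*(u_j . m - v_j)^2 + lam * 0.5*||m||^2."""
    q = len(m)
    grad = [0.0] * q
    for u, v in data:
        residual = sum(ui * mi for ui, mi in zip(u, m)) - v
        for i in range(q):
            grad[i] += residual * u[i] / len(data)
    return [g + lam * mi for g, mi in zip(grad, m)]

def hierarchical_round(m, edges, mu, sigma, lam, rng):
    """One round: users -> edge servers -> cloud server, then update (2.4).

    edges: list over edge servers; each entry is a list of user datasets,
    and each dataset is a list of (u, v) samples.
    """
    q = len(m)
    S = sum(len(data) for users in edges for data in users)
    global_est = [0.0] * q
    for users in edges:
        S_l = sum(len(data) for data in users)
        edge_sum = [0.0] * q
        for data in users:
            grad = local_gradient(m, data, lam)
            for i in range(q):
                # W_{t,k} = S_{l,k} * gradient, corrupted by Gaussian noise for LDP.
                edge_sum[i] += len(data) * grad[i] + rng.gauss(0.0, sigma)
        edge_est = [x / S_l for x in edge_sum]       # edge server's estimate
        for i in range(q):
            global_est[i] += S_l * edge_est[i] / S   # cloud server's estimate
    return [mi - mu * gi for mi, gi in zip(m, global_est)]

# Hypothetical toy data: two edge servers, three users in total.
edges = [[[([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)], [([1.0, 1.0], 3.0)]],
         [[([2.0, 0.0], 2.0)]]]
m_next = hierarchical_round([0.0, 0.0], edges, mu=0.1, sigma=0.5, lam=0.01,
                            rng=random.Random(0))
```

With `sigma=0.0` the update coincides with ordinary gradient descent on the pooled data, which is a convenient sanity check on the two-level weighting by $S_{\ell,k}$ and $S_{\ell}$.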
B. Model formulation
In this paper, we assume that each edge server communicates with the cloud server without interference from other edge servers. Besides, we assume that the downlink communication is perfect, similar to [2], and that the eavesdropper is only interested in the data transmitted over the uplink between the edge servers and the cloud server. Hence, we focus only on the $T$ rounds of uplink communication between one of the edge servers and the cloud server. An information-theoretic view of the system in Figure 1 is illustrated in Figure 2. For simplification, we make the following assumptions:
• Similar to [2]-[3], we assume that the channel coefficients stay constant during the transmission (quasi-static fading channel).
• Similar to [3]-[4], we assume that the cloud server and the edge server have perfect channel state information (CSI) of the feedforward channel and the feedback channel.
• Following similar arguments in [11], we assume that the eavesdropper is an active user that is not trusted by the cloud server, which implies that the perfect CSI of the eavesdropper's channel is known by both the eavesdropper and the edge server. Moreover, we assume that the eavesdropper also knows the perfect CSI of the channels between the edge server and the cloud server.
Figure 2: An information-theoretic approach of Figure 1. (a) Encoding. (b) Decoding.

Information source: In Figure 2(a), we assume that $W_{t,k} \in \mathbb{R}^q$ is the $k$-th ($k \in \{1, 2, \ldots, K\}$) user's overall local gradient vector in the $t$-th ($t \in \{1, 2, \ldots, T\}$) communication round, where $W_{t,k} = (W_{t,k,1}, \ldots, W_{t,k,q})^T$. Similar to [12], the elements of $W_{t,k}$ are independent and identically distributed (i.i.d.) and $W_{t,k} \sim \mathcal{N}(0, S_{\ell,k} \sigma_{w,t}^2 I)$. Let $\eta_{t,k} = (\eta_{t,k,1}, \ldots, \eta_{t,k,q})^T$ be the local artificial Gaussian noise, i.i.d. according to the distribution $\mathcal{N}(0, \sigma^2 I)$. The corrupted local gradient $W'_{t,k} = (W'_{t,k,1}, \ldots, W'_{t,k,q})^T$ that is aggregated by the edge server is given by
$$W'_{t,k} = W_{t,k} + \eta_{t,k}, \tag{2.5}$$
where $W'_{t,k} \sim \mathcal{N}(0, (S_{\ell,k} \sigma_{w,t}^2 + \sigma^2) I)$ for $k \in \{1, 2, \ldots, K\}$.
The overall local gradients and the overall noises are defined as $W_t = (W_{t,1}, \ldots, W_{t,q})^T$ and $\eta_t = (\eta_{t,1}, \ldots, \eta_{t,q})^T$, respectively, where $W_{t,i} = \sum_{k=1}^{K} W_{t,k,i}$ and $\eta_{t,i} = \sum_{k=1}^{K} \eta_{t,k,i}$ ($i \in \{1, 2, \ldots, q\}$). According to (2.5), we define the overall corrupted local gradients sent to the edge server as $W'_t = (W'_{t,1}, \ldots, W'_{t,q})^T$, where $W'_{t,i} = \sum_{k=1}^{K} W'_{t,k,i}$ ($i \in \{1, 2, \ldots, q\}$). Here, note that since $W_{t,k}$ and $\eta_{t,k}$ are i.i.d. generated, $W'_t$ is also composed of i.i.d. components, where $W'_t \sim \mathcal{N}(0, (S_{\ell} \sigma_{w,t}^2 + K \sigma^2) I)$.
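As a quick check of the stated variance, the per-component variance of the aggregate can be worked out in one step. This sketch assumes that the $K$ users' gradients and noises are mutually independent and that $S_{\ell} = \sum_{k=1}^{K} S_{\ell,k}$, i.e. that the $K$ users are exactly those attached to edge server $\ell$. The second assumption is implied by the stated distribution rather than written out in the text.

```latex
\begin{align*}
\operatorname{Var}(W'_{t,i})
  &= \sum_{k=1}^{K} \operatorname{Var}\bigl(W_{t,k,i} + \eta_{t,k,i}\bigr) \\
  &= \sum_{k=1}^{K} \bigl(S_{\ell,k}\,\sigma_{w,t}^2 + \sigma^2\bigr) \\
  &= S_{\ell}\,\sigma_{w,t}^2 + K\sigma^2 .
\end{align*}
```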
Definition 1 (Privacy by mutual information [13]): If the mutual information between $W_t$ and $W'_t$ satisfies $\frac{1}{qT} \sum_{t=1}^{T} I(W_t; W'_t) \le \epsilon$, we say the LDP mechanism satisfies $\epsilon$-mutual-information privacy for some $\epsilon > 0$.
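Under the Gaussian model stated above, the mutual information in Definition 1 has a standard closed form. The worked step below is a sketch that assumes the aggregate gradient (i.i.d. components of variance $S_{\ell}\sigma_{w,t}^2$) is independent of the aggregate noise (i.i.d. components of variance $K\sigma^2$), with logarithms in nats.

```latex
\begin{align*}
I(W_t; W'_t)
  &= h(W'_t) - h(W'_t \mid W_t) = h(W'_t) - h(\eta_t) \\
  &= \frac{q}{2}\log\bigl(2\pi e\,(S_{\ell}\sigma_{w,t}^2 + K\sigma^2)\bigr)
     - \frac{q}{2}\log\bigl(2\pi e\,K\sigma^2\bigr) \\
  &= \frac{q}{2}\log\Bigl(1 + \frac{S_{\ell}\,\sigma_{w,t}^2}{K\sigma^2}\Bigr).
\end{align*}
```

Under these assumptions the privacy condition reads $\frac{1}{2T}\sum_{t=1}^{T}\log\bigl(1 + S_{\ell}\sigma_{w,t}^2/(K\sigma^2)\bigr)$ bounded by the privacy level, so increasing the noise variance $\sigma^2$ lowers the leakage.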
Definition 2 (Utility by quadratic distortion [14]): The utility of $W'_t$ is characterized by $d(W_t, W'_t) = \|W'_t - W_t\|^2$, where $\|X\|$ represents the $\ell_2$-norm of the vector $X$. If $\frac{1}{qT} \sum_{t=1}^{T} E(d(W_t, W'_t)) \le U$, we say the utility of $W'_t$ is up to $U$.
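As an illustration of Definition 2, the sketch below estimates the per-component distortion of one round by Monte Carlo. Because $W'_t - W_t$ is the sum of the $K$ users' noise vectors, the Gaussian noise model gives $\frac{1}{q}E(d(W_t, W'_t)) = K\sigma^2$ in every round, so the simulated value should land near $K\sigma^2$. The function name, parameter values, and seed are assumptions chosen for the example.

```python
import random

def empirical_distortion(q, K, sigma, trials, rng):
    """Monte Carlo estimate of (1/q) * E||W'_t - W_t||^2 for one round."""
    total = 0.0
    for _ in range(trials):
        for _ in range(q):
            # W'_{t,i} - W_{t,i} is the sum of the K users' noise samples.
            noise_sum = sum(rng.gauss(0.0, sigma) for _ in range(K))
            total += noise_sum ** 2
    return total / (q * trials)

# Hypothetical setting: q = 8 coordinates, K = 5 users, sigma = 0.5.
estimate = empirical_distortion(q=8, K=5, sigma=0.5, trials=4000,
                                rng=random.Random(1))
closed_form = 5 * 0.5 ** 2  # K * sigma^2
print(estimate, closed_form)
```

The gradient itself never enters the computation, which reflects that under additive noise the distortion depends only on $K$ and $\sigma^2$.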
Channels: At time instant $i$ ($i \in \{1, 2, \ldots, N_t\}$) of the $t$-th communication round, the channel inputs and outputs are given by
$$Y_i(t) = h X_i(t) + \eta_{1,i}(t), \quad i = 1, 2, \ldots, N_t, \tag{2.6}$$
$$\tilde{Y}_i(t) = \tilde{h} \tilde{X}_i(t) + \eta_{2,i}(t), \quad i = 1, 2, \ldots, N_t - 1, \tag{2.7}$$
$$Z_i(t) = g X_i(t) + \tilde{g} \tilde{X}_i(t) + \eta_{e,i}(t), \quad i = 1, 2, \ldots, N_t, \tag{2.8}$$
where $X_i(t)$ and $\tilde{X}_i(t)$ are the feedforward and feedback channel inputs, respectively, which satisfy the average power constraints $\frac{1}{N_t} \sum_{i=1}^{N_t} E[X_i(t) X_i(t)^H] \le P$ and $\frac{1}{N_t - 1} \sum_{i=1}^{N_t - 1} E[\tilde{X}_i(t) \tilde{X}_i(t)^H] \le \tilde{P}$. $h, \tilde{h}, g, \tilde{g} \in \mathbb{C}$ are the CSI of the feedforward and feedback channels of the cloud server,