Adversarial network training using higher-order
moments in a modified Wasserstein distance
Oliver Serang
A-Alpha Bio, Seattle, WA, USA
oserang@aalphabio.com
October 10, 2022
Abstract
Generative-adversarial networks (GANs) have been used to produce data closely resembling example data in a compressed, latent space that is close to sufficient for reconstruction in the original vector space. The Wasserstein metric has been used as an alternative to binary cross-entropy, producing more numerically stable GANs with greater mode-covering behavior. Here, a generalization of the Wasserstein distance, using higher-order moments than the mean, is derived. Training a GAN with this higher-order Wasserstein metric is demonstrated to exhibit superior performance, even when adjusted for slightly higher computational cost. This is illustrated by generating synthetic antibody sequences.
1 Introduction
1.1 Generative-adversarial network
The generative-adversarial network (GAN) is a game-theoretic technique for generating values according to a latent distribution estimated on $n$ example data $x \in \mathbb{R}^{n \times \ell \times u}$ [1]. GANs employ a generator, $g: \mathbb{R}^y \to \mathbb{R}^{\ell \times u}$, which maps high-entropy inputs to an imitation datum; these high-entropy inputs $\in \mathbb{R}^y$ effectively determine a location in the latent space and are decoded to produce an imitation datum. GANs also employ a discriminator, $d: \mathbb{R}^{\ell \times u} \to [0,1]$, which is used to evaluate the plausibility that a datum is genuine. Generator and discriminator are trained in an adversarial manner, with the goal of reaching an equilibrium where both implicitly encode the distribution of real data in the latent space. If training is successful, $\hat{X} = g(Z)$ (where $Z \sim \mathcal{N}(0,1)^y$) will produce data resembling a row of $x$; $d$ will correspond to the cumulative density in a unimodal latent space where the latent space density projects the empirical distribution of $x_i$.
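For concreteness, the sketch below shows the shapes involved: a generator $g: \mathbb{R}^y \to \mathbb{R}^{\ell \times u}$ and a discriminator $d: \mathbb{R}^{\ell \times u} \to [0,1]$. The dense layers and the dimensions $y$, $\ell$, $u$ are illustrative assumptions (PyTorch); the paper does not specify an architecture in this section.

```python
# Minimal sketch of the g / d shapes described above (PyTorch).
# The layer sizes and dimensions (y, ell, u) are illustrative assumptions.
import torch
import torch.nn as nn

y, ell, u = 16, 8, 4  # latent input size and datum shape (assumed)

# g : R^y -> R^{ell x u}, maps a high-entropy input Z to an imitation datum
g = nn.Sequential(nn.Linear(y, 64), nn.ReLU(), nn.Linear(64, ell * u))

# d : R^{ell x u} -> [0, 1], scores the plausibility that a datum is genuine
d = nn.Sequential(nn.Linear(ell * u, 64), nn.ReLU(),
                  nn.Linear(64, 1), nn.Sigmoid())

Z = torch.randn(32, y)                 # Z ~ N(0,1)^y, batch of 32
X_hat = g(Z).view(-1, ell, u)          # imitation data resembling rows of x
scores = d(X_hat.view(-1, ell * u))    # plausibility scores in [0, 1]
```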
1.2 Cross-entropy loss
GANs are typically trained using a cross-entropy loss to optimize the parameters of both $g$ and $d$, which measures the expected bits of surprise that samples from a foreground distribution would produce if they had been drawn from a background distribution. The parameters $\theta_g$ are optimized to minimize the surprise of the Bernoulli distribution $\left(1 - d(g(Z)),\, d(g(Z))\right)$ given the background distribution $(0, 1)$ (i.e., minimizing the surprise from a background that scores $d(\hat{X}) = 1$). The parameters $\theta_d$ are optimized to minimize the surprise of the Bernoulli distribution $\left(1 - d(x),\, d(x)\right)$ given the background distribution with $d(x_i) = 1$ and $d(\hat{X}) = 0$.
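A minimal sketch of these two cross-entropy objectives, continuing the toy $g$, $d$, and $\hat{X}$ from the previous sketch; the batch size and the stand-in real data are assumptions.

```python
# Hedged sketch of the cross-entropy (BCE) objectives for theta_d and theta_g,
# reusing g, d, ell, u, and X_hat from the previous sketch.
import torch
import torch.nn as nn

bce = nn.BCELoss()
real_batch = torch.randn(32, ell * u)  # stand-in for rows of the real data x

# theta_d: background scores d(x_i) = 1 for real rows, d(X_hat) = 0 for imitations
d_loss = bce(d(real_batch), torch.ones(32, 1)) \
       + bce(d(X_hat.view(-1, ell * u).detach()), torch.zeros(32, 1))

# theta_g: minimize surprise against a background that scores d(X_hat) = 1
g_loss = bce(d(X_hat.view(-1, ell * u)), torch.ones(32, 1))
```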
1.3 Wasserstein metric loss
When two distributions are highly dissimilar from one another, their supports may be distinct such that cross-entropy becomes numerically unstable. This produces uninformative loss values: two distributions whose supports are non-overlapping but nearby are quantified identically to two distributions whose supports are non-overlapping and very far from one another. These factors lead to poor training, particularly given that $g$ will initially produce noise, which will quite likely have poor overlap with real data in the latent space.
For this reason, Wasserstein distance was proposed to replace cross-entropy [2]. Wasserstein distance is the continuous version of the discrete earth-mover distance, which solves an optimal transport problem measuring the minimal movements in Euclidean distance that could be used to transform one probability density into another. Earth-mover distance is well defined even when the two distributions have disjoint support. This avoids mode collapse while training.
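As a small numeric illustration of this point (a sketch, not taken from the paper): for one-dimensional empirical distributions with disjoint supports, the earth-mover distance remains finite and grows with how far apart the supports are, whereas a cross-entropy-style comparison is equally infinite in both cases.

```python
# 1-D illustration with SciPy: earth-mover distance is well defined for
# disjoint supports and reflects how far apart the distributions are.
from scipy.stats import wasserstein_distance

near = wasserstein_distance([0.0, 1.0], [2.0, 3.0])    # disjoint, nearby
far  = wasserstein_distance([0.0, 1.0], [50.0, 51.0])  # disjoint, far apart
print(near, far)  # 2.0 vs. 50.0; cross-entropy would be infinite for both
```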
If earth-mover distance is used to measure the distance between distributions $p_A$ and $p_B$, then the set of candidate solutions $\gamma$ will be functions with domain $\mathrm{supp}(p_A) \times \mathrm{supp}(p_B)$ and whose marginals equal $p_A$ and $p_B$. Thus, $EM(p_A, p_B) = \inf_{\gamma \in \Pi(p_A, p_B)} E_{a,b \sim \gamma}\left[\|a - b\|\right]$, where $\Pi(p_A, p_B)$ is the set of distributions with marginals $p_A, p_B$.
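The definition above can be evaluated directly in the discrete case by solving a small linear program over couplings $\gamma$ whose marginals are $p_A$ and $p_B$; the support points and weights below are illustrative assumptions.

```python
# Sketch: discrete earth-mover distance as a linear program over couplings
# gamma in Pi(p_A, p_B), i.e. nonnegative matrices with row sums p_A and
# column sums p_B, minimizing E_{a,b ~ gamma} ||a - b||.
import numpy as np
from scipy.optimize import linprog

a_supp, p_A = np.array([0.0, 1.0, 2.0]), np.array([0.5, 0.3, 0.2])
b_supp, p_B = np.array([0.5, 2.5]), np.array([0.6, 0.4])

n, m = len(p_A), len(p_B)
cost = np.abs(a_supp[:, None] - b_supp[None, :]).ravel()  # ||a - b|| per pair

A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums of gamma equal p_A
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column sums of gamma equal p_B
b_eq = np.concatenate([p_A, p_B])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print(res.fun)  # EM(p_A, p_B)
```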
The discrete formulation can be solved combinatorially via LP; however, the continuous formulation, Wasserstein distance, is computed via the Kantorovich-Rubinstein dual [3], which we show below.
$$W(p_A, p_B) = \inf_{\gamma \in \Pi(p_A, p_B)} E_{a,b \sim \gamma}\left[\|a - b\|\right] = \inf_{\gamma}\; E_{a,b \sim \gamma}\left[\|a - b\|\right] + \begin{cases} 0, & \gamma \in \Pi(p_A, p_B) \\ \infty, & \text{else.} \end{cases}$$
The penalty term, here named $\lambda(p_A, p_B, \gamma)$, can be recreated using an adversarial critic function, $f$, which has a unitless codomain:
$$\lambda(p_A, p_B, \gamma) = \sup_f\; E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')] - E_{a,b \sim \gamma}\left[f(a) - f(b)\right] = \begin{cases} 0, & \gamma \in \Pi(p_A, p_B) \\ \infty, & \text{else.} \end{cases}$$
$\lambda(p_A, p_B, \gamma) = \infty$ is achieved when $\gamma \notin \Pi(p_A, p_B)$ because $f$ can be made s.t., w.l.o.g., $|f(a)| \gg 1$ at the value $a$ where $p_A(a) \neq \int \gamma(a, b)\, \partial b$.
Thus,
$$W(p_A, p_B) = \inf_{\gamma} \sup_f\; E_{a,b \sim \gamma}\left[\|a - b\| + f(b) - f(a)\right] + E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')].$$
We can further reorder $\inf_\gamma \sup_f$ to $\sup_f \inf_\gamma$: For any function $t$, let $h(\beta) = \inf_\alpha t(\alpha, \beta)$; then $\forall \alpha,\; h(\beta) \le t(\alpha, \beta)$, and hence $\sup_\beta h(\beta) \le \sup_\beta t(\alpha, \beta)$ for every $\alpha$. Taking the infimum over $\alpha$ gives $\sup_\beta \inf_\alpha t(\alpha, \beta) \le \inf_\alpha \sup_\beta t(\alpha, \beta)$ (i.e., weak duality). Furthermore, if $t$ is convex in $\alpha$ and concave in $\beta$, then the minimax principle yields $\inf_\alpha \sup_\beta t(\alpha, \beta) = \sup_\beta \inf_\alpha t(\alpha, \beta)$ (i.e., strong duality). Because $W$ is convex in $\gamma$ (here manifest via convexity in $a, b$) and concave in $f$ (manifest via concave uses of $f$ rather than concavity of $f$ itself), we have
$$W(p_A, p_B) = \sup_f \inf_\gamma\; E_{a,b \sim \gamma}\left[\|a - b\| + f(b) - f(a)\right] + E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')]$$
$$= \sup_f\; E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')] + \inf_\gamma E_{a,b \sim \gamma}\left[\|a - b\| + f(b) - f(a)\right].$$
The $\inf_\gamma$ is achieved by concentrating the mass of $\gamma$ where $\|a - b\| + f(b) - f(a) < 0$ and setting $\gamma = 0$ wherever $\|a - b\| + f(b) - f(a) \ge 0$. Thus $\inf_\gamma E_{a,b \sim \gamma}\left[\|a - b\| + f(b) - f(a)\right] \le 0$. Wherever $\frac{f(a) - f(b)}{\|a - b\|} > 1$, the dual penalty term will become $-\infty$, and so we need only consider $f$ s.t. $\frac{f(a) - f(b)}{\|a - b\|} \le 1$. This is equivalent to constraining $f$ s.t. all secants have a maximum slope of 1 (i.e., Lipschitz $\|f\|_L \le 1$), which yields the weakest penalty, 0:
$$W(p_A, p_B) = \sup_{f: \|f\|_L \le 1}\; E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')].$$
In WGAN training, our critic functions as $f$, exploiting differences between real and generated sequences. The critic loss function is simply the mean critic value of generated sequences minus the mean critic value of real sequences; minimizing this loss maximizes discrimination, with real sequences awarded higher critic scores. With the goal of attaining Lipschitz continuity on $f$, we constrain its parameters $\theta_f$, clipping them to small values.
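A hedged sketch of this critic step (PyTorch), continuing the toy $g$ from the earlier sketch: the loss is the mean critic value of generated sequences minus the mean critic value of real sequences, and $\theta_f$ is clipped after each update to encourage $\|f\|_L \le 1$. The clipping range, optimizer, and learning rate are illustrative assumptions, not values taken from the paper.

```python
# Sketch of one WGAN critic update with weight clipping; the critic
# architecture, clip range, optimizer, and learning rate are assumptions.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(ell * u, 64), nn.ReLU(), nn.Linear(64, 1))
opt_f = torch.optim.RMSprop(f.parameters(), lr=5e-5)

real = torch.randn(32, ell * u)          # stand-in for real sequences
fake = g(torch.randn(32, y)).detach()    # generated sequences, g as above

critic_loss = f(fake).mean() - f(real).mean()  # minimizing raises f(real), lowers f(fake)
opt_f.zero_grad()
critic_loss.backward()
opt_f.step()

for p in f.parameters():                 # clip theta_f toward ||f||_L <= 1
    p.data.clamp_(-0.01, 0.01)
```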