Adversarial network training using higher-order
moments in a modified Wasserstein distance
Oliver Serang
A-Alpha Bio, Seattle, WA, USA
oserang@aalphabio.com
October 10, 2022
Abstract
Generative-adversarial networks (GANs) have been used to produce data closely resembling example data in a compressed, latent space that is close to sufficient for reconstruction in the original vector space. The Wasserstein metric has been used as an alternative to binary cross-entropy, producing more numerically stable GANs with greater mode-covering behavior. Here, a generalization of the Wasserstein distance, using higher-order moments than the mean, is derived. Training a GAN with this higher-order Wasserstein metric is demonstrated to exhibit superior performance, even when adjusted for slightly higher computational cost. This is illustrated by generating synthetic antibody sequences.
1 Introduction
1.1 Generative-adversarial network
The generative-adversarial network (GAN) is a game-theoretic technique for generating values according to a latent distribution estimated on $n$ example data $x \in \mathbb{R}^{n \times \ell \times u}$ [1]. GANs employ a generator, $g: \mathbb{R}^y \to \mathbb{R}^{\ell \times u}$, which maps high-entropy inputs to an imitation datum; these high-entropy inputs $\in \mathbb{R}^y$ effectively determine a location in the latent space and are decoded to produce an imitation datum. GANs also employ a discriminator, $d: \mathbb{R}^{\ell \times u} \to [0,1]$, which is used to evaluate the plausibility that a datum is genuine. Generator and discriminator are trained in an adversarial manner, with the goal of reaching an equilibrium where both implicitly encode the distribution of real data in the latent space. If training is successful, $\hat{X} = g(Z)$ (where $Z \sim \mathcal{N}(0,1)^y$) will produce data resembling a row of $x$; $d$ will correspond to the cumulative density in a unimodal latent space where the latent space density projects the empirical distribution of $x_i$.
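For concreteness, the sketch below shows the shapes involved: a generator $g: \mathbb{R}^y \to \mathbb{R}^{\ell \times u}$ and a discriminator $d: \mathbb{R}^{\ell \times u} \to [0,1]$. The dense layers and the dimensions $y$, $\ell$, $u$ are illustrative assumptions (PyTorch); the paper does not specify an architecture in this section.

```python
# Minimal sketch of the g / d shapes described above (PyTorch).
# The layer sizes and dimensions (y, ell, u) are illustrative assumptions.
import torch
import torch.nn as nn

y, ell, u = 16, 8, 4  # latent input size and datum shape (assumed)

# g : R^y -> R^{ell x u}, maps a high-entropy input Z to an imitation datum
g = nn.Sequential(nn.Linear(y, 64), nn.ReLU(), nn.Linear(64, ell * u))

# d : R^{ell x u} -> [0, 1], scores the plausibility that a datum is genuine
d = nn.Sequential(nn.Linear(ell * u, 64), nn.ReLU(),
                  nn.Linear(64, 1), nn.Sigmoid())

Z = torch.randn(32, y)                 # Z ~ N(0,1)^y, batch of 32
X_hat = g(Z).view(-1, ell, u)          # imitation data resembling rows of x
scores = d(X_hat.view(-1, ell * u))    # plausibility scores in [0, 1]
```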
1.2 Cross-entropy loss
GANs are typically trained using a cross-entropy loss to optimize the parameters of both $g$ and $d$, which measures the expected bits of surprise that samples from a foreground distribution would produce if they had been drawn from a background distribution. The parameters $\theta_g$ are optimized to minimize the surprise of the Bernoulli distribution $\left(1 - d(g(Z)),\, d(g(Z))\right)$ given the background distribution $(0, 1)$ (i.e., minimizing the surprise from a background that scores $d(\hat{X}) = 1$). The parameters $\theta_d$ are optimized to minimize the surprise of the Bernoulli distribution $\left(1 - d(x),\, d(x)\right)$ given the background distribution with $d(x_i) = 1$ and $d(\hat{X}) = 0$.
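A minimal sketch of these two cross-entropy objectives, continuing the toy $g$, $d$, and $\hat{X}$ from the previous sketch; the batch size and the stand-in real data are assumptions.

```python
# Hedged sketch of the cross-entropy (BCE) objectives for theta_d and theta_g,
# reusing g, d, ell, u, and X_hat from the previous sketch.
import torch
import torch.nn as nn

bce = nn.BCELoss()
real_batch = torch.randn(32, ell * u)  # stand-in for rows of the real data x

# theta_d: background scores d(x_i) = 1 for real rows, d(X_hat) = 0 for imitations
d_loss = bce(d(real_batch), torch.ones(32, 1)) \
       + bce(d(X_hat.view(-1, ell * u).detach()), torch.zeros(32, 1))

# theta_g: minimize surprise against a background that scores d(X_hat) = 1
g_loss = bce(d(X_hat.view(-1, ell * u)), torch.ones(32, 1))
```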
1.3 Wasserstein metric loss
When two distributions are highly dissimilar from one another, their supports may be distinct such that cross-entropy becomes numerically unstable. This produces uninformative loss values: two distributions whose supports are non-overlapping but nearby are quantified identically to two distributions whose supports are non-overlapping and very far from one another. These factors lead to poor training, particularly given that $g$ will initially produce noise, which will quite likely have poor overlap with real data in the latent space.
For this reason, Wasserstein distance was proposed to replace cross-entropy [2]. Wasserstein distance is the continuous version of the discrete earth-mover distance, which solves an optimal transport problem measuring the minimal movements in Euclidean distance that could be used to transform one probability density into another. Earth-mover distance is well defined even when the two distributions have disjoint support. This avoids mode collapse while training.
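As a small numeric illustration of this point (a sketch, not taken from the paper): for one-dimensional empirical distributions with disjoint supports, the earth-mover distance remains finite and grows with how far apart the supports are, whereas a cross-entropy-style comparison is equally infinite in both cases.

```python
# 1-D illustration with SciPy: earth-mover distance is well defined for
# disjoint supports and reflects how far apart the distributions are.
from scipy.stats import wasserstein_distance

near = wasserstein_distance([0.0, 1.0], [2.0, 3.0])    # disjoint, nearby
far  = wasserstein_distance([0.0, 1.0], [50.0, 51.0])  # disjoint, far apart
print(near, far)  # 2.0 vs. 50.0; cross-entropy would be infinite for both
```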
If earth-mover distance is used to measure the distance between distributions $p_A$ and $p_B$, then the set of candidate solutions $\gamma$ will be functions with domain $\mathrm{supp}(p_A) \times \mathrm{supp}(p_B)$ and whose marginals equal $p_A$ and $p_B$. Thus, $EM(p_A, p_B) = \inf_{\gamma \in \Pi(p_A, p_B)} E_{a,b \sim \gamma}\left[\|a - b\|\right]$, where $\Pi(p_A, p_B)$ is the set of distributions with marginals $p_A, p_B$.
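The definition above can be evaluated directly in the discrete case by solving a small linear program over couplings $\gamma$ whose marginals are $p_A$ and $p_B$; the support points and weights below are illustrative assumptions.

```python
# Sketch: discrete earth-mover distance as a linear program over couplings
# gamma in Pi(p_A, p_B), i.e. nonnegative matrices with row sums p_A and
# column sums p_B, minimizing E_{a,b ~ gamma} ||a - b||.
import numpy as np
from scipy.optimize import linprog

a_supp, p_A = np.array([0.0, 1.0, 2.0]), np.array([0.5, 0.3, 0.2])
b_supp, p_B = np.array([0.5, 2.5]), np.array([0.6, 0.4])

n, m = len(p_A), len(p_B)
cost = np.abs(a_supp[:, None] - b_supp[None, :]).ravel()  # ||a - b|| per pair

A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums of gamma equal p_A
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column sums of gamma equal p_B
b_eq = np.concatenate([p_A, p_B])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print(res.fun)  # EM(p_A, p_B)
```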
The discrete formulation can be solved combinatorially via LP; however, the continuous formulation, Wasserstein distance, is computed via the Kantorovich-Rubinstein dual [3], which we show below.
$$W(p_A, p_B) = \inf_{\gamma \in \Pi(p_A, p_B)} E_{a,b \sim \gamma}\left[\|a - b\|\right] = \inf_{\gamma}\; E_{a,b \sim \gamma}\left[\|a - b\|\right] + \begin{cases} 0, & \gamma \in \Pi(p_A, p_B) \\ \infty, & \text{else.} \end{cases}$$
The penalty term, here named $\lambda(p_A, p_B, \gamma)$, can be recreated using an adversarial critic function, $f$, which has a unitless codomain:
$$\lambda(p_A, p_B, \gamma) = \sup_f\; E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')] - E_{a,b \sim \gamma}\left[f(a) - f(b)\right] = \begin{cases} 0, & \gamma \in \Pi(p_A, p_B) \\ \infty, & \text{else.} \end{cases}$$
$\lambda(p_A, p_B, \gamma) = \infty$ is achieved when $\gamma \notin \Pi(p_A, p_B)$ because $f$ can be made s.t., w.l.o.g., $|f(a)| \gg 1$ at the value $a$ where $p_A(a) \neq \int \gamma(a, b)\, \partial b$.
Thus,
$$W(p_A, p_B) = \inf_{\gamma} \sup_f\; E_{a,b \sim \gamma}\left[\|a - b\| + f(b) - f(a)\right] + E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')].$$
We can further reorder $\inf_\gamma \sup_f$ to $\sup_f \inf_\gamma$: For any function $t$, let $h(\beta) = \inf_\alpha t(\alpha, \beta)$; then $\forall \alpha,\; h(\beta) \le t(\alpha, \beta)$, and hence $\sup_\beta h(\beta) \le \sup_\beta t(\alpha, \beta)$ for every $\alpha$. Taking the infimum over $\alpha$ gives $\sup_\beta \inf_\alpha t(\alpha, \beta) \le \inf_\alpha \sup_\beta t(\alpha, \beta)$ (i.e., weak duality). Furthermore, if $t$ is convex in $\alpha$ and concave in $\beta$, then the minimax principle yields $\inf_\alpha \sup_\beta t(\alpha, \beta) = \sup_\beta \inf_\alpha t(\alpha, \beta)$ (i.e., strong duality). Because $W$ is convex in $\gamma$ (here manifest via convexity in $a, b$) and concave in $f$ (manifest via concave uses of $f$ rather than concavity of $f$ itself), we have
$$W(p_A, p_B) = \sup_f \inf_\gamma\; E_{a,b \sim \gamma}\left[\|a - b\| + f(b) - f(a)\right] + E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')]$$
$$= \sup_f\; E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')] + \inf_\gamma E_{a,b \sim \gamma}\left[\|a - b\| + f(b) - f(a)\right].$$
The $\inf_\gamma$ is achieved by concentrating the mass of $\gamma$ where $\|a - b\| + f(b) - f(a) < 0$ and setting $\gamma = 0$ wherever $\|a - b\| + f(b) - f(a) \ge 0$. Thus $\inf_\gamma E_{a,b \sim \gamma}\left[\|a - b\| + f(b) - f(a)\right] \le 0$. Wherever $\frac{f(a) - f(b)}{\|a - b\|} > 1$, the dual penalty term will become $-\infty$, and so we need only consider $f$ s.t. $\frac{f(a) - f(b)}{\|a - b\|} \le 1$. This is equivalent to constraining $f$ s.t. all secants have a maximum slope of 1 (i.e., Lipschitz $\|f\|_L \le 1$), which yields the weakest penalty, 0:
$$W(p_A, p_B) = \sup_{f: \|f\|_L \le 1}\; E_{a' \sim p_A}[f(a')] - E_{b' \sim p_B}[f(b')].$$
In WGAN training, our critic functions as $f$, exploiting differences between real and generated sequences. The critic loss function is simply the mean critic value of generated sequences minus the mean critic value of real sequences; minimizing this loss maximizes discrimination, with real sequences awarded higher critic scores. With the goal of attaining Lipschitz continuity on $f$, we constrain its parameters $\theta_f$, clipping them to small values.
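A hedged sketch of this critic step (PyTorch), continuing the toy $g$ from the earlier sketch: the loss is the mean critic value of generated sequences minus the mean critic value of real sequences, and $\theta_f$ is clipped after each update to encourage $\|f\|_L \le 1$. The clipping range, optimizer, and learning rate are illustrative assumptions, not values taken from the paper.

```python
# Sketch of one WGAN critic update with weight clipping; the critic
# architecture, clip range, optimizer, and learning rate are assumptions.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(ell * u, 64), nn.ReLU(), nn.Linear(64, 1))
opt_f = torch.optim.RMSprop(f.parameters(), lr=5e-5)

real = torch.randn(32, ell * u)          # stand-in for real sequences
fake = g(torch.randn(32, y)).detach()    # generated sequences, g as above

critic_loss = f(fake).mean() - f(real).mean()  # minimizing raises f(real), lowers f(fake)
opt_f.zero_grad()
critic_loss.backward()
opt_f.step()

for p in f.parameters():                 # clip theta_f toward ||f||_L <= 1
    p.data.clamp_(-0.01, 0.01)
```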