Generative Adversarial Nets

最終更新: 2017-02-27 00:37

Generative Adversarial Nets


GAN is composed of two models:

pg(x)p_g(x): generator's distribution
pz(z)p_z(z): prior distribution on input noize variable zz
G(z;θg)G(z; \theta_g): differentiable function represented by MLP with parameter θg\theta_g
D(x;θd)D(x; \theta_d): differentiable function represented by MLP with parameter θd\theta_d that outputs a single scalar

The goal of the training is to maximize the probability to assign correct label for DD, and to minimize the probability to assign correct label for GG. Therefore, the objection is

minGmaxDV(D,G)=Expdata(x)[logD(x)]+Ezpz(z)[log(1D(G(z)))]\min_G \max_D V(D, G) = \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]

In practice, it is better to train GG to maximize logD(G(z))\log D(G(z)), because log(1D(G(z)))\log(1-D(G(z))) easily saturates in early stage of training.

Theoretical Analysis


When GG is fixed, the optimial discriminator DD is
DG(x)=pdata(x)pdata(x)+pg(x) D^{*}_{G}(x) = \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}

V(G,D)=xpdata(x)log(D(x))dx+zpz(z)log(1D(G(z)))dz=xpdata(x)log(D(x))dx+pg(x)log(1D(x))dx \begin{aligned} V(G, D) &= \int_x p_{data}(x) \log(D(x))dx + \int_z p_z(z) \log(1-D(G(z)))dz \\ \\ &= \int_x p_{data}(x) \log(D(x))dx + p_g(x) \log(1-D(x))dx \end{aligned}

The maximum of
f(x)=alog(x)+blog(1x) f(x) = a\log(x)+b\log(1-x)
is attained when x=aa+bx = \frac{a}{a+b} when the domain is limited to [0,1][0, 1], because
f(x)=axb1x f'(x) = \frac{a}{x}-\frac{b}{1-x}
and f(x)=0f'(x)=0 when x=aa+bx = \frac{a}{a+b}.

Theorem 1

The global optimum of C(G)=maxDV(G,D)C(G) = \max_D V(G, D) is achieved if and only if pg=pdatap_g = p_{data}

C(G)=maxDV(G,D)=Expdata(x)[logD(x)]+Ezpz(z)[log(1D(G(z)))]=Expdata(x)[logpdata(x)pdata(x)+pg(x)]+Expg[logpdata(x)pg(x)+pg(x)]=Expdata(x)[log2+logpdata(x)(pdata(x)+pg(x))/2]+Expg[log2+logpdata(x)(pg(x)+pg(x))/2]=log4+KL(pdatapdata+pg2)+KL(pgpdata+pg2)=log4+2JSD(pdatapg) \begin{aligned} C(G) &= \max_D V(G, D) \\ &= \mathbb{E}_{x\sim p_{data}(x)}[\log D^*(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1-D^*(G(z)))] \\ \\ &= \mathbb{E}_{x\sim p_{data}(x)}\left[\log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right] + \mathbb{E}_{x\sim p_g}\left[\log \frac{p_{data}(x)}{p_{g}(x)+p_g(x)}\right] \\ \\ &= \mathbb{E}_{x\sim p_{data}(x)}\left[-\log 2 + \log \frac{p_{data}(x)}{(p_{data}(x)+p_g(x))/2}\right] + \mathbb{E}_{x\sim p_g}\left[-\log 2 + \log \frac{p_{data}(x)}{(p_{g}(x)+p_g(x))/2}\right] \\ \\ &= -\log 4 + \text{KL}\left(p_{data} \mid\mid \frac{p_{data}+p_g}{2}\right) + \text{KL}\left(p_g \mid\mid \frac{p_{data}+p_g}{2}\right) \\ \\ &= -\log 4 + 2 \cdot \text{JSD} (p_{data}\mid \mid p_g) \end{aligned}

This concludes the proof.

Proposition 2

If GG and DD have enough capacity, and each step of Algorithm 1, the discriminator is allowed to reach its optimum given GG, and pgp_g is updated so as to improve the criterion V(G,D)V(G, D), then pgp_g converges to pdatap_{data}

Since global minima can be attained by gradient descent and its global optima is attaied when pg=pdatap_g = p_{data}, the conclusion follows.