Table of contents (short version)


UNIT

  • Research area
    • Image-to-image translation: mapping an image in one domain to a corresponding image in another domain
    • Unsupervised I2I: no paired examples are available (the setting considered here)
  • Challenging issues in I2I problems
    • I2I problem: the key challenge is to learn a joint distribution of images in different domains (from a probabilistic-modeling perspective)
    • Unsupervised I2I problem: the two training sets consist of images drawn from the two marginal distributions, and the joint distribution must be inferred from them
      • Technical
        • Goal: estimate the two conditionals p(x2 | x1) and p(x1 | x2) with learned I2I translation models
        • These conditionals are complex and multimodal distributions
    • The coupling theory [2]: in general, there exists an infinite set of joint distributions that can yield the given marginal distributions
      • Therefore, inferring the joint distribution from the marginal distributions is a highly ill-posed problem
  • Method
    • Make a shared-latent space assumption
      • A pair of corresponding images in different domains can be mapped to the same latent representation in a shared latent space
    • Basic concept: coupled GAN [3]
    • Architecture: 2 encoders (E1, E2), 2 decoders/generators (G1, G2), 2 discriminators (D1, D2); the high-level layers of the encoders and decoders are tied (weight sharing)
    • Role
      • {E1, G1} / {E2, G2}: VAE for domain X1 / X2
      • {E1, G2} / {E2, G1}: image translator X1 → X2 / X2 → X1
      • {G1, D1} / {G2, D2}: GAN for domain X1 / X2
      • {E1, G1, D1}: VAE-GAN [4]
      • {G1, G2, D1, D2}: CoGAN [3]
    • Loss (minimized over E1, E2, G1, G2; maximized over D1, D2)
      • VAE: L_VAE1(E1, G1) = λ1 KL(q1(z1 | x1) ‖ p(z)) − λ2 E_{z1∼q1}[log p_G1(x1 | z1)] (and symmetrically for domain 2)
      • GAN: L_GAN1(E2, G1, D1) = E_{x1}[log D1(x1)] + E_{z2∼q2}[log(1 − D1(G1(z2)))] (and symmetrically)
      • Cycle consistency: L_CC1 encourages the round trip x1 → x1→2 → x1 to reconstruct the input (and symmetrically)
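The shared-latent space assumption above can be sketched with toy linear stand-ins for the two encoders and two decoders (a minimal sketch; the names E1/E2/G1/G2, the single shared matrix standing in for the tied high-level layers, and all dimensions are illustrative, not the paper's CNN implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_X, DIM_Z = 8, 4  # toy image and latent dimensions (illustrative)

# Toy linear encoders/decoders standing in for E1, E2, G1, G2.
# The "tied high-level layers" idea is mimicked by one shared matrix
# applied at the latent end of both encoders and the start of both decoders.
W_shared = rng.normal(size=(DIM_Z, DIM_Z))
A1, A2 = rng.normal(size=(DIM_Z, DIM_X)), rng.normal(size=(DIM_Z, DIM_X))
B1, B2 = rng.normal(size=(DIM_X, DIM_Z)), rng.normal(size=(DIM_X, DIM_Z))

def E1(x): return W_shared @ (A1 @ x)   # domain-1 encoder -> shared latent
def E2(x): return W_shared @ (A2 @ x)   # domain-2 encoder -> shared latent
def G1(z): return B1 @ (W_shared @ z)   # shared latent -> domain-1 image
def G2(z): return B2 @ (W_shared @ z)   # shared latent -> domain-2 image

x1 = rng.normal(size=DIM_X)
z = E1(x1)          # shared latent code of the domain-1 image
x12 = G2(z)         # translation x1 -> domain 2
x121 = G1(E2(x12))  # cycle back to domain 1 (cycle-consistency path)
```

Translation is just encode-in-one-domain, decode-in-the-other; the x1 → x1→2 → x1 path is the one the cycle-consistency term constrains.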
  • References
    • [1] Liu, Ming-Yu, Thomas Breuel, and Jan Kautz. “Unsupervised image-to-image translation networks.” Advances in neural information processing systems. 2017.
    • [2] T. Lindvall. Lectures on the coupling method. Courier Corporation, 2002.
    • [3] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. Advances in Neural Information Processing Systems, 2016.
    • [4] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. International Conference on Machine Learning, 2016.

MUNIT

  • Research area: Unsupervised I2I
  • Challenging issues
    • Existing methods assume a deterministic (e.g., UNIT) or unimodal mapping and therefore fail to capture the full distribution of possible outputs
  • Method
    • Make a partially shared latent space assumption
      • Assume that the image representation can be decomposed into a shared content code (domain-invariant) and a style code (domain-specific)
      • To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain
      • We further assume that G1 and G2 are deterministic functions with inverse encoders E1 = G1⁻¹ and E2 = G2⁻¹
      • Note that although the encoders and decoders are deterministic, p(x1→2 | x1) is a continuous distribution because of its dependency on the randomly sampled style code
      • This enables many-to-many cross-domain mappings
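A minimal numpy sketch of the recombination idea (linear stand-ins for the content encoder and decoder; all names and dimensions are illustrative): the content code of a domain-1 image is combined with style codes drawn from the domain-2 style prior, and different style codes give different translations of the same content.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_X, DIM_C, DIM_S = 8, 4, 2  # toy image/content/style dims (illustrative)

# Linear stand-ins for the domain-1 content encoder and the domain-2 decoder.
Wc = rng.normal(size=(DIM_C, DIM_X))
Wg = rng.normal(size=(DIM_X, DIM_C + DIM_S))

def enc_content_1(x):     # E1^c: image -> domain-invariant content code
    return Wc @ x

def dec_2(c, s):          # G2: (content code, style code) -> domain-2 image
    return Wg @ np.concatenate([c, s])

x1 = rng.normal(size=DIM_X)
c1 = enc_content_1(x1)

# Two different style codes from the domain-2 style prior N(0, I)
# give two different translations of the same content.
s_a, s_b = rng.normal(size=DIM_S), rng.normal(size=DIM_S)
x12_a, x12_b = dec_2(c1, s_a), dec_2(c1, s_b)
```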
    • Architecture
      • Content encoder: downsampling layers + residual blocks → content code; the decoder combines it with style via residual blocks, followed by upsampling layers
      • Style encoder: downsampling layers + global average pooling + FC → style code; an MLP maps the style code to AdaIN parameters used in the decoder's residual blocks
      • Randomly draw style code from prior distribution
        • Although the prior distribution is unimodal, the output image distribution can be multimodal thanks to the nonlinearity of the decoder
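The style path can be sketched as AdaIN: the MLP (here reduced to a single linear map, purely illustrative) turns the style code into per-channel scale/shift parameters that re-normalize the decoder's content features, so each style sample re-styles the same content differently.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, DIM_S = 3, 4, 4, 2  # channels, spatial size, style dim (toy values)

def adain(feat, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization: normalize each channel of the
    content feature map, then scale/shift with style-derived parameters."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (feat - mu) / (sigma + eps) + beta[:, None, None]

# Toy "MLP": a single linear map from style code to (gamma, beta) per channel.
W_mlp = rng.normal(size=(2 * C, DIM_S))
style = rng.normal(size=DIM_S)
gamma, beta = np.split(W_mlp @ style, 2)

content_feat = rng.normal(size=(C, H, W))
out = adain(content_feat, gamma, beta)
# After AdaIN, each channel's mean is beta and its std is |gamma|:
# the style code fully controls the feature statistics.
```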
    • Loss
      • Bidirectional reconstruction loss: ensures the encoders and decoders are inverses
        • Image reconstruction: L_recon^x1 = E_x1[ ‖G1(E1^c(x1), E1^s(x1)) − x1‖_1 ]
        • Content reconstruction: L_recon^c1 = E_{c1, s2}[ ‖E2^c(G2(c1, s2)) − c1‖_1 ]
        • Style reconstruction: L_recon^s2 = E_{c1, s2}[ ‖E2^s(G2(c1, s2)) − s2‖_1 ]
        • Meaning
          • L1 norm: encourages sharp output images
          • Style reconstruction: encourages diverse outputs given different style codes
          • Content reconstruction: encourages the translated image to preserve the semantic content of the input image
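The three reconstruction terms can be sketched with toy encoders/decoders that are exact inverses (encoding is just slicing here, an illustrative assumption): when encoder and decoder invert each other, all three L1 terms vanish, which is exactly what the bidirectional reconstruction loss encourages.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_C, DIM_S = 4, 2  # toy content/style dimensions (illustrative)

def l1(a, b):
    """L1 reconstruction loss ||a - b||_1 (per-sample, toy version)."""
    return np.abs(a - b).sum()

# Toy encoders/decoder that are exact inverses: an "image" is just the
# concatenation of its content and style codes, so encoding = slicing.
def encode(x):    return x[:DIM_C], x[DIM_C:]    # E^c, E^s
def decode(c, s): return np.concatenate([c, s])  # G

x1 = rng.normal(size=DIM_C + DIM_S)
c1, s1 = encode(x1)

# Image reconstruction: encode then decode the same image.
L_img = l1(decode(c1, s1), x1)

# Cross-domain path: recombine c1 with a random style code s2, then check
# that both content and style survive the round trip through the decoder.
s2 = rng.normal(size=DIM_S)
x12 = decode(c1, s2)
c_back, s_back = encode(x12)
L_content, L_style = l1(c_back, c1), l1(s_back, s2)
```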
      • Adversarial loss: matches the distribution of translated images to the image distribution in the target domain
        • GAN: L_GAN^x2 = E_{c1, s2}[log(1 − D2(G2(c1, s2)))] + E_x2[log D2(x2)] (and symmetrically for domain 1)
        • Meaning
          • D2: distinguishes translated images from real images in domain 2
          • G2: tries to deceive D2 so that translated images are indistinguishable from real ones
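A minimal numpy sketch of the adversarial term (a sigmoid "discriminator score" stands in for the learned discriminator, purely illustrative): the objective E[log D(x_real)] + E[log(1 − D(x_fake))] is high when D separates real from translated images and drops as the generator fools it.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gan_loss(d_real, d_fake):
    """Standard GAN discriminator objective:
    E[log D(x_real)] + E[log(1 - D(x_fake))], maximized by D."""
    return np.log(d_real).mean() + np.log(1.0 - d_fake).mean()

# Toy discriminator scores: D outputs a probability of being real.
d_real = sigmoid(np.array([2.0, 3.0]))    # confident "real" on real images
d_fake = sigmoid(np.array([-2.0, -3.0]))  # confident "fake" on translations

good_d = gan_loss(d_real, d_fake)         # near 0, the objective's maximum
fooled_d = gan_loss(sigmoid(np.array([0.0])), sigmoid(np.array([0.0])))
# The generator pushes d_fake toward 1, lowering this objective for D.
```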
  • Related work: BicycleGAN can also model continuous and multimodal distributions, but it requires paired supervision
  • References
    • [1] Huang, Xun, et al. “Multimodal unsupervised image-to-image translation.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.