Table of contents (short version)


UNIT

  • Research area
    • Image-to-image translation: mapping an image in one domain to a corresponding image in another domain
    • Unsupervised I2I: no paired examples are available (the setting considered here)
  • Challenging issues in I2I problems
    • I2I problem: the key challenge is to learn a joint distribution of images in different domains (from a probabilistic-modeling perspective)
    • Unsupervised I2I problem: the two training sets consist of images drawn from the two marginal distributions, and the joint distribution must be inferred from them
      • Technical
        • Goal: estimate the two conditionals p(x2 | x1) and p(x1 | x2) with learned I2I translation models
        • These conditionals are complex and multimodal distributions
    • The coupling theory [2]: in general, there exists an infinite set of joint distributions that can yield the given marginal distributions
      • Therefore, inferring the joint distribution from the marginal distributions is a highly ill-posed problem
  • Method
    • Make a shared-latent space assumption
      • A pair of corresponding images in different domains can be mapped to the same latent representation in a shared latent space
    • Basic concept: coupled GAN [3]
    • Architecture: 2 encoders (E1, E2), 2 decoders/generators (G1, G2), 2 discriminators (D1, D2); the high-level layers of the encoders and decoders are tied (weight sharing)
    • Role
      • {E1, G1} / {E2, G2}: VAE for domain X1 / X2
      • {E1, G2} / {E2, G1}: image translator X1 → X2 / X2 → X1
      • {G1, D1} / {G2, D2}: GAN for domain X1 / X2
      • {E1, G1, D1}: VAE-GAN [4]
      • {G1, G2, D1, D2}: CoGAN [3]
    • Loss (minimized over E1, E2, G1, G2; maximized over D1, D2)
      • VAE: L_VAE1(E1, G1) = λ1 KL(q1(z1 | x1) ‖ p(z)) − λ2 E_{z1∼q1}[log p_G1(x1 | z1)] (and symmetrically for domain 2)
      • GAN: L_GAN1(E2, G1, D1) = E_{x1}[log D1(x1)] + E_{z2∼q2}[log(1 − D1(G1(z2)))] (and symmetrically)
      • Cycle consistency: L_CC1 encourages the round trip x1 → x1→2 → x1 to reconstruct the input (and symmetrically)
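The shared-latent space assumption above can be sketched with toy linear stand-ins for the two encoders and two decoders (a minimal sketch; the names E1/E2/G1/G2, the single shared matrix standing in for the tied high-level layers, and all dimensions are illustrative, not the paper's CNN implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_X, DIM_Z = 8, 4  # toy image and latent dimensions (illustrative)

# Toy linear encoders/decoders standing in for E1, E2, G1, G2.
# The "tied high-level layers" idea is mimicked by one shared matrix
# applied at the latent end of both encoders and the start of both decoders.
W_shared = rng.normal(size=(DIM_Z, DIM_Z))
A1, A2 = rng.normal(size=(DIM_Z, DIM_X)), rng.normal(size=(DIM_Z, DIM_X))
B1, B2 = rng.normal(size=(DIM_X, DIM_Z)), rng.normal(size=(DIM_X, DIM_Z))

def E1(x): return W_shared @ (A1 @ x)   # domain-1 encoder -> shared latent
def E2(x): return W_shared @ (A2 @ x)   # domain-2 encoder -> shared latent
def G1(z): return B1 @ (W_shared @ z)   # shared latent -> domain-1 image
def G2(z): return B2 @ (W_shared @ z)   # shared latent -> domain-2 image

x1 = rng.normal(size=DIM_X)
z = E1(x1)          # shared latent code of the domain-1 image
x12 = G2(z)         # translation x1 -> domain 2
x121 = G1(E2(x12))  # cycle back to domain 1 (cycle-consistency path)
```

Translation is just encode-in-one-domain, decode-in-the-other; the x1 → x1→2 → x1 path is the one the cycle-consistency term constrains.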
  • References
    • [1] Liu, Ming-Yu, Thomas Breuel, and Jan Kautz. “Unsupervised image-to-image translation networks.” Advances in neural information processing systems. 2017.
    • [2] T. Lindvall. Lectures on the coupling method. Courier Corporation, 2002.
    • [3] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. Advances in Neural Information Processing Systems, 2016.
    • [4] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. International Conference on Machine Learning, 2016.

MUNIT

  • Research area: Unsupervised I2I
  • Challenging issues
    • Existing methods assume a deterministic (e.g., UNIT) or unimodal mapping and therefore fail to capture the full distribution of possible outputs
  • Method
    • Make a partially shared latent space assumption
      • Assume that the image representation can be decomposed into a shared content code (domain-invariant) and a style code (domain-specific)
      • To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain
      • We further assume that G1 and G2 are deterministic functions with inverse encoders E1 = G1⁻¹ and E2 = G2⁻¹
      • Note that although the encoders and decoders are deterministic, p(x1→2 | x1) is a continuous distribution because of its dependency on the randomly sampled style code
      • This enables many-to-many cross-domain mappings
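A minimal numpy sketch of the recombination idea (linear stand-ins for the content encoder and decoder; all names and dimensions are illustrative): the content code of a domain-1 image is combined with style codes drawn from the domain-2 style prior, and different style codes give different translations of the same content.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_X, DIM_C, DIM_S = 8, 4, 2  # toy image/content/style dims (illustrative)

# Linear stand-ins for the domain-1 content encoder and the domain-2 decoder.
Wc = rng.normal(size=(DIM_C, DIM_X))
Wg = rng.normal(size=(DIM_X, DIM_C + DIM_S))

def enc_content_1(x):     # E1^c: image -> domain-invariant content code
    return Wc @ x

def dec_2(c, s):          # G2: (content code, style code) -> domain-2 image
    return Wg @ np.concatenate([c, s])

x1 = rng.normal(size=DIM_X)
c1 = enc_content_1(x1)

# Two different style codes from the domain-2 style prior N(0, I)
# give two different translations of the same content.
s_a, s_b = rng.normal(size=DIM_S), rng.normal(size=DIM_S)
x12_a, x12_b = dec_2(c1, s_a), dec_2(c1, s_b)
```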
    • Architecture
      • Content encoder: downsampling layers + residual blocks → content code; the decoder combines it with style via residual blocks, followed by upsampling layers
      • Style encoder: downsampling layers + global average pooling + FC → style code; an MLP maps the style code to AdaIN parameters used in the decoder's residual blocks
      • Randomly draw style code from prior distribution
        • Although the prior distribution is unimodal, the output image distribution can be multimodal thanks to the nonlinearity of the decoder
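The style path can be sketched as AdaIN: the MLP (here reduced to a single linear map, purely illustrative) turns the style code into per-channel scale/shift parameters that re-normalize the decoder's content features, so each style sample re-styles the same content differently.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, DIM_S = 3, 4, 4, 2  # channels, spatial size, style dim (toy values)

def adain(feat, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization: normalize each channel of the
    content feature map, then scale/shift with style-derived parameters."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (feat - mu) / (sigma + eps) + beta[:, None, None]

# Toy "MLP": a single linear map from style code to (gamma, beta) per channel.
W_mlp = rng.normal(size=(2 * C, DIM_S))
style = rng.normal(size=DIM_S)
gamma, beta = np.split(W_mlp @ style, 2)

content_feat = rng.normal(size=(C, H, W))
out = adain(content_feat, gamma, beta)
# After AdaIN, each channel's mean is beta and its std is |gamma|:
# the style code fully controls the feature statistics.
```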
    • Loss
      • Bidirectional reconstruction loss: ensures the encoders and decoders are inverses
        • Image reconstruction: L_recon^x1 = E_x1[ ‖G1(E1^c(x1), E1^s(x1)) − x1‖_1 ]
        • Content reconstruction: L_recon^c1 = E_{c1, s2}[ ‖E2^c(G2(c1, s2)) − c1‖_1 ]
        • Style reconstruction: L_recon^s2 = E_{c1, s2}[ ‖E2^s(G2(c1, s2)) − s2‖_1 ]
        • Meaning
          • L1 norm: encourages sharp output images
          • Style reconstruction: encourages diverse outputs given different style codes
          • Content reconstruction: encourages the translated image to preserve the semantic content of the input image
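The three reconstruction terms can be sketched with toy encoders/decoders that are exact inverses (encoding is just slicing here, an illustrative assumption): when encoder and decoder invert each other, all three L1 terms vanish, which is exactly what the bidirectional reconstruction loss encourages.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_C, DIM_S = 4, 2  # toy content/style dimensions (illustrative)

def l1(a, b):
    """L1 reconstruction loss ||a - b||_1 (per-sample, toy version)."""
    return np.abs(a - b).sum()

# Toy encoders/decoder that are exact inverses: an "image" is just the
# concatenation of its content and style codes, so encoding = slicing.
def encode(x):    return x[:DIM_C], x[DIM_C:]    # E^c, E^s
def decode(c, s): return np.concatenate([c, s])  # G

x1 = rng.normal(size=DIM_C + DIM_S)
c1, s1 = encode(x1)

# Image reconstruction: encode then decode the same image.
L_img = l1(decode(c1, s1), x1)

# Cross-domain path: recombine c1 with a random style code s2, then check
# that both content and style survive the round trip through the decoder.
s2 = rng.normal(size=DIM_S)
x12 = decode(c1, s2)
c_back, s_back = encode(x12)
L_content, L_style = l1(c_back, c1), l1(s_back, s2)
```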
      • Adversarial loss: matches the distribution of translated images to the image distribution in the target domain
        • GAN: L_GAN^x2 = E_{c1, s2}[log(1 − D2(G2(c1, s2)))] + E_x2[log D2(x2)] (and symmetrically for domain 1)
        • Meaning
          • D2: distinguishes translated images from real images in domain 2
          • G2: tries to deceive D2 so that translated images are indistinguishable from real ones
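A minimal numpy sketch of the adversarial term (a sigmoid "discriminator score" stands in for the learned discriminator, purely illustrative): the objective E[log D(x_real)] + E[log(1 − D(x_fake))] is high when D separates real from translated images and drops as the generator fools it.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gan_loss(d_real, d_fake):
    """Standard GAN discriminator objective:
    E[log D(x_real)] + E[log(1 - D(x_fake))], maximized by D."""
    return np.log(d_real).mean() + np.log(1.0 - d_fake).mean()

# Toy discriminator scores: D outputs a probability of being real.
d_real = sigmoid(np.array([2.0, 3.0]))    # confident "real" on real images
d_fake = sigmoid(np.array([-2.0, -3.0]))  # confident "fake" on translations

good_d = gan_loss(d_real, d_fake)         # near 0, the objective's maximum
fooled_d = gan_loss(sigmoid(np.array([0.0])), sigmoid(np.array([0.0])))
# The generator pushes d_fake toward 1, lowering this objective for D.
```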
  • Related work: BicycleGAN can also model continuous and multimodal distributions, but it requires paired supervision
  • References
    • [1] Huang, Xun, et al. “Multimodal unsupervised image-to-image translation.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.