All about image-to-image translation (ongoing..)
Table of contents (short version)
UNIT
- Research area
- Image-to-image translation: mapping an image in one domain to a corresponding image in another domain
- Unsupervised I2I: there exist no paired examples (this one)
- Challenging issues in I2I problems
- I2I problem: key challenge is to learn a joint distribution of images in different domains (probabilistic modeling perspective)
- Unsupervised I2I problem: given two sets of images drawn from two marginal distributions (different domains), infer the joint distribution
- Technical
- Goal is to estimate the two conditionals $p(x_2|x_1)$ and $p(x_1|x_2)$ with learned I2I translation models
- These conditionals are complex and multimodal distributions
- Technical
- The coupling theory [2]: in general, there exists an infinite set of joint distributions that can arrive at the given marginal distributions
- Therefore, inferring the joint distribution from the marginal distributions is a highly ill-posed problem
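The ill-posedness can be made concrete with a tiny numpy sketch (the two binary "domains" below are a hypothetical toy, not from the paper): two very different joint distributions share exactly the same marginals, yet imply completely different translations.

```python
import numpy as np

# Two marginal distributions over single binary "images":
# P(X1=1) = 0.5 and P(X2=1) = 0.5.
# Many joints are consistent with these marginals; two extremes:
joint_independent = np.array([[0.25, 0.25],   # X1 and X2 independent
                              [0.25, 0.25]])
joint_coupled = np.array([[0.5, 0.0],         # X1 and X2 perfectly correlated
                          [0.0, 0.5]])

for joint in (joint_independent, joint_coupled):
    # The marginals are identical in both cases...
    assert np.allclose(joint.sum(axis=1), [0.5, 0.5])  # P(X1)
    assert np.allclose(joint.sum(axis=0), [0.5, 0.5])  # P(X2)

# ...yet the implied "translations" P(X2 | X1=0) differ completely:
print(joint_independent[0] / joint_independent[0].sum())  # [0.5 0.5]
print(joint_coupled[0] / joint_coupled[0].sum())          # [1. 0.]
```

This is why an extra assumption (such as the shared-latent space below) is needed to pick out one joint distribution.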
- Method
- Make a shared-latent space assumption
- A pair of corresponding images in different domains can be mapped to a same latent representation in a shared-latent space
- Basic concept: coupled GAN [3]
- Architecture: 2 encoders (E1, E2), 2 generators/decoders (G1, G2), 2 discriminators (D1, D2); the high-level layers of the encoders and generators are weight-tied
- Role
- {E1, G1}: VAE for domain X1 (analogously {E2, G2} for X2)
- {E1, G2}: Image translator X1 → X2 (analogously {E2, G1} for X2 → X1)
- {G1, D1}: GAN for domain X1
- {E1, G1, D1}: VAE-GAN [4]
- {G1, G2, D1, D2}: CoGAN [3]
- Loss
- Joint objective: $\min_{E_1,E_2,G_1,G_2}\max_{D_1,D_2} \sum_{i=1,2} \mathcal{L}_{\mathrm{VAE}_i} + \mathcal{L}_{\mathrm{GAN}_i} + \mathcal{L}_{\mathrm{CC}_i}$
- $\mathcal{L}_{\mathrm{VAE}_i}$: reconstruction + KL prior on the shared latent code; $\mathcal{L}_{\mathrm{GAN}_i}$: adversarial loss on translated images; $\mathcal{L}_{\mathrm{CC}_i}$: cycle consistency (translating to the other domain and back reconstructs the input)
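The weight-tying idea can be sketched in numpy (toy linear layers standing in for the paper's CNNs; all shapes are hypothetical): the encoders keep domain-specific low-level weights but share the high-level weights mapping into the latent space, and translation composes one domain's encoder with the other domain's decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Domain-specific low-level encoder layers:
W_low1, W_low2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
# Shared high-level encoder layer (weight tying, as in CoGAN/UNIT):
W_shared = rng.normal(size=(3, 4))

def encode1(x1):  # E1: domain-1 image -> code z in the shared-latent space
    return W_shared @ (W_low1 @ x1)

def encode2(x2):  # E2: domain-2 image -> code z in the same space
    return W_shared @ (W_low2 @ x2)

# Decoders mirror this: shared high-level layer, domain-specific low-level layers.
G_shared = rng.normal(size=(4, 3))
G_low1, G_low2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))

def decode2(z):   # G2: shared latent code -> domain-2 image
    return G_low2 @ (G_shared @ z)

# Translation X1 -> X2 = E1 followed by G2, through the shared latent space:
x1 = rng.normal(size=8)
x1_to_2 = decode2(encode1(x1))
print(x1_to_2.shape)  # (8,)
```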
- References
- [1] Liu, Ming-Yu, Thomas Breuel, and Jan Kautz. “Unsupervised image-to-image translation networks.” Advances in neural information processing systems. 2017.
- [2] T. Lindvall. Lectures on the coupling method. Courier Corporation, 2002.
- [3] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. Advances in Neural Information Processing Systems, 2016.
- [4] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. International Conference on Machine Learning, 2016.
MUNIT
- Research area: Unsupervised I2I
- Challenging issues
- Existing methods (e.g., UNIT) assume a deterministic one-to-one or unimodal mapping, and thus fail to capture the full distribution of possible outputs
- Method
- Make a partially shared latent space assumption
- Assume that the image representation can be decomposed into a shared content code (domain-invariant) and a style code (domain-specific)
- To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain
- We further assume that the generators G1, G2 are deterministic functions and have their inverse encoders $E_i = G_i^{-1}$
- Note that although the encoders and decoders are deterministic, the translation distribution $p(x_{1\to 2} \mid x_1)$ is continuous due to the dependency on the sampled style code
- Enables many-to-many cross-domain mapping
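The recombination step can be sketched in numpy (toy linear maps standing in for MUNIT's CNN encoders/decoder; all shapes are hypothetical): the content code of the input is fixed, and each style code drawn from the prior yields a different translation.

```python
import numpy as np

rng = np.random.default_rng(0)

W_content = rng.normal(size=(8, 16))  # content encoder (domain-invariant code)
W_dec = rng.normal(size=(16, 12))     # decoder over [content; style]

def translate(x1, n_samples=3):
    """Translate x1: keep its content code, sample target-domain styles."""
    c = W_content @ x1                          # shared content code (size 8)
    outs = []
    for _ in range(n_samples):
        s = rng.normal(size=4)                  # style from the N(0, I) prior
        outs.append(W_dec @ np.concatenate([c, s]))
    return outs

x1 = rng.normal(size=16)
outs = translate(x1)
# Same content, different styles -> multiple distinct outputs (multimodal).
print(len(outs), outs[0].shape)  # 3 (16,)
```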
- Architecture
- Content encoder: downsampling convolutions + residual blocks → content code; decoder: residual blocks (content and style combined via AdaIN) → upsampling → output image
- Style encoder: downsampling convolutions + global average pooling + FC → style code → MLP → AdaIN parameters injected into the decoder's residual blocks
- Randomly draw style code from prior distribution
- Although the prior distribution is unimodal, the output image distribution can be multimodal thanks to the nonlinearity of the decoder
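The AdaIN combination step above can be sketched in numpy (shapes are illustrative): the content features are normalized per channel, then rescaled and shifted using parameters produced from the style code by the MLP.

```python
import numpy as np

def adain(content_feat, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization: normalize each channel of the content
    features, then apply style-derived scale (gamma) and shift (beta).
    content_feat: (C, H, W) feature map; gamma, beta: (C,) from the style MLP."""
    mu = content_feat.mean(axis=(1, 2), keepdims=True)
    sigma = content_feat.std(axis=(1, 2), keepdims=True)
    normalized = (content_feat - mu) / (sigma + eps)
    return gamma[:, None, None] * normalized + beta[:, None, None]

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))                     # content features
gamma, beta = rng.normal(size=8), rng.normal(size=8)  # from the style MLP
out = adain(feat, gamma, beta)
# After AdaIN, each channel's mean matches beta (up to eps):
print(np.allclose(out.mean(axis=(1, 2)), beta, atol=1e-3))  # True
```

This is how the style code controls the output's appearance while the content features determine its structure.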
- Loss
- Bidirectional reconstruction loss: ensures the encoders and decoders are inverses
- Image reconstruction: $\mathcal{L}^{x_1}_{\mathrm{recon}} = \mathbb{E}_{x_1}\!\left[\lVert G_1(E_1^c(x_1), E_1^s(x_1)) - x_1 \rVert_1\right]$
- Content reconstruction: $\mathcal{L}^{c_1}_{\mathrm{recon}} = \mathbb{E}_{c_1, s_2}\!\left[\lVert E_2^c(G_2(c_1, s_2)) - c_1 \rVert_1\right]$
- Style reconstruction: $\mathcal{L}^{s_2}_{\mathrm{recon}} = \mathbb{E}_{c_1, s_2}\!\left[\lVert E_2^s(G_2(c_1, s_2)) - s_2 \rVert_1\right]$
- Meaning
- L1 norm: encourages sharp output images
- Style recon: encourage diverse outputs given different style codes
- Content recon: encourage the translated image to preserve semantic content of the input image
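A minimal numpy sketch of the three reconstruction terms (the tensors below are random placeholders standing in for actual images and codes, and the perfect-reconstruction noise level is illustrative):

```python
import numpy as np

def l1(a, b):
    """L1 reconstruction loss (encourages sharper outputs than L2)."""
    return np.abs(a - b).mean()

rng = np.random.default_rng(0)
x1 = rng.normal(size=(3, 8, 8))                     # image in domain 1
x1_rec = x1 + 0.01 * rng.normal(size=x1.shape)      # G1(E1c(x1), E1s(x1))
c1 = rng.normal(size=16)                            # content code of x1
c1_rec = c1 + 0.01 * rng.normal(size=16)            # content re-encoded from x1->2
s2 = rng.normal(size=8)                             # sampled style in domain 2
s2_rec = s2 + 0.01 * rng.normal(size=8)             # style re-encoded from x1->2

# Bidirectional reconstruction objective (unit weights, for illustration):
loss = l1(x1, x1_rec) + l1(c1, c1_rec) + l1(s2, s2_rec)
print(loss > 0)  # True: small but nonzero reconstruction error
```

In the paper these terms are weighted by separate hyperparameters before being summed with the adversarial loss.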
- Adversarial loss: matches the distribution of translated images to the image distribution in the target domain
- GAN: $\mathcal{L}^{x_1}_{\mathrm{GAN}} = \mathbb{E}_{c_2, s_1}\!\left[\log(1 - D_1(G_1(c_2, s_1)))\right] + \mathbb{E}_{x_1}\!\left[\log D_1(x_1)\right]$
- Meaning
- D1: distinguishes translated images from real images in domain 1
- G1: tries to deceive D1 so it cannot tell real from translated
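The two sides of the adversarial loss can be sketched in numpy (the discriminator scores below are made-up probabilities; the generator term uses the common non-saturating variant rather than the paper's exact formulation):

```python
import numpy as np

def gan_loss_d(d_real, d_fake, eps=1e-8):
    """Discriminator's cross-entropy loss: push real -> 1, translated -> 0."""
    return -(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)).mean()

def gan_loss_g(d_fake, eps=1e-8):
    """Non-saturating generator loss: push D's score on fakes toward 1."""
    return -np.log(d_fake + eps).mean()

# Hypothetical discriminator outputs on a batch of real / translated images:
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.2, 0.05])

print(round(gan_loss_d(d_real, d_fake), 3))
print(round(gan_loss_g(d_fake), 3))
```

When the discriminator confidently rejects the fakes, the generator loss is large, which is what drives the translator to match the target-domain distribution.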
- Related work: BicycleGAN can also model continuous and multimodal distributions, but it requires paired supervision
- References
- [1] Huang, Xun, et al. “Multimodal unsupervised image-to-image translation.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.