Table of contents (full version) [paper] [github]


Summary

  • Field: multi-label image recognition (each image carries multiple labels)
    • Objects in an image are related through a complex topology, so modeling label dependency is important in this field → GCN


[Figure: concept of multi-label image recognition]

  • Overall framework
    • Representation learning
      • Input: (448 × 448) image
      • Module: ResNet101 (ImageNet pretrained) produces (2048 × 14 × 14) feature maps, then global pooling is applied (global max-pooling in [1])
      • Output: 2048-dim feature vector
    • Graph convolutional network
      • Input: (C × 300) word embedding features (pretrained, GloVe [2])
      • Module: 2 GCN layers (output dimensions 1024 and 2048)
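        • Equations: $H^{(2)} = h(\hat{A} H^{(1)} W^{(1)})$, $H^{(3)} = h(\hat{A} H^{(2)} W^{(2)})$
        • H: label node features of each layer; W: learnable transformation matrix
        • A: correlation matrix ($\hat{A}$ is its normalized version)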
          • Data-driven way: count the label pairs that co-occur in the training set and convert the counts to conditional probabilities
          • Asymmetric: the probability that a tennis racket appears given that a person is in the image is lower than the probability that a person appears given a tennis racket
          • Binary correlation matrix: rare label pairs can act as noise, so the probabilities are binarized to {0, 1} with a threshold
          • Re-weighted correlation matrix: the binary matrix can cause over-smoothing, as if the nodes were clustered together, so a fixed value is assigned to the 0 entries
        • h(⋅): non-linear operator (LeakyReLU)
      • Output: (C × 2048) inter-dependent object classifiers
    • Final stage
      • Dot product of the image feature and the classifiers → predicted scores → sigmoid → multi-label classification loss
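The data-driven construction of A described above can be illustrated with a small sketch. It is not the authors' code: the function name `correlation_matrix`, the threshold parameter `tau`, and the toy label matrix are assumptions made for illustration. The sketch counts label co-occurrences, converts them to conditional probabilities P(L_j | L_i), and binarizes them with a threshold. The re-weighting step is left out.

```python
import numpy as np

def correlation_matrix(labels, tau=0.5):
    """Build a conditional-probability matrix and its binarized version.

    labels: (N, C) binary array, labels[n, i] = 1 if image n carries label i.
    tau:    hypothetical threshold used for binarization.
    Returns (P, A_bin) with P[i, j] estimating P(L_j | L_i).
    """
    labels = np.asarray(labels, dtype=np.float64)
    cooc = labels.T @ labels                 # cooc[i, j]: images containing both i and j
    counts = np.diag(cooc).copy()            # counts[i]: images containing label i
    P = cooc / np.maximum(counts[:, None], 1.0)  # row i is conditioned on label i
    A_bin = (P >= tau).astype(np.float64)    # drop rare (noisy) pairs
    return P, A_bin

# Toy labels for 5 images; columns are (person, tennis racket). Made-up data.
labels = np.array([
    [1, 0],
    [1, 0],
    [1, 0],
    [1, 1],
    [1, 1],
])

P, A_bin = correlation_matrix(labels, tau=0.5)
print(P)
print(A_bin)
```

On this toy data P(racket | person) = 0.4 while P(person | racket) = 1.0, so P is not symmetric, and with the threshold 0.5 only the racket → person edge survives binarization.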


[Figure: overall framework]
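The framework can be traced end to end with a short NumPy sketch that checks the tensor shapes listed above. Everything concrete in it is an assumption for illustration: random arrays stand in for the GloVe embeddings and the pooled ResNet101 feature, the 4-label matrix A is made up, the normalization is taken to be D^{-1/2} A D^{-1/2}, the LeakyReLU slope is set to 0.2, and binary cross-entropy stands in for the multi-label classification loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_adj(A):
    """Assumed normalization: D^{-1/2} A D^{-1/2}, D = diagonal of row sums."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

C = 4                                        # hypothetical number of labels
H1 = rng.normal(size=(C, 300))               # stand-in for (C x 300) word embeddings
A = np.array([[1, 1, 0, 0],                  # made-up correlation matrix with self-loops
              [1, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 1]], dtype=np.float64)
A_hat = normalize_adj(A)

W1 = rng.normal(scale=0.05, size=(300, 1024))    # first GCN layer weights
W2 = rng.normal(scale=0.05, size=(1024, 2048))   # second GCN layer weights

H2 = leaky_relu(A_hat @ H1 @ W1)             # (C, 1024)
H3 = leaky_relu(A_hat @ H2 @ W2)             # (C, 2048): one classifier per label

x = rng.normal(size=2048)                    # stand-in for the pooled image feature
scores = H3 @ x                              # (C,) predicted scores via dot product
probs = sigmoid(scores)

y = np.array([1.0, 0.0, 1.0, 0.0])           # made-up ground-truth labels
p = np.clip(probs, 1e-7, 1 - 1e-7)           # avoid log(0)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(H2.shape, H3.shape, probs.shape, float(loss))
```

The point of the design is that the classifiers are not free parameters: each row of H3 is generated from the label embeddings through A, so correlated labels end up with related classifiers.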


Experimental results

  • Dataset
    • MS-COCO, VOC 2007
  • Ablation studies
    • Type of word embedding
    • Threshold value
    • Fixed value used in the re-weighted A
    • Number of GCN layers
  • Additional experiments
    • Per-class t-SNE comparison between vanilla ResNet and ML-GCN
    • Experiments from an image retrieval perspective

References

[1] Chen, Zhao-Min, et al. “Multi-Label Image Recognition with Graph Convolutional Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[2] Pennington, Jeffrey, Richard Socher, and Christopher Manning. “GloVe: Global Vectors for Word Representation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.