[2019 CVPR] Multi-Label Image Recognition with Graph Convolutional Networks

Table of content (full-version) [paper] [github]

Summary
Experimental results
References

Summary

Multi-label image recognition 분야 (영상마다 다수의 label 존재)
- Object는 서로 복잡한 topology를 가지고 있어, lable dependency를 모델링하는 것이 중요한 분야 $\rightarrow$ GCN

[Multi-label image recognition 개념]

전체 프레임워크
- Representation learning
  - 입력: (448 $\times$ 448) 영상
  - 모듈: ResNet101 에 의해서 (2048 $\times$ 14 $\times$ 14) feature vector (ImageNet pretrained), GAP 적용
  - 출력: 2048-dim feature vector
- Graph convolutional network
  - 입력: (C $\times$ 300) word embedding features (pretrained, GLoVe [2])
  - 모듈: GCN 2개 (1024, 2048 dimension)
    - 수식: $H^2 = h(\hat{A} H^1 W^1)$ , $H^3 = h(\hat{A} H^2 W^2)$
    - $H$ : learnable transformation network
    - : correlation matrix ( normalized)
      - Data-driven way: 학습 셋에 있는 label pair를 이용, 확률로 변환
      - Assymetric: 영상에 사람이 있을 때 테니스 라켓까지 포함되는 것이, 테니스 라켓있을 때 사람이 포함될 확률보다 적다.
      - Binary correlation matrix: 희귀한 label pair는 오히려 noise가 될 수 있기에 임계값을 통한 (0,1) 이산화
      - Re-weighted correlation matrix: clustering된 것처럼 over-smoothing 될 수 있기 때문에 0에 일정한 값 부여
    - $h(\cdot)$ : non-linear operator (LeakyReLU)
  - 출력: (C $\times$ 2048) inter dependent object classifier
- 최종
  - Dot product, predicted score, sigmoid, multi-label classification loss

[전체 프레임워크]

Experimental results

Dataset
- MS-COCO, VOC2017
Ablation studies
- Word embedding 종류
- 임계값 변화
- Re-weighted A의 일정한 값 변화
- GCN의 layer 수
추가 실험
- Vanilla ResNet과 ML-GCN의 class별 t-SNE비교
- Image retrieval 분야 관점에서의 실험

References

[1] Chen, Zhao-Min, et al. “Multi-Label Image Recognition with Graph Convolutional Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[2] Pennington, Jeffrey, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation.” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.