My Vision, Computer Vision

[Paper Summary/Review] LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

gyuilLim 2025. 5. 2. 18:05

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highligh

arxiv.org

Author : Yang, Zhao, et al
Journal : CVPR 2022
Keyword : PWAM, Texture bias
Published Date : December 4, 2021

Problem

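Existing RES (Referring Expression Segmentation) networks have mostly passed the image and the text through their own encoders and then applied attention and fusion to the resulting features in the decoder.
The problem with these approaches is that, even though the features of the two modalities are fused in the decoder, the result remains dependent on features that were already encoded separately.
This paper therefore applies attention to intermediate features during encoding, which greatly improves cross-modal alignment.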

Contributions

It proposes LAVT (Language-Aware Vision Transformer), a Transformer-based RES network.
It achieves SOTA on the RES benchmarks.

Method

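Language-aware visual encoding

A Swin Transformer is used as the image encoder, and attention with the text features $L$ is performed at three feature-map stages, $V_1, V_2, V_3$.
The feature map and the text features are combined in a multi-modal feature fusion module called PWAM (Pixel-Word Attention Module).
The output computed by PWAM is weighted by the LG (Language Gate) and added element-wise to the feature map $V_i$.
The paper calls the part where this weighted output is added to the feature map the LP (Language Pathway).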

Pixel-word attention module

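PWAM takes the image feature map and the text features as input, performs an attention operation, and outputs the result.
$\omega_{iq}, \omega_{ik}, \omega_{iv}, \omega_{iw}$ are all projection layers.
First, $V_i$ passes through $\omega_{iq}$ to become $V_{iq}$, the input query of the cross-attention.
$L$ passes through $\omega_{ik}$ and $\omega_{iv}$ to become the input key and value of the cross-attention, $L_{ik}$ and $L_{iv}$, respectively.
The query and key are matrix-multiplied and passed through a softmax to form the attention map, which is then multiplied by the value to produce the image-language attention $G_i'$.
$G_i'$ passes through $\omega_{iw}$ to become $G_i$, the final cross-attention output.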
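The cross-attention steps above can be sketched in NumPy for a single image. This is a minimal illustration, not the authors' implementation: the projections are plain weight matrices (hypothetical names `Wq`, `Wk`, `Wv`, `Ww`) instead of learned layers, the $1/\sqrt{d}$ scaling is assumed from standard scaled dot-product attention, and `word_mask` is an assumed boolean mask that hides padded words.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def pixel_word_attention(V, L, Wq, Wk, Wv, Ww, word_mask=None):
    """V: (N, C) flattened pixel features, L: (T, Ct) word features.

    Returns G of shape (N, C) and the (N, T) attention map.
    """
    q = V @ Wq                                   # (N, d) queries from pixels
    k = L @ Wk                                   # (T, d) keys from words
    v = L @ Wv                                   # (T, d) values from words
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (N, T) pixel-word similarity
    if word_mask is not None:
        # Padded words get a very negative score, so softmax gives them ~0 weight.
        scores = np.where(word_mask[None, :], scores, -1e9)
    attn = softmax(scores, axis=-1)              # each pixel attends over the words
    G_prime = attn @ v                           # (N, d) language context per pixel
    return G_prime @ Ww, attn                    # final projection back to C channels
```

Each row of the attention map sums to one, so every pixel receives a convex combination of the word values before the final projection.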

Finally, the feature map passes through $\omega_{im}$ to become $V_{im}$, which is multiplied element-wise with $G_i$ and passed through $\omega_{io}$ to produce the output $F_i$.
The actual code is as follows.

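    def forward(self, x, l, l_mask):
        # x: visual features V_i, shape (B, H*W, dim)
        # l, l_mask: language features L and their mask
        vis = self.vis_project(x.permute(0, 2, 1))  # V_im, (B, dim, H*W)

        # pixel-word cross attention, producing G_i
        lang = self.image_lang_att(x, l, l_mask)  # (B, H*W, dim)
        lang = lang.permute(0, 2, 1)  # (B, dim, H*W)

        # element-wise product of V_im and G_i, then the output projection gives F_i
        mm = torch.mul(vis, lang)
        mm = self.project_mm(mm)  # (B, dim, H*W)

        mm = mm.permute(0, 2, 1)  # back to (B, H*W, dim)
        return mm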
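
The PWAM steps described above can be collected into equations. This is a reconstruction from the prose, not a copy of the paper's formulas: the $\mathrm{flatten}$ and $\mathrm{unflatten}$ reshaping operators and the $\sqrt{C_i}$ scaling (with $C_i$ the channel dimension at stage $i$) are assumptions based on standard scaled dot-product attention, and the last projection is written $\omega_{io}$ here.

```latex
\begin{aligned}
V_{iq} &= \mathrm{flatten}(\omega_{iq}(V_i)), \qquad
L_{ik} = \omega_{ik}(L), \qquad
L_{iv} = \omega_{iv}(L) \\
G_i' &= \mathrm{softmax}\!\left(\frac{V_{iq}^{\top} L_{ik}}{\sqrt{C_i}}\right) L_{iv}^{\top} \\
G_i &= \omega_{iw}\!\left(\mathrm{unflatten}\!\left(G_i'^{\top}\right)\right) \\
V_{im} &= \omega_{im}(V_i) \\
F_i &= \omega_{io}\!\left(V_{im} \odot G_i\right)
\end{aligned}
```

In the code above, `vis` plays the role of $V_{im}$, `lang` the role of $G_i$, and `project_mm` the role of the final projection.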

Language pathway

The language pathway keeps $F_i$ from overwhelming the visual signal.

$\gamma_i$ is a two-layer perceptron.
$S_i$, the output of the perceptron, is multiplied element-wise with $F_i$, and the product is added to the feature map $V_i$.
The resulting $E_i$ becomes the input to the next layer.
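
The gating described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the perceptron takes $F_i$ as input, uses a ReLU after the first layer and a tanh after the second, and has no biases. The weight names `W1` and `W2` are hypothetical.

```python
import numpy as np

def language_gate(F, W1, W2):
    """Two-layer perceptron producing element-wise weights S in (-1, 1).

    F: (N, C) fused features; W1, W2: (C, C) weights (biases omitted here).
    """
    hidden = np.maximum(F @ W1, 0.0)   # first layer + ReLU (assumed activation)
    return np.tanh(hidden @ W2)        # second layer + tanh (assumed activation)

def language_pathway(V, F, W1, W2):
    """E = V + S * F: add the gated multi-modal features to the visual features."""
    S = language_gate(F, W1, W2)
    return V + S * F
```

Because the tanh keeps every weight in $S_i$ between -1 and 1, the update added to $V_i$ can never exceed $F_i$ in magnitude, and a gate that outputs zeros leaves the visual features untouched.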

Segmentation

In the decoding stage, the feature maps are upsampled to generate the segmentation mask.
$\upsilon$ is an upsampling operation that uses bilinear interpolation.
$\rho_i$ is a projection function consisting of two 3×3 convolution layers.
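
The upsampling step can be sketched in NumPy as below. Two things are assumptions not stated above: the corner-aligned sampling grid, and the channel-wise concatenation of the upsampled map with the finer feature map before the projection. The projection $\rho_i$ itself is left out, and the helper name `decoder_input` is hypothetical.

```python
import numpy as np

def bilinear_upsample(x, out_h, out_w):
    """Bilinear interpolation of x: (C, H, W) to (C, out_h, out_w).

    Uses a corner-aligned grid: output corners coincide with input corners.
    """
    _, h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)            # fractional source rows
    xs = np.linspace(0, w - 1, out_w)            # fractional source columns
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)               # clamp neighbours at the border
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]                # vertical interpolation weights
    wx = (xs - x0)[None, None, :]                # horizontal interpolation weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bottom = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def decoder_input(Y_coarse, F_fine):
    """Upsample the coarser map to F_fine's resolution and stack the channels.

    The result would then be fed to the projection function (omitted here).
    """
    up = bilinear_upsample(Y_coarse, F_fine.shape[1], F_fine.shape[2])
    return np.concatenate([up, F_fine], axis=0)
```

A constant map stays constant under this interpolation, which is a quick sanity check that the weights sum to one at every output location.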

Experiment

Performance improved considerably on all of the RES benchmarks: RefCOCO, RefCOCO+, and G-Ref.
