My Vision, Computer Vision

[Paper Summary/Review] Vision Transformers for Dense Prediction

gyuilLim 2025. 5. 19. 19:11

Vision Transformers for Dense Prediction

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like represe

arxiv.org

Author : Ranftl, R., Bochkovskiy, A., & Koltun, V.
Journal : ICCV 2021
Keyword : DPT
Published Date : March 24, 2021

Problem Statement

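A transformer-based backbone produces representations at a constant and relatively high resolution throughout the network.
This property allows the dense prediction transformer (DPT) to make finer-grained and more globally coherent predictions than convolutional networks.
Prior work has mostly focused on feature aggregation in the decoder, but the decoder cannot recover information that was already lost in the encoder, so the encoder arguably has the greater influence on the result.
A convolutional network extracts multi-scale features from the input image. Downsampling serves the following purposes:
1. It enlarges the receptive field.
2. It groups low-level features into abstract high-level ones.
3. It reduces memory use and computation.
However, downsampling has a critical drawback: the resolution and fine detail of the feature maps are progressively lost with depth.
Resolution and fine detail may matter little for tasks such as classification, but they are essential for dense prediction tasks such as segmentation, where the output must be restored to the same resolution as the input image.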
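
To make the resolution loss concrete, the sketch below compares the spatial size of the feature maps at four stages of a convolutional backbone and of a ViT. The cumulative strides (4, 8, 16, 32), the patch size of 16, and the 384 x 384 input are assumptions chosen for illustration (typical of ResNet-style networks and ViT-B/16), not values taken from this post.

```python
def conv_stage_sizes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size after each stage, given assumed cumulative strides."""
    return [(height // s, width // s) for s in strides]


def vit_stage_sizes(height, width, patch=16, num_stages=4):
    """A ViT keeps the patch grid fixed, so every stage has the same size."""
    return [(height // patch, width // patch)] * num_stages


cnn_sizes = conv_stage_sizes(384, 384)
vit_sizes = vit_stage_sizes(384, 384)
print(cnn_sizes)  # [(96, 96), (48, 48), (24, 24), (12, 12)]
print(vit_sizes)  # [(24, 24), (24, 24), (24, 24), (24, 24)]
```

The convolutional feature maps shrink at every stage, while the ViT grid stays at 24 x 24 from the first layer to the last.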

Transformer Encoder

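In ViT, embedded patches and tokens correspond one-to-one, so the resolution of the initial embedding is preserved through every layer.
In addition, multi-head self-attention (MSA) is a global operation in which every token attends to every other token, so a transformer-based model has a global receptive field at every layer. This contrasts with a CNN, whose receptive field widens gradually with depth.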
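
The global receptive field can be seen in a minimal single-head self-attention sketch in NumPy. The function name `self_attention`, the random weights, and the sizes are assumptions for illustration; real MSA uses several heads and learned weights.

```python
import numpy as np


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token set."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (N, N) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over all tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
num_tokens, dim = 5, 8                             # illustrative sizes
tokens = rng.normal(size=(num_tokens, dim))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
output, weights = self_attention(tokens, w_q, w_k, w_v)
print(weights.shape)  # (5, 5)
```

Every row of `weights` assigns a non-zero weight to every token, so each output token depends on the whole input after a single layer.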
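A transformer is inherently a set-to-set operation and therefore does not preserve positional information. A position embedding is added to each token to encode it.
ViT also has a readout token: an additional learnable token that does not correspond to any image patch and is used for classification.
Embedding an image of size $H \times W$ yields the token set $t^0 = \{t^0_0, \dots, t^0_{N_p}\}$, $t^0_n \in \mathbb R^D$. Each token has $D$ dimensions, and the number of patches is $N_p = \frac{HW}{p^2}$ for patch size $p$. $t_0$ is the readout token.
The token set passes through $L$ transformer layers, and $t^l$ denotes the output of the $l$-th layer.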
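
The construction of the initial token set can be sketched in NumPy as follows. The function name `embed_image`, the random weights, and the sizes (a 64 x 64 RGB image, patch size 16, $D = 32$) are assumptions for illustration; in an actual ViT the projection, readout token, and position embedding are learned.

```python
import numpy as np


def embed_image(image, patch, w_embed, readout, pos_embed):
    """Split an (H, W, C) image into patches and build the N_p + 1 tokens."""
    height, width, channels = image.shape
    grid_h, grid_w = height // patch, width // patch
    # Cut into non-overlapping patches, then flatten each patch to a vector.
    patches = image.reshape(grid_h, patch, grid_w, patch, channels)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(grid_h * grid_w, -1)
    tokens = patches @ w_embed                                   # (N_p, D)
    tokens = np.concatenate([readout[None, :], tokens], axis=0)  # (N_p + 1, D)
    return tokens + pos_embed                                    # add positions


rng = np.random.default_rng(0)
H, W, C, p, D = 64, 64, 3, 16, 32
num_patches = (H * W) // (p * p)                   # N_p = HW / p^2 = 16
image = rng.normal(size=(H, W, C))
w_embed = rng.normal(size=(p * p * C, D))          # stand-in for learned weights
readout = rng.normal(size=D)                       # stand-in readout token
pos_embed = rng.normal(size=(num_patches + 1, D))  # stand-in position embedding
tokens0 = embed_image(image, p, w_embed, readout, pos_embed)
print(tokens0.shape)  # (17, 32)
```

Index 0 holds the readout token and indices 1 to $N_p$ hold the patch tokens in row-major patch order.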

Convolutional Decoder

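The decoder reassembles the token set produced by the transformer encoder into image-like feature maps and then produces the final dense prediction.
To recover an image-like representation during decoding, the paper proposes a three-stage Reassemble operation, defined as follows.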

$$ \textrm{Reassemble}^{\hat D}_s(t) = (\textrm{Resample}_s \circ \textrm{Concatenate} \circ \textrm {Read})(t) $$

As shown above, the embedded tokens are restored to an image-like form by passing through the Read, Concatenate, and Resample stages.

Reassemble

First, the Read stage handles the readout token. Because the token set has to be reshaped into an image-like form, the $N_p+1$ tokens must be mapped to $N_p$ tokens.

$$ \textrm{Read} : \mathbb R^{(N_p+1)\times D} \rightarrow \mathbb R^{N_p\times D} $$

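The paper experiments with three ways of performing this mapping.
- $\textrm{Read}_{ignore}$ : the readout token is discarded.
- $\textrm{Read}_{add}$ : the readout token is added to every other token.
- $\textrm{Read}_{proj}$ : the readout token is concatenated to each token, and a projection layer maps the result back to $D$ dimensions.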
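
The three variants can be sketched in NumPy as below, assuming the readout token sits at index 0 of a `(N_p + 1, D)` array. The function names and the random projection weights are assumptions for illustration; the projection here is a plain linear map, and any nonlinearity used in the actual model is omitted.

```python
import numpy as np


def read_ignore(tokens):
    """Drop the readout token (index 0) and keep the N_p patch tokens."""
    return tokens[1:]


def read_add(tokens):
    """Add the readout token to every patch token."""
    return tokens[1:] + tokens[0]


def read_proj(tokens, weight, bias):
    """Concatenate the readout token to each patch token, then project 2D -> D."""
    patch_tokens = tokens[1:]
    readout = np.broadcast_to(tokens[0], patch_tokens.shape)
    stacked = np.concatenate([patch_tokens, readout], axis=-1)  # (N_p, 2D)
    return stacked @ weight + bias                              # (N_p, D)


rng = np.random.default_rng(0)
num_patches, dim = 16, 32                    # illustrative sizes
tokens = rng.normal(size=(num_patches + 1, dim))
weight = rng.normal(size=(2 * dim, dim))     # stand-in for learned weights
bias = np.zeros(dim)

for result in (read_ignore(tokens), read_add(tokens), read_proj(tokens, weight, bias)):
    print(result.shape)  # (16, 32) each time
```

All three return an array of shape `(N_p, D)`, which is what the following reshape step requires.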
After the Read stage there are $N_p$ tokens, so they can be reshaped into a $\frac{H}{p} \times \frac{W}{p}$ grid. This step is Concatenate.

$$ \textrm{Concatenate} : \mathbb R^{N_p \times D} \rightarrow \mathbb R^{\frac{H}{p} \times \frac{W}{p}\times D} $$
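
As a minimal sketch, this step is a single reshape in NumPy. The function name `concatenate` and the sizes are assumptions, and the sketch assumes the patch tokens are stored in row-major patch order.

```python
import numpy as np


def concatenate(tokens, grid_h, grid_w):
    """Reshape (N_p, D) tokens into a (H/p, W/p, D) image-like feature map."""
    num_tokens, dim = tokens.shape
    if num_tokens != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")
    return tokens.reshape(grid_h, grid_w, dim)


H, W, p, D = 64, 96, 16, 8                   # illustrative sizes
grid_h, grid_w = H // p, W // p              # 4 x 6 patch grid
tokens = np.arange(grid_h * grid_w * D, dtype=float).reshape(grid_h * grid_w, D)
feature_map = concatenate(tokens, grid_h, grid_w)
print(feature_map.shape)  # (4, 6, 8)
```

Under the row-major assumption, token `n` lands at row `n // grid_w` and column `n % grid_w` of the feature map.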

Finally, the Resample stage changes the spatial size and the channel dimension of the feature map.

$$ \textrm{Resample}_s : \mathbb R^{\frac{H}{p}\times \frac{W}{p}\times D} \rightarrow \mathbb R^{\frac{H}{s}\times \frac{W}{s}\times \hat D} $$

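$s$ is the ratio of the output feature-map size to the input image size, so the output is $\frac{H}{s} \times \frac{W}{s}$, and $\hat D$ is the channel dimension of the output feature map.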
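
The shape bookkeeping of Resample can be sketched as below. This is not the learned layers of the actual model: as a stand-in, the sketch uses a per-pixel linear projection for the channel change, nearest-neighbour repetition when $s < p$, and average pooling when $s > p$. The function name `resample` and all sizes are assumptions for illustration.

```python
import numpy as np


def resample(feature_map, patch, s, weight):
    """Map (H/p, W/p, D) to (H/s, W/s, D_hat) with fixed stand-in operations."""
    x = feature_map @ weight                 # per-pixel projection D -> D_hat
    if s < patch:                            # upsample: repeat each cell
        factor = patch // s
        x = x.repeat(factor, axis=0).repeat(factor, axis=1)
    elif s > patch:                          # downsample: average pooling
        factor = s // patch
        h, w, d = x.shape
        x = x.reshape(h // factor, factor, w // factor, factor, d).mean(axis=(1, 3))
    return x


rng = np.random.default_rng(0)
H, W, p, D, D_hat = 64, 64, 16, 32, 8        # illustrative sizes
feature_map = rng.normal(size=(H // p, W // p, D))
weight = rng.normal(size=(D, D_hat))         # stand-in for learned weights

for s in (4, 16, 32):
    print(s, resample(feature_map, p, s, weight).shape)
# 4 (16, 16, 8)
# 16 (4, 4, 8)
# 32 (2, 2, 8)
```

In each case the output has spatial size $\frac{H}{s} \times \frac{W}{s}$ and $\hat D$ channels, matching the mapping above.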
Features are taken from different layers depending on the ViT variant.
- ViT-Large : $l = \{6, 12, 18, 24\}$
- ViT-Base : $l = \{3, 6, 9, 12\}$
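
Collecting the outputs of the selected layers can be sketched as follows. The names `HOOK_LAYERS` and `collect_hooked_outputs` are hypothetical, layers are counted from 1, and the toy layers simply add 1 to an integer so that the result is easy to check; a real encoder would apply transformer blocks to a token array.

```python
# Layer indices per variant, as listed above (1-based).
HOOK_LAYERS = {
    "ViT-Large": (6, 12, 18, 24),
    "ViT-Base": (3, 6, 9, 12),
}


def collect_hooked_outputs(tokens, layers, hook_layers):
    """Run the layers in order and keep the outputs t^l at the chosen l."""
    outputs = {}
    for index, layer in enumerate(layers, start=1):
        tokens = layer(tokens)
        if index in hook_layers:
            outputs[index] = tokens
    return outputs


# Toy 12-layer "encoder": each layer adds 1, so t^l equals l when t^0 is 0.
toy_layers = [lambda t: t + 1 for _ in range(12)]
hooked = collect_hooked_outputs(0, toy_layers, HOOK_LAYERS["ViT-Base"])
print(hooked)  # {3: 3, 6: 6, 9: 9, 12: 12}
```

Each collected output would then go through its own Reassemble operation before the decoder combines them.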
