[Paper Review/Summary] DETR, End-to-End Object Detection with Transformers

My Vision, Computer Vision

gyuilLim 2024. 9. 11. 20:13

End-to-End Object Detection with Transformers

We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor gene

arxiv.org

Abstract

DETR treats object detection as a direct set prediction problem.
It also streamlines the detection pipeline by removing hand-designed components such as NMS and anchor generation.

NMS (Non-Maximum Suppression) : an algorithm that removes the duplicates among the multiple bounding boxes predicted for a single object, keeping the highest-scoring one
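To make the definition concrete, here is a minimal sketch of greedy NMS in plain Python. The box format `(x1, y1, x2, y2)`, the function names, and the threshold of 0.5 are assumptions chosen for illustration; they do not come from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already-kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object, plus one box on another object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive
```

The threshold is a hand-tuned hyperparameter, which is the kind of hand-designed component DETR aims to remove.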

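Anchor generation : the step that produces anchors, i.e. predefined reference boxes tiled over a grid; used by anchor-based detectors, both two-stage (e.g. Faster R-CNN) and one-stage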
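As a rough illustration of what anchor generation involves, the sketch below tiles boxes of several scales and aspect ratios over every cell of a feature-map grid. The function name, the parameterization (ratio defined as width / height, scale as the square root of the box area), and the example values are all assumptions for illustration, not taken from any specific detector.

```python
def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as (x1, y1, x2, y2), one per (cell, scale, ratio)."""
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Center of this grid cell in image coordinates.
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area at s * s while setting w / h = r.
                    w = s * r ** 0.5
                    h = s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(2, 3, stride=16, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
```

Even this tiny 2x3 grid yields 36 boxes, and the scales and ratios have to be chosen by hand, which is why anchor design affects detector performance.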

Introduction

The performance of existing object detectors depends heavily on the hand-designed steps mentioned above (NMS post-processing and anchor design), whereas DETR removes these steps and is trained end-to-end.
DETR is built on the Transformer architecture from NLP, and its main contributions are as follows.
- Bipartite matching loss : the loss is computed after the predictions have been matched one-to-one (bipartite matching) with the ground truth
- (Non-autoregressive) Transformer encoder-decoder : a non-autoregressive decoder outputs all predictions in parallel (parallel decoding)

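Bipartite matching : given two disjoint sets A and B, a set of edges between A and B in which every element is an endpoint of at most one edge, so each element of A is paired with at most one element of B and vice versa.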

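Non-autoregressive : typical RNN and Transformer decoders are autoregressive and emit their outputs one at a time, whereas a non-autoregressive decoder emits multiple predictions at once. This fits object detection conceptually, since the objects in an image form a set with no inherent order, and it is also computationally efficient because it avoids sequential decoding steps.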

Object Detection Set Prediction Loss

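DETR outputs a fixed number N of predictions for each image. N is assumed to be larger than the number of objects in the image.
The DETR loss is computed in the following order.
- 1. Match predictions to ground truth (bipartite matching)
- 2. Compute the loss between the matched predictions and ground truth
Let's look at each step through its equations.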

1. Matching predictions to ground truth

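$N$ : the fixed number of predictions. Because $N$ is larger than the number of objects, the ground-truth set is padded with $\varnothing$ (no object) up to size $N$.
$y$ : the ground-truth set (after padding, of size $N$)
$\hat y = \{\hat y_i\}^N_{i=1}$ : the set of $N$ predictions
$\mathcal L_{match}(y_i, \hat y_{\sigma(i)})$ : the pairwise matching cost between ground truth $y_i$ and the prediction with index $\sigma(i)$
The optimal assignment is the permutation of the $N$ indices with the lowest total matching cost: $\hat \sigma = \arg\min_{\sigma \in \mathfrak S_N} \sum_{i=1}^{N} \mathcal L_{match}(y_i, \hat y_{\sigma(i)})$.
That is, $\hat \sigma(i)$ is the index of the prediction assigned to the i-th ground truth.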
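The matching step can be sketched in a few lines of Python. This is a brute-force search over all permutations, not the Hungarian algorithm, so it is only practical for a very small $N$; in practice a solver such as `scipy.optimize.linear_sum_assignment` would be used instead. The data layout (a prediction is a pair of class probabilities and a box) and the simplified cost (negative class probability plus L1 box distance, zero for padded slots) are assumptions for illustration.

```python
from itertools import permutations

NO_OBJECT = None  # stands in for the "no object" symbol used for padding


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(gt, pred):
    """Simplified pairwise cost between one ground truth and one prediction."""
    if gt is NO_OBJECT:
        return 0.0  # padded slots add nothing to the matching cost
    gt_class, gt_box = gt
    pred_probs, pred_box = pred
    return -pred_probs[gt_class] + l1(gt_box, pred_box)


def best_assignment(gts, preds):
    """Return (sigma, cost), where sigma[i] is the prediction index for GT i."""
    n = len(preds)
    padded = list(gts) + [NO_OBJECT] * (n - len(gts))  # pad GT up to N
    best_sigma, best_cost = None, float("inf")
    for sigma in permutations(range(n)):
        cost = sum(match_cost(padded[i], preds[sigma[i]]) for i in range(n))
        if cost < best_cost:
            best_sigma, best_cost = sigma, cost
    return best_sigma, best_cost


# N = 3 predictions: (class probabilities [class 0, class 1], box (cx, cy, w, h)).
preds = [
    ([0.10, 0.80], (0.6, 0.6, 0.2, 0.2)),
    ([0.05, 0.05], (0.9, 0.1, 0.1, 0.1)),
    ([0.90, 0.05], (0.2, 0.3, 0.2, 0.2)),
]
# Two real objects; the third ground-truth slot is padding.
gts = [(0, (0.2, 0.3, 0.2, 0.2)), (1, (0.6, 0.6, 0.2, 0.2))]

sigma, cost = best_assignment(gts, preds)
```

Here ground truth 0 is paired with prediction 2, ground truth 1 with prediction 0, and the leftover prediction 1 falls on the padded slot.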

2. Computing the loss between matched predictions and ground truth

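The Hungarian algorithm is a standard algorithm for finding an optimal matching in a bipartite graph.
Once the optimal matching between predictions and ground truth has been found, the loss is computed over the matched pairs: $\mathcal L_{Hungarian}(y, \hat y) = \sum_{i=1}^{N} \left[ -\log \hat p_{\hat \sigma(i)}(c_i) + \mathbb 1_{\{c_i \neq \varnothing\}} \mathcal L_{box}(b_i, \hat b_{\hat \sigma(i)}) \right]$, where $c_i$ and $b_i$ are the class and box of the i-th ground truth.
For each pair, the loss is the negative log-probability that prediction $\hat \sigma(i)$ assigns to class $c_i$, plus a box loss that applies only when the ground truth is a real object ($c_i \neq \varnothing$).
The box loss is a weighted sum of the L1 loss and the generalized IoU loss between the predicted box and the GT box.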
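The loss for a given matching can be sketched as follows. Boxes are assumed to be `(cx, cy, w, h)` with positive width and height, the last entry of each probability list is assumed to be the no-object class, and the weights `w_l1` and `w_giou` are illustrative values. The sketch also leaves out any re-weighting of the no-object term.

```python
import math

NO_OBJECT = None  # padded ground-truth slot


def to_corners(box):
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(a, b):
    """Generalized IoU of two (cx, cy, w, h) boxes with positive size."""
    ax1, ay1, ax2, ay2 = to_corners(a)
    bx1, by1, bx2, by2 = to_corners(b)
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(gt_box, pred_box, w_l1=5.0, w_giou=2.0):
    """Weighted sum of the L1 distance and the generalized IoU loss."""
    l1 = sum(abs(g - p) for g, p in zip(gt_box, pred_box))
    return w_l1 * l1 + w_giou * (1.0 - giou(gt_box, pred_box))


def hungarian_loss(gts, preds, sigma, no_object_class):
    """Sum of per-pair losses, where sigma[i] is the prediction index for GT i."""
    total = 0.0
    for i, j in enumerate(sigma):
        probs, pred_box = preds[j]
        gt = gts[i] if i < len(gts) else NO_OBJECT
        if gt is NO_OBJECT:
            # Padded slot: classification term only, no box term.
            total += -math.log(probs[no_object_class])
        else:
            gt_class, gt_box = gt
            total += -math.log(probs[gt_class]) + box_loss(gt_box, pred_box)
    return total


# Probabilities are [class 0, class 1, no object]; boxes are (cx, cy, w, h).
preds = [
    ([0.10, 0.80, 0.10], (0.6, 0.6, 0.2, 0.2)),
    ([0.05, 0.05, 0.90], (0.9, 0.1, 0.1, 0.1)),
    ([0.90, 0.05, 0.05], (0.2, 0.3, 0.2, 0.2)),
]
gts = [(0, (0.2, 0.3, 0.2, 0.2)), (1, (0.6, 0.6, 0.2, 0.2))]
sigma = (2, 0, 1)  # an assignment such as the one produced by the matching step

loss = hungarian_loss(gts, preds, sigma, no_object_class=2)
```

Because the matched boxes in this toy example coincide with the ground truth exactly, the box terms vanish and only the classification terms remain.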
