My Vision, Computer Vision

[Paper Summary/Review] SPICE: Semantic Propositional Image Caption Evaluation

gyuilLim 2025. 2. 28. 13:44

SPICE: Semantic Propositional Image Caption Evaluation

There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the ta

arxiv.org

Journal : ECCV 2016
Published Date : September 16, 2016
keyword : Evaluation Metric, SPICE score

Problem

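Captioning benchmarks such as MS-COCO need a fast and accurate evaluation metric.
However, existing metrics fall short of replacing human evaluation.
BLEU, ROUGE, CIDEr, METEOR, and similar metrics are overly sensitive to n-gram overlap.
But n-gram overlap is neither necessary nor sufficient for two sentences to convey the same meaning.

For example, consider the two sentences below.
- (a) : A young girl standing on top of a tennis court.
- (b) : A giraffe standing on top of a green field.
These two sentences mean different things, but because they share “standing on top of a”, n-gram based metrics give them a high score.
Conversely, in the case below,
- (c) : A shiny metal pot filled with some diced veggies.
- (d) : The pan on the stove has chopped vegetables in it.
the two sentences convey the same meaning, yet they receive a low score when evaluated by n-gram overlap.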
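The two failure cases above can be checked with a small n-gram precision calculation. This is only a sketch: `ngrams` and `ngram_precision` are hypothetical helpers written for illustration, with naive whitespace tokenization, and they are not the actual BLEU, ROUGE, CIDEr, or METEOR implementations.

```python
from collections import Counter


def ngrams(sentence, n):
    """Count the n-grams of a sentence (naive lowercase whitespace tokenization)."""
    tokens = sentence.lower().rstrip(".").split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection clips the counts
    total = sum(cand.values())
    return overlap / total if total else 0.0


a = "A young girl standing on top of a tennis court."
b = "A giraffe standing on top of a green field."
c = "A shiny metal pot filled with some diced veggies."
d = "The pan on the stove has chopped vegetables in it."

# Different meaning, but many shared n-grams.
print(ngram_precision(b, a, 1), ngram_precision(b, a, 4))
# Same meaning, but no shared unigrams at all.
print(ngram_precision(d, c, 1))
```

Under this simplified measure, the pair (a)/(b) shares two thirds of its unigrams while the pair (c)/(d) shares none, which is the opposite of what a human judge would say about their meaning.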
This work starts from the hypothesis that semantic propositional content, not n-grams, is what matters.
That is, given “A young girl standing on top of a tennis court”, people tend to focus on the “facts” the caption states.
Here, the “facts” are as follows.
- (1) : there is a girl.
- (2) : girl is young.
- (3) : girl is standing.
- (4) : there is a court.
- (5) : court is tennis.
- (6) : girl is on top of court.

Following this idea, each caption is converted into a scene graph.

Contributions

Proposes SPICE (Semantic Propositional Image Caption Evaluation).
It agrees with human evaluation more closely than BLEU, METEOR, ROUGE-L, and CIDEr.
SPICE results can be broken down to analyze finer-grained questions such as “which model understands colors better?”.

Methods

SPICE Metric

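The goal is to measure the similarity between a candidate caption (Candidate, $c$) and a set of reference captions (Reference, $S=\{s_1, \cdots, s_m\}$).
To do this, a scene graph $G(c)$ is built for the candidate caption and a scene graph $G(S)$ for the reference captions.
Here $G(S)$ is the union of the scene graphs $G(s_i)$ of the individual reference captions $s_i\in S$.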
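The union that forms $G(S)$ can be sketched as follows. The dictionary layout (`objects`, `relations`, `attributes`), the function name `merge_scene_graphs`, and the two hand-written reference graphs are assumptions made for illustration; in practice the graphs come from a parser, and merging may also involve synonym handling, which this sketch omits.

```python
def merge_scene_graphs(graphs):
    """Union the objects, relations, and attributes of several scene graphs."""
    merged = {"objects": set(), "relations": set(), "attributes": set()}
    for graph in graphs:
        for key in merged:
            merged[key] |= graph[key]
    return merged


# Hypothetical, hand-written scene graphs G(s_1) and G(s_2) for two reference captions.
g_s1 = {
    "objects": {"girl", "court"},
    "relations": {("girl", "on", "court")},
    "attributes": {("girl", "young")},
}
g_s2 = {
    "objects": {"girl", "racket"},
    "relations": {("girl", "holding", "racket")},
    "attributes": {("girl", "young")},
}

g_S = merge_scene_graphs([g_s1, g_s2])
print(sorted(g_S["objects"]))
```

Because sets are used, an element that appears in several reference captions, such as `("girl", "young")`, is counted only once in $G(S)$.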

Semantic Parsing

To represent a caption as a scene graph, <objects, relations, attributes> are parsed from the caption.

$$ G(c)= \langle O(c), E(c), K(c) \rangle $$

The scene graph $G(c)$ is the caption's parsed objects, relations, and attributes, each represented as a set.

Scene Graph → Logical Tuple

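Next, to measure similarity using the scene graphs built from the captions, each graph is converted into logical tuples.
Letting $T$ be the function that converts a scene graph into tuples, it is defined as follows.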

$$ T(G(c))\triangleq O(c) \cup E(c) \cup K(c) $$

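Each tuple consists of one to three elements drawn from objects, attributes, and relations.
$O(c)$ is the set of objects, $E(c)$ is the set of hyper-edges between objects (relations), and $K(c)$ is the set of object attributes.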

For example, the logical tuples of “A young girl standing on top of a tennis court” are { (girl), (court), (girl, young), (girl, standing), (court, tennis), (girl, on-top-of, court) }.
- Objects : girl, court
- Relations : on top of
- Attributes : young, standing, tennis
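The conversion $T(G(c)) \triangleq O(c) \cup E(c) \cup K(c)$ can be written out directly for this example. The `SceneGraph` dataclass and `to_tuples` function are hypothetical names chosen for this sketch, and the graph is filled in by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    objects: set = field(default_factory=set)     # O(c): object names
    relations: set = field(default_factory=set)   # E(c): (subject, relation, object)
    attributes: set = field(default_factory=set)  # K(c): (object, attribute)


def to_tuples(graph):
    """T(G(c)): union of 1-tuples (objects), 3-tuples (relations), 2-tuples (attributes)."""
    return {(o,) for o in graph.objects} | set(graph.relations) | set(graph.attributes)


# Hand-built graph for "A young girl standing on top of a tennis court".
g = SceneGraph(
    objects={"girl", "court"},
    relations={("girl", "on-top-of", "court")},
    attributes={("girl", "young"), ("girl", "standing"), ("court", "tennis")},
)

tuples = to_tuples(g)
print(sorted(tuples, key=lambda t: (len(t), t)))
```

The tuple length alone tells the kind of fact: length 1 is an object, length 2 an attribute, and length 3 a relation.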

F-score Calculation

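Finally, the SPICE score is obtained by computing precision, recall, and F-score from how well the candidate's logical tuples match the reference's logical tuples.
Here $\otimes$ is the binary matching operator that returns the tuples matched between the two sets.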

$$
\begin{aligned}
P(c,S) &= \frac{|T(G(c)) \otimes T(G(S))|}{|T(G(c))|} \\
R(c,S) &= \frac{|T(G(c)) \otimes T(G(S))|}{|T(G(S))|} \\
SPICE(c,S) &= F_1(c,S)=\frac{2 \cdot P(c,S) \cdot R(c,S)}{P(c,S) + R(c,S)}
\end{aligned}
$$
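A minimal sketch of the precision, recall, and F-score computation is shown below. It assumes the matching operator can be approximated by exact set intersection; the paper's matching is more lenient (it also accepts synonyms), so this sketch would underestimate the real score. The function name `spice_f1` and the reference tuples are made up for illustration.

```python
def spice_f1(cand_tuples, ref_tuples):
    """Return (precision, recall, F1) over logical tuples, using exact matching."""
    matched = cand_tuples & ref_tuples  # stand-in for the matching operator
    precision = len(matched) / len(cand_tuples) if cand_tuples else 0.0
    recall = len(matched) / len(ref_tuples) if ref_tuples else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Candidate: "A young girl standing on top of a tennis court" (6 tuples).
cand = {
    ("girl",), ("court",),
    ("girl", "young"), ("girl", "standing"), ("court", "tennis"),
    ("girl", "on-top-of", "court"),
}
# Hypothetical reference tuple set T(G(S)) (8 tuples).
ref = {
    ("girl",), ("court",), ("racket",),
    ("girl", "young"), ("girl", "smiling"), ("court", "tennis"),
    ("girl", "holding", "racket"), ("girl", "on-top-of", "court"),
}

p, r, f = spice_f1(cand, ref)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this toy case 5 of the 6 candidate tuples are matched, so $P = 5/6$, $R = 5/8$, and $F_1 = 5/7$.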

Experiment

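M1 and M2 are human judgments of overall caption quality: whether a caption is comparable to a human caption and whether it passes a Turing test.
M3 to M5 are human judgments related to correctness and detail.
SPICE scores high on all of them.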

These are the scores of captioning models and of humans, measured on the 2015 COCO dataset.
They report SPICE together with how well each system produces objects, relations, attributes, colors, counts, and sizes.
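One way to picture this breakdown is to compute the F-score on subsets of the tuples. The sketch below is an assumption about how such a split could be done, not the paper's code: objects, attributes, and relations are separated by tuple length, color tuples are picked out with a small hypothetical word list `COLOR_WORDS`, and matching is exact.

```python
def f1(cand, ref):
    """F1 between two tuple sets using exact matching."""
    matched = cand & ref
    if not matched:
        return 0.0
    precision = len(matched) / len(cand)
    recall = len(matched) / len(ref)
    return 2 * precision * recall / (precision + recall)


def by_length(tuples, n):
    """Select tuples of length n (1: objects, 2: attributes, 3: relations)."""
    return {t for t in tuples if len(t) == n}


COLOR_WORDS = {"red", "green", "blue", "white", "black", "yellow"}  # hypothetical list


def color_tuples(tuples):
    """Select attribute tuples whose attribute is a color word."""
    return {t for t in tuples if len(t) == 2 and t[1] in COLOR_WORDS}


# Made-up candidate and reference tuple sets.
cand = {("dog",), ("frisbee",), ("dog", "black"), ("frisbee", "red"),
        ("dog", "catching", "frisbee")}
ref = {("dog",), ("frisbee",), ("grass",), ("dog", "black"), ("frisbee", "yellow"),
       ("dog", "catching", "frisbee")}

object_f1 = f1(by_length(cand, 1), by_length(ref, 1))
relation_f1 = f1(by_length(cand, 3), by_length(ref, 3))
color_f1 = f1(color_tuples(cand), color_tuples(ref))
print(object_f1, relation_f1, color_f1)
```

Here the candidate gets the relation right but one of two colors wrong, so the color F-score drops to 0.5 while the relation F-score stays at 1.0.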
