Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text

Visual AI Lab, The University of Hong Kong

Abstract

This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning.

To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method.

Our contributions are listed as follows:

  • We introduce the novel panoptic captioning task, which strives to seek the minimum text equivalence of an image -- an ambitious yet challenging goal. We formulate it as the task of generating a comprehensive textual description composed of five distinct dimensions, and contribute a comprehensive PancapScore metric for reliable evaluation.
  • We propose an effective data engine named PancapEngine to produce high-quality data. We also contribute the SA-Pancap benchmark for model training and evaluation, which includes a high-quality validation set and a human-curated test set for reliable evaluation.
  • We propose a simple yet effective method named PancapChain to improve panoptic captioning, which decouples the challenging panoptic captioning task into multiple subtasks. Extensive experiments demonstrate the effectiveness and value of our task and model.

Background and Conception

Representing images by textual descriptions is a fundamental topic in the fields of computer vision and natural language processing, and it benefits various applications, e.g., cross-modal retrieval, multi-modal learning, and safe content generation. While prior works have explored various image caption formats, identifying the most effective format remains an open challenge. The most concise captions, which describe only primary entity categories, often sacrifice critical details like entity attributes. Conversely, highly detailed representations, such as paragraphs detailing all pixel-level semantics and their interrelations, are computationally burdensome due to their length.

Inspired by these considerations, this work conceives of finding the minimum text equivalence of an image, an ambitious yet challenging goal: developing a concise textual description that comprehensively captures an image's essential semantic elements. Conceptually, achieving minimum text equivalence can be seen as aligning images and text in the data space, whereas existing image-text alignment models like CLIP perform this alignment in the embedding space. Such text representations would maximize the utility of image information for learning and downstream applications.

Task Formulation of Panoptic Captioning

This work introduces the task of panoptic captioning, which strives to seek the minimum text equivalence of images. Our work serves as the initial effort towards this challenging task. To make the problem tractable, we formulate panoptic captioning as the task of generating a comprehensive textual description for an image, which encapsulates all entity instances, their respective locations and attributes, relationships among instances, as well as global image state.

  • Semantic tag refers to the category label assigned to each entity instance in an image. Panoptic captioning requires identifying all entity instances and assigning a category label to each instance.
  • Location refers to the spatial position of an entity instance, represented as a bounding box. By introducing bounding boxes, panoptic captions can more accurately describe the locations and occupied regions of entity instances, which also helps distinguish entity instances with similar attributes more easily.
  • Attribute refers to characteristics or properties that describe an entity instance's appearance, state or quality. The attribute dimension encompasses a wide range of semantic content types, e.g., color, shape, material, texture, type, text rendering.
  • Relation refers to connections or interactions between different entity instances within an image. The relation dimension encompasses a wide range of semantic content types, such as position relation (e.g., A is behind B), part-whole relation (e.g., A is a part of B) and action relation (e.g., A kicks B).
  • Global image state refers to the overall characteristics of an image that provide a holistic understanding of its content, without focusing on specific entity instances within the image.
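To make these five dimensions concrete, the sketch below shows one hypothetical structured representation that a panoptic caption could be parsed into. The class and field names are illustrative assumptions of ours, not part of the paper's formulation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EntityInstance:
    semantic_tag: str                      # category label, e.g. "dog"
    bbox: Tuple[int, int, int, int]        # location as (x1, y1, x2, y2)
    attributes: List[str] = field(default_factory=list)  # e.g. ["brown", "furry"]


@dataclass
class Relation:
    subject_id: int    # index into PanopticCaption.entities
    predicate: str     # e.g. "is behind", "is a part of", "kicks"
    object_id: int


@dataclass
class PanopticCaption:
    entities: List[EntityInstance]
    relations: List[Relation]
    global_state: str  # holistic description, e.g. "a sunny outdoor photo"
```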

Evaluation Metric: PancapScore

[Figure: Overview of the proposed PancapScore metric]

The figure above presents an overview of our proposed PancapScore metric. PancapScore first extracts semantic content from captions, and then evaluates model performance through entity instance matching and instance-aware question answering (QA). Existing captioning metrics cannot effectively evaluate performance on panoptic captioning, due to the fundamental formulation differences between existing captioning tasks and our panoptic captioning task.
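The matching step can be pictured as a one-to-one assignment between predicted and ground-truth entity instances. The sketch below is our own simplified illustration of one plausible ingredient (greedy matching on box overlap plus tag agreement); the actual PancapScore pipeline additionally relies on LLM-based content extraction and instance-aware QA, which are not reproduced here.

```python
# Simplified instance matching: pair predicted and ground-truth instances
# by bounding-box IoU, restricted to pairs whose semantic tags agree.
from itertools import product


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def match_instances(preds, gts, iou_thresh=0.5):
    """Greedy one-to-one matching of predicted to ground-truth instances.

    Each instance is a dict with "tag" and "bbox" keys; a pair is a candidate
    match only if the tags agree and the boxes overlap sufficiently.
    """
    candidates = sorted(
        ((iou(p["bbox"], g["bbox"]), i, j)
         for (i, p), (j, g) in product(enumerate(preds), enumerate(gts))
         if p["tag"] == g["tag"]),
        reverse=True,
    )
    matched_p, matched_g, pairs = set(), set(), []
    for score, i, j in candidates:
        if score >= iou_thresh and i not in matched_p and j not in matched_g:
            matched_p.add(i)
            matched_g.add(j)
            pairs.append((i, j, score))
    return pairs  # matched pairs then feed the per-dimension QA-based scoring
```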

Data Engine and Benchmark

To address this new and challenging task, our work proposes an effective data engine named PancapEngine to produce high-quality data. Our PancapEngine first detects diverse categories of entities in images using an elaborate entity detection suite. We then employ state-of-the-art MLLMs to generate comprehensive panoptic captions using entity-aware prompts, ensuring data quality through caption consistency checks across different MLLMs.
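The overall data flow described above can be sketched as follows. Here `detector`, the `mllms` callables, and `checker` are hypothetical stand-ins for the entity detection suite, the captioning MLLMs, and the cross-model consistency check; the prompt text and agreement threshold are placeholders as well.

```python
# A high-level sketch of the PancapEngine data flow (illustrative only).

ENTITY_PROMPT = (
    "Describe this image exhaustively. Cover every listed entity with its "
    "bounding box, attributes, relations to other entities, and the overall "
    "image state.\nDetected entities: {entities}"
)


def build_panoptic_caption(image, mllms, detector, checker, min_agreement=0.8):
    # Stage 1: detect diverse entity categories with the detection suite.
    entities = detector(image)  # e.g. [{"tag": "dog", "bbox": (x1, y1, x2, y2)}, ...]

    # Stage 2: prompt each MLLM with an entity-aware prompt.
    prompt = ENTITY_PROMPT.format(entities=entities)
    captions = [mllm(image, prompt) for mllm in mllms]

    # Stage 3: keep the sample only if the captions agree across MLLMs.
    if checker(captions) >= min_agreement:
        return captions[0], entities
    return None  # discarded as low-quality
```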

Based on our PancapEngine, we contribute a new SA-Pancap benchmark for the panoptic captioning task. We select SA-1B as the data source due to its high image quality and data diversity. Overall, our SA-Pancap benchmark consists of 9,000 training and 500 validation images paired with auto-generated panoptic captions, and 130 test images paired with human-curated panoptic captions.

The Proposed Model: PancapChain

[Figure: Overview of the proposed PancapChain method]

The figure above presents an overview of our proposed PancapChain method. The key idea is to decouple the challenging panoptic captioning task into multiple stages and train the model to generate panoptic captions step by step. Specifically, PancapChain explicitly decouples the task into four stages, namely entity instance localization, semantic tag assignment, extra instance discovery and panoptic caption generation.
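As a rough illustration of the staged idea, the sketch below chains four prompts so that each stage conditions on the outputs of the previous ones. The prompt wordings and the `model` interface are placeholder assumptions, not PancapChain's actual training setup.

```python
# Illustrative staged generation: later stages see earlier intermediate outputs.

def pancap_chain(model, image):
    # Stage 1: entity instance localization -- propose bounding boxes.
    boxes = model(image, "List the bounding boxes of all entity instances.")

    # Stage 2: semantic tag assignment -- name each localized instance.
    tags = model(image, f"Assign a category label to each instance: {boxes}")

    # Stage 3: extra instance discovery -- recover instances missed so far.
    extra = model(image, f"Find entity instances missing from: {tags}")

    # Stage 4: panoptic caption generation -- compose the full description.
    caption = model(
        image,
        "Write a panoptic caption covering all instances, their locations, "
        f"attributes, relations, and the global image state: {tags} {extra}",
    )
    return caption
```

Decoupling the task this way lets the final stage condition on explicit intermediate outputs (boxes, tags, recovered instances) instead of asking the model to produce everything in a single pass.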

Leaderboard on the SA-Pancap Benchmark

(Note: Model names with the suffix "-Tuned" denote models tuned on the training set of SA-Pancap)

Validation Set

| Model | Entity | Location | Attribute | Relation | Global | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Molmo-72B | 52.06 | 10.03 | 36.88 | 25.90 | 76.78 | 132.53 |
| LLaVA-OneVision-72B | 54.20 | 13.79 | 38.94 | 27.80 | 85.52 | 143.28 |
| Qwen2-VL-72B | 49.85 | 12.92 | 37.83 | 24.71 | 86.30 | 133.96 |
| Qwen2.5-VL-72B | 54.08 | 19.70 | 40.00 | 27.24 | 85.34 | 149.54 |
| NVLM-72B | 54.69 | 10.78 | 42.49 | 30.40 | 86.21 | 146.97 |
| InternVL-2.5-78B | 54.68 | 15.05 | 41.81 | 27.41 | 88.37 | 147.79 |
| Llama-3.2-90B | 52.87 | 20.73 | 39.94 | 27.09 | 83.40 | 148.98 |
| GPT-4o | 50.89 | 10.12 | 40.54 | 25.40 | 88.85 | 135.83 |
| Gemini-2.0-Pro | 53.79 | 16.66 | 43.14 | 28.52 | 86.50 | 150.75 |
| LLaVA-1.5-13B-Tuned | 54.92 | 27.76 | 41.27 | 28.69 | 81.94 | 161.84 |
| ShareGPT4V-13B-Tuned | 55.02 | 23.81 | 40.53 | 29.13 | 82.16 | 156.70 |
| PancapChain-13B (Ours) | 57.56 | 30.34 | 44.78 | 34.61 | 84.59 | 175.75 |

Test Set

| Model | Entity | Location | Attribute | Relation | Global | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Molmo-72B | 50.92 | 14.00 | 38.10 | 38.10 | 68.49 | 130.55 |
| LLaVA-OneVision-72B | 53.62 | 15.16 | 41.52 | 25.63 | 82.39 | 144.17 |
| Qwen2-VL-72B | 48.19 | 12.90 | 38.48 | 20.44 | 84.13 | 128.42 |
| Qwen2.5-VL-72B | 54.42 | 25.11 | 42.33 | 26.32 | 87.12 | 156.89 |
| NVLM-72B | 57.79 | 11.53 | 46.48 | 29.48 | 78.60 | 153.14 |
| InternVL-2.5-78B | 55.90 | 18.26 | 43.63 | 28.72 | 81.46 | 154.66 |
| Llama-3.2-90B | 51.64 | 21.88 | 40.55 | 25.33 | 79.55 | 79.55 |
| GPT-4o | 53.51 | 14.55 | 43.86 | 27.38 | 87.08 | 148.01 |
| Gemini-2.0-Pro | 53.89 | 21.59 | 45.62 | 27.99 | 87.91 | 157.88 |
| LLaVA-1.5-13B-Tuned | 54.33 | 30.57 | 41.81 | 30.62 | 75.73 | 164.92 |
| ShareGPT4V-13B-Tuned | 52.94 | 25.56 | 39.56 | 25.11 | 80.36 | 151.21 |
| PancapChain-13B (Ours) | 56.45 | 31.76 | 44.46 | 32.54 | 79.85 | 173.19 |

An Application Example: Image-Text Retrieval

We apply our model to the downstream image-text retrieval task to demonstrate the application potential of our task and model. Specifically, to perform image-text retrieval, we first employ image captioners to generate a description for a given query image, and then retrieve similar descriptions using the NV-Embed-v2 text embedding model. As shown in the table below, on the challenging DOCCI dataset, our PancapChain achieves performance comparable to the state-of-the-art MATE model, despite using no image-text retrieval training data or specialized module designs. Our PancapChain also outperforms state-of-the-art image captioners (e.g., ShareGPT4V), demonstrating its effectiveness in capturing image details. In addition, using the Capture metric, we show that PancapChain retrieves descriptions from the text corpus that are more semantically aligned with ground-truth descriptions, excelling on the object, attribute, and relation dimensions.

[Table: Image-text retrieval results on the DOCCI dataset]
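The retrieval protocol can be sketched as follows. Here `captioner` stands in for PancapChain and `embed_texts` for the NV-Embed-v2 encoder (assumed to return NumPy arrays); this is a schematic of the pipeline described above, not the exact evaluation code.

```python
# Caption-then-retrieve: describe the query image, embed the caption, and
# rank corpus texts by cosine similarity.
import numpy as np


def retrieve(query_image, corpus_texts, captioner, embed_texts, top_k=5):
    # 1. Describe the query image with the captioner.
    query_caption = captioner(query_image)

    # 2. Embed the query caption and the text corpus.
    q = embed_texts([query_caption])[0]    # shape: (d,)
    corpus = embed_texts(corpus_texts)     # shape: (n, d)

    # 3. Rank corpus texts by cosine similarity to the query caption.
    q = q / np.linalg.norm(q)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ q
    return np.argsort(-scores)[:top_k]
```

Under this protocol, retrieval quality hinges entirely on how faithfully the generated caption preserves image details, which is what makes it a useful downstream probe for panoptic captioning.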

Image "Reconstruction" from Captions

[Figure: Image reconstruction examples from generated captions]

We conduct an image reconstruction experiment by pairing captioners with text-to-image generation models. This experiment serves as a proxy for evaluating the completeness of image descriptions: if a caption captures all essential visual elements, a text-to-image model should be able to reconstruct an image similar to the original one. Based on a generated caption for an input image, we adopt the text-to-image generation model PixArt-Σ to generate a new image. As shown in the figure above, PixArt-Σ paired with our PancapChain model generates images that are more similar to the original images than those produced with other baseline captioners.
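The protocol itself is simple; the sketch below summarizes it, with `text_to_image` standing in for PixArt-Σ and `image_similarity` for any perceptual or embedding-based similarity measure. Both are placeholder assumptions rather than the paper's exact setup.

```python
# Reconstruction-as-proxy: caption an image, regenerate an image from that
# caption, and measure how close the regenerated image is to the original.

def reconstruction_score(image, captioner, text_to_image, image_similarity):
    caption = captioner(image)              # e.g. a panoptic caption
    reconstruction = text_to_image(caption)  # e.g. PixArt-Sigma generation
    return image_similarity(image, reconstruction)
```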