Mr. DETR: Instructive Multi-Route Training for Detection Transformers

1Visual AI Lab, The University of Hong Kong
2Meituan Inc.

Performance

Improvement across baselines
Convergence curve

Qualitative Results

Our Findings

In a multi-task training framework that jointly performs one-to-one and one-to-many prediction, making any single component independent significantly benefits the primary one-to-one prediction route, even when all other components are shared.


No.  Configuration  Routes  o2o AP  o2m AP (w/ NMS)
(1) One-to-one only 1 47.6 -
(2) Share All 1 41.6 (-6.0) 41.6
(3) Not shared Self-Attention 2 49.7 (+2.1) 50.3
(4) Not shared Cross-Attention 2 49.2 (+1.6) 50.0
(5) Not shared FFN 2 49.6 (+2.0) 50.1
(6) Shared Self-Attention 2 49.4 (+1.8) 50.3
(7) Shared Cross-Attention 2 49.4 (+1.8) 50.0
(8) Shared FFN 2 49.2 (+1.6) 50.0
(9) (3) + (4) 3 49.4 (+1.8) 49.9
(10) (3) + (5) 3 50.0 (+2.4) 50.8
(11) (4) + (5) 3 49.0 (+1.4) 49.6
(12) (3) + (4) + (5) 4 49.6 (+2.0) 50.2
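The one-to-one vs. one-to-many distinction underlying the table can be sketched with a toy label-assignment example. This is a minimal sketch, not the paper's implementation: the cost matrix, `k`, and both helper names are illustrative, and it assumes SciPy's Hungarian solver for the one-to-one case.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_assign(cost):
    """Hungarian matching: each ground-truth box is matched to exactly
    one query (standard DETR-style one-to-one assignment)."""
    q_idx, g_idx = linear_sum_assignment(cost)  # cost: (num_queries, num_gt)
    return {g: [q] for q, g in zip(q_idx.tolist(), g_idx.tolist())}

def one_to_many_assign(cost, k=2):
    """One-to-many assignment: each ground truth is matched to its
    k lowest-cost queries, so more queries receive supervision."""
    order = np.argsort(cost, axis=0)            # queries sorted per GT column
    return {g: order[:k, g].tolist() for g in range(cost.shape[1])}

# Toy matching cost: 5 queries (rows) x 2 ground-truth boxes (columns)
cost = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.3, 0.7],
                 [0.9, 0.1],
                 [0.5, 0.5]])

print(one_to_one_assign(cost))     # each GT gets one query
print(one_to_many_assign(cost))    # each GT gets its k best queries
```

With one-to-many assignment, queries 0 and 2 both supervise the first box, which is why an NMS step is needed on the o2m route at evaluation time.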

Method

Our method includes three training routes: Route-1, Route-2, and Route-3. All three routes share the same object queries and detection heads for classification and regression. Route-2 serves as the primary route for one-to-one prediction, identical to the baseline models. Route-1 shares self-attention and cross-attention but uses an independent feed-forward network (o2m FFN) for one-to-many prediction. Route-3, sharing all components with the primary route, introduces a novel instructive self-attention, implemented by adding a learnable instruction token to the object queries to guide them and the subsequent network for one-to-many prediction. During inference, the auxiliary routes, Route-1 and Route-3, are discarded.
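The three routes described above can be sketched as follows. This is a minimal single-head numpy sketch under toy dimensions (the actual model uses multi-head attention, 300 queries, 10 instruction tokens, and larger feature dimensions); all weight and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_q, n_inst = 16, 8, 2   # toy sizes: feature dim, object queries, instruction tokens

def attn(q, kv, W):
    """Minimal single-head attention (shared across routes)."""
    scores = (q @ W @ kv.T) / np.sqrt(d)
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ kv

W_attn    = rng.normal(size=(d, d)) * 0.1   # shared self-attention weights
W_ffn_o2o = rng.normal(size=(d, d)) * 0.1   # shared FFN (primary route)
W_ffn_o2m = rng.normal(size=(d, d)) * 0.1   # independent o2m FFN (Route-1)
inst      = rng.normal(size=(n_inst, d))    # learnable instruction tokens (Route-3)

queries = rng.normal(size=(n_q, d))         # shared object queries

# Route-2 (primary, one-to-one): shared self-attention + shared FFN
h = attn(queries, queries, W_attn)
route2 = h @ W_ffn_o2o

# Route-1 (one-to-many): same attention output, but an independent o2m FFN
route1 = h @ W_ffn_o2m

# Route-3 (one-to-many): prepend instruction tokens, share all weights,
# then drop the instruction tokens before the shared FFN / detection heads
x = np.concatenate([inst, queries], axis=0)
route3 = attn(x, x, W_attn)[n_inst:] @ W_ffn_o2o
```

At inference only the Route-2 path is executed, so the auxiliary routes add no deployment cost.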

Illustration of our proposed multi-route training method

Quantitative Results

The performance on the COCO 2017 validation set. All models are based on the ResNet-50 backbone.
The performance on the COCO 2017 validation set based on the Swin-L backbone.

Extension to Instance Segmentation

Epochs  w/ Mr. DETR  Mask mAP      Box mAP
12      ✗            32.4          46.5
12      ✓            36.0 (+3.6)   49.5 (+3.0)
24      ✗            35.1          48.6
24      ✓            37.6 (+2.5)   50.3 (+1.7)

Instance segmentation results on the COCO 2017 validation set. All experiments are based on Deformable-DETR++ with 300 queries and a ResNet-50 backbone.

Effectiveness of our Instructive Self-Attention

Route-1  Route-2  Route-3  mAP          AP50  AP75
         ✓                 47.6         65.8  51.8
✓        ✓                 49.6 (+2.0)  67.4  54.2
         ✓        ✓        50.4 (+2.8)  67.9  55.3
✓        ✓        ✓        50.7 (+3.1)  68.2  55.4

The ablation study of different routes in our method. 'Route-1': the auxiliary training route with independent FFN. 'Route-2': the primary route for one-to-one prediction. 'Route-3': the auxiliary training route with instructive self-attention.

We further visualize the attention maps of the instructive self-attention. When the 300 object queries act as queries and the 10 instruction tokens as keys, nearly all 300 object queries exhibit strong activation on the instruction tokens. This indicates that the instruction tokens effectively convey information to the object queries and subsequent network layers, guiding the model toward one-to-many prediction.
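The block of the attention map that the figure visualizes can be inspected as follows. This is a toy sketch on synthetic data: the attention matrix is randomly generated (with an artificial boost standing in for the strong query-to-instruction activation observed in the paper), not taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inst, n_obj = 10, 300          # 10 instruction tokens + 300 object queries
n = n_inst + n_obj

# Synthetic pre-softmax attention logits; the +3.0 boost mimics the
# observed strong activation of object queries on instruction tokens.
logits = rng.normal(size=(n, n))
logits[n_inst:, :n_inst] += 3.0

A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)            # row-softmax: rows = Query, cols = Key

# The sub-block of interest: object queries attending to instruction tokens
q2inst = A[n_inst:, :n_inst]                 # shape (300, 10)
frac = q2inst.sum(axis=1).mean()             # avg attention mass on instruction tokens
```

In the real visualization this sub-block corresponds to the first 10 key columns of the instructive self-attention map.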

Visualization of attention maps for instructive self-attention. We use Deformable-DETR++ with 300 object queries and 10 instruction tokens for this visualization. The first 10 tokens are instruction tokens. The vertical and horizontal axes represent the Query and Key, respectively.

BibTeX

@inproceedings{zhang2024mr,
  title={Mr. DETR: Instructive Multi-Route Training for Detection Transformers},
  author={Zhang, Chang-Bin and Zhong, Yujie and Han, Kai},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}