Mr. DETR: Instructive Multi-Route Training for Detection Transformers

1Visual AI Lab, The University of Hong Kong
2Meituan Inc.

Performance

Improvement across baselines
Convergence curve

Qualitative Results

Our Findings

In a multi-task training framework that jointly performs one-to-one and one-to-many prediction, making any single component independent significantly benefits the primary one-to-one prediction route, even when all other components are shared.


No.  Configuration  Routes  o2o AP  o2m AP (w/ NMS)
(1) One-to-one only 1 47.6 -
(2) Share All 1 41.6 (-6.0) 41.6
(3) Not shared Self-Attention 2 49.7 (+2.1) 50.3
(4) Not shared Cross-Attention 2 49.2 (+1.6) 50.0
(5) Not shared FFN 2 49.6 (+2.0) 50.1
(6) Shared Self-Attention 2 49.4 (+1.8) 50.3
(7) Shared Cross-Attention 2 49.4 (+1.8) 50.0
(8) Shared FFN 2 49.2 (+1.6) 50.0
(9) (3) + (4) 3 49.4 (+1.8) 49.9
(10) (3) + (5) 3 50.0 (+2.4) 50.8
(11) (4) + (5) 3 49.0 (+1.4) 49.6
(12) (3) + (4) + (5) 4 49.6 (+2.0) 50.2
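The one-to-one vs. one-to-many distinction underlying the table can be sketched with a toy label-assignment example. This is a minimal sketch, not the paper's implementation: the cost matrix, `k`, and both helper names are illustrative, and it assumes SciPy's Hungarian solver for the one-to-one case.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_assign(cost):
    """Hungarian matching: each ground-truth box is matched to exactly
    one query (standard DETR-style one-to-one assignment)."""
    q_idx, g_idx = linear_sum_assignment(cost)  # cost: (num_queries, num_gt)
    return {g: [q] for q, g in zip(q_idx.tolist(), g_idx.tolist())}

def one_to_many_assign(cost, k=2):
    """One-to-many assignment: each ground truth is matched to its
    k lowest-cost queries, so more queries receive supervision."""
    order = np.argsort(cost, axis=0)            # queries sorted per GT column
    return {g: order[:k, g].tolist() for g in range(cost.shape[1])}

# Toy matching cost: 5 queries (rows) x 2 ground-truth boxes (columns)
cost = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.3, 0.7],
                 [0.9, 0.1],
                 [0.5, 0.5]])

print(one_to_one_assign(cost))     # each GT gets one query
print(one_to_many_assign(cost))    # each GT gets its k best queries
```

With one-to-many assignment, queries 0 and 2 both supervise the first box, which is why an NMS step is needed on the o2m route at evaluation time.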

Method

Our method includes three training routes: Route-1, Route-2, and Route-3. All three routes share the same object queries and detection heads for classification and regression. Route-2 serves as the primary route for one-to-one prediction, identical to the baseline models. Route-1 shares self-attention and cross-attention but uses an independent feed-forward network (o2m FFN) for one-to-many prediction. Route-3, sharing all components with the primary route, introduces a novel instructive self-attention, implemented by adding a learnable instruction token to the object queries to guide them and the subsequent network for one-to-many prediction. During inference, the auxiliary routes, Route-1 and Route-3, are discarded.
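The three routes described above can be sketched as follows. This is a minimal single-head numpy sketch under toy dimensions (the actual model uses multi-head attention, 300 queries, 10 instruction tokens, and larger feature dimensions); all weight and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_q, n_inst = 16, 8, 2   # toy sizes: feature dim, object queries, instruction tokens

def attn(q, kv, W):
    """Minimal single-head attention (shared across routes)."""
    scores = (q @ W @ kv.T) / np.sqrt(d)
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ kv

W_attn    = rng.normal(size=(d, d)) * 0.1   # shared self-attention weights
W_ffn_o2o = rng.normal(size=(d, d)) * 0.1   # shared FFN (primary route)
W_ffn_o2m = rng.normal(size=(d, d)) * 0.1   # independent o2m FFN (Route-1)
inst      = rng.normal(size=(n_inst, d))    # learnable instruction tokens (Route-3)

queries = rng.normal(size=(n_q, d))         # shared object queries

# Route-2 (primary, one-to-one): shared self-attention + shared FFN
h = attn(queries, queries, W_attn)
route2 = h @ W_ffn_o2o

# Route-1 (one-to-many): same attention output, but an independent o2m FFN
route1 = h @ W_ffn_o2m

# Route-3 (one-to-many): prepend instruction tokens, share all weights,
# then drop the instruction tokens before the shared FFN / detection heads
x = np.concatenate([inst, queries], axis=0)
route3 = attn(x, x, W_attn)[n_inst:] @ W_ffn_o2o
```

At inference only the Route-2 path is executed, so the auxiliary routes add no deployment cost.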

Illustration of our proposed multi-route training method

Quantitative Results

The performance on the COCO 2017 validation set. All models are based on the ResNet-50 backbone.
The performance on the COCO 2017 validation set based on the Swin-L backbone.

Extension to Instance Segmentation

Epochs  w/ Mr. DETR  Mask mAP      Box mAP
12      ✗            32.4          46.5
12      ✓            36.0 (+3.6)   49.5 (+3.0)
24      ✗            35.1          48.6
24      ✓            37.6 (+2.5)   50.3 (+1.7)

Instance segmentation results on the COCO 2017 validation set. All experiments are based on Deformable-DETR++ with 300 queries and a ResNet-50 backbone.

Effectiveness of our Instructive Self-Attention

Route-1  Route-2  Route-3  mAP          AP50  AP75
         ✓                 47.6         65.8  51.8
✓        ✓                 49.6 (+2.0)  67.4  54.2
         ✓        ✓        50.4 (+2.8)  67.9  55.3
✓        ✓        ✓        50.7 (+3.1)  68.2  55.4

The ablation study of different routes in our method. 'Route-1': the auxiliary training route with independent FFN. 'Route-2': the primary route for one-to-one prediction. 'Route-3': the auxiliary training route with instructive self-attention.

We further visualize the attention maps of the instructive self-attention. When the 300 object queries act as queries and the 10 instruction tokens as keys, nearly all 300 object queries exhibit strong activation on the instruction tokens. This indicates that the instruction tokens effectively convey information to the object queries and subsequent network layers, guiding the model toward one-to-many prediction.
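The block of the attention map that the figure visualizes can be inspected as follows. This is a toy sketch on synthetic data: the attention matrix is randomly generated (with an artificial boost standing in for the strong query-to-instruction activation observed in the paper), not taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inst, n_obj = 10, 300          # 10 instruction tokens + 300 object queries
n = n_inst + n_obj

# Synthetic pre-softmax attention logits; the +3.0 boost mimics the
# observed strong activation of object queries on instruction tokens.
logits = rng.normal(size=(n, n))
logits[n_inst:, :n_inst] += 3.0

A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)            # row-softmax: rows = Query, cols = Key

# The sub-block of interest: object queries attending to instruction tokens
q2inst = A[n_inst:, :n_inst]                 # shape (300, 10)
frac = q2inst.sum(axis=1).mean()             # avg attention mass on instruction tokens
```

In the real visualization this sub-block corresponds to the first 10 key columns of the instructive self-attention map.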

Visualization of attention maps for instructive self-attention. We use Deformable-DETR++ with 300 object queries and 10 instruction tokens for this visualization. The first 10 tokens are instruction tokens. The vertical and horizontal axes represent the Query and Key, respectively.

BibTeX

@inproceedings{zhang2024mr,
  title={Mr. DETR: Instructive Multi-Route Training for Detection Transformers},
  author={Zhang, Chang-Bin and Zhong, Yujie and Han, Kai},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}