No | Configurations | Routes | o2o | o2m w/ NMS |
---|---|---|---|---|
(1) | One-to-one only | 1 | 47.6 | - |
(2) | Share All | 1 | 41.6 (-6.0) | 41.6 |
(3) | Not shared Self-Attention | 2 | 49.7 (+2.1) | 50.3 |
(4) | Not shared Cross-Attention | 2 | 49.2 (+1.6) | 50.0 |
(5) | Not shared FFN | 2 | 49.6 (+2.0) | 50.1 |
(6) | Shared Self-Attention | 2 | 49.4 (+1.8) | 50.3 |
(7) | Shared Cross-Attention | 2 | 49.4 (+1.8) | 50.0 |
(8) | Shared FFN | 2 | 49.2 (+1.6) | 50.0 |
(9) | (3) + (4) | 3 | 49.4 (+1.8) | 49.9 |
(10) | (3) + (5) | 3 | 50.0 (+2.4) | 50.8 |
(11) | (4) + (5) | 3 | 49.0 (+1.4) | 49.6 |
(12) | (3) + (4) + (5) | 4 | 49.6 (+2.0) | 50.2 |
Our method includes three training routes: Route-1, Route-2, and Route-3. All three routes share the same object queries and detection heads for classification and regression. Route-2 serves as the primary route for one-to-one prediction, identical to the baseline models. Route-1 shares self-attention and cross-attention but uses an independent feed-forward network (o2m FFN) for one-to-many prediction. Route-3, sharing all components with the primary route, introduces a novel instructive self-attention, implemented by adding a learnable instruction token to the object queries to guide them and the subsequent network for one-to-many prediction. During inference, the auxiliary routes, Route-1 and Route-3, are discarded.
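The route structure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `MultiRouteLayer`, its weight names, the single-head attention without separate query/key/value projections, and the toy sizes are all hypothetical. It also assumes that the instruction tokens are prepended to the object queries before self-attention and that their outputs are dropped afterwards. The shared detection heads are omitted.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    """Random stand-in for learned parameters or input features."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(x, w):
    """Apply a (d_in x d_out) weight matrix to every row vector of x."""
    return [[sum(v[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for v in x]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) for d in range(len(v[0]))])
    return out


def add(x, y):
    """Residual connection."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


class MultiRouteLayer:
    """Hypothetical decoder layer with three training routes."""

    def __init__(self, dim, num_instruction_tokens=10):
        self.w_self = rand_matrix(dim, dim)     # self-attention, shared by all routes
        self.w_cross = rand_matrix(dim, dim)    # cross-attention, shared by all routes
        self.w_ffn_o2o = rand_matrix(dim, dim)  # primary FFN: Route-2 and Route-3
        self.w_ffn_o2m = rand_matrix(dim, dim)  # independent o2m FFN: Route-1 only
        self.instruction_tokens = rand_matrix(num_instruction_tokens, dim)  # Route-3 only

    def forward(self, queries, memory, route):
        if route == 3:
            # Assumption: prepend the tokens, then keep only the query outputs.
            n = len(self.instruction_tokens)
            x = linear(self.instruction_tokens + queries, self.w_self)
            sa = attention(x, x, x)[n:]
        else:
            x = linear(queries, self.w_self)
            sa = attention(x, x, x)
        h = add(queries, sa)
        mem = linear(memory, self.w_cross)
        h = add(h, attention(h, mem, mem))
        w_ffn = self.w_ffn_o2m if route == 1 else self.w_ffn_o2o
        return add(h, linear(h, w_ffn))


layer = MultiRouteLayer(dim=4)
queries = rand_matrix(5, 4)   # toy stand-in for the object queries
memory = rand_matrix(7, 4)    # toy stand-in for the encoder features

# Training: all three routes process the same object queries.
train_outputs = {route: layer.forward(queries, memory, route) for route in (1, 2, 3)}
# Inference: only the primary route (Route-2) is kept.
inference_output = layer.forward(queries, memory, route=2)
```

In this sketch the two auxiliary routes add only `w_ffn_o2m` and `instruction_tokens` as extra parameters, and neither is touched on the Route-2 path, which is why dropping them at inference leaves the primary route unchanged.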
Epochs | w/ Mr. DETR | Mask mAP | Box mAP |
---|---|---|---|
12 | | 32.4 | 46.5 |
12 | ✔ | 36.0 (+3.6) | 49.5 (+3.0) |
24 | | 35.1 | 48.6 |
24 | ✔ | 37.6 (+2.5) | 50.3 (+1.7) |
Instance segmentation results on the COCO 2017 validation set. All experiments use Deformable-DETR++ with 300 queries and a ResNet-50 backbone.
Route-1 | Route-2 | Route-3 | mAP | AP50 | AP75 |
---|---|---|---|---|---|
 | ✔ | | 47.6 | 65.8 | 51.8 |
✔ | | | 49.6 (+2.0) | 67.4 | 54.2 |
 | | ✔ | 50.4 (+2.8) | 67.9 | 55.3 |
✔ | ✔ | ✔ | 50.7 (+3.1) | 68.2 | 55.4 |
Ablation study of the different routes in our method. 'Route-1': the auxiliary training route with an independent FFN. 'Route-2': the primary route for one-to-one prediction. 'Route-3': the auxiliary training route with instructive self-attention.
We further visualize the attention maps of the instructive self-attention. With the 300 object queries acting as queries and the 10 instruction tokens as keys, nearly all 300 object queries show strong activation on the instruction tokens. This indicates that the instruction tokens effectively convey information to the object queries and the subsequent network layers, helping the model achieve one-to-many predictions.
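One way to quantify such an attention map is to sum, for each object query, the softmax weight that falls on the instruction tokens. The sketch below is illustrative only: `instruction_attention_mass` is a hypothetical helper, the attention is single-head with no learned projections, the keys are assumed to be the instruction tokens concatenated with the object queries, and the inputs are random vectors rather than features from a trained model.

```python
import math
import random


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def instruction_attention_mass(queries, instruction_tokens):
    """Per object query, the total attention weight placed on the instruction tokens.

    Assumption: keys are the instruction tokens followed by the object queries.
    """
    keys = instruction_tokens + queries
    n_tokens = len(instruction_tokens)
    scale = math.sqrt(len(queries[0]))
    masses = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(logits)
        masses.append(sum(weights[:n_tokens]))
    return masses


random.seed(0)
dim = 8  # toy embedding size, chosen only for this example
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(300)]
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(10)]

# One value in [0, 1] per object query; values near 1 mean the query
# attends almost entirely to the instruction tokens.
masses = instruction_attention_mass(queries, tokens)
```

With trained weights, the observation above would correspond to most of these 300 values being large; with the random inputs used here the values carry no meaning beyond demonstrating the computation.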
@inproceedings{zhang2024mr,
title={Mr. DETR: Instructive Multi-Route Training for Detection Transformers},
author={Zhang, Chang-Bin and Zhong, Yujie and Han, Kai},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}