Fin3R


Fine-Tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

NeurIPS 2025

Weining Ren1     Hongjun Wang1     Xiao Tan2      Kai Han1     
1HKU     2Baidu VIS    



Fin3R fine-tunes the 3R family with monocular distillation on unlabelled datasets.

Abstract

We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses pointmaps of all input images in the coordinate system of a reference frame, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R tackles both issues with a single lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder—the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged.

Method

Fin3R fine-tunes the encoder of feed-forward reconstruction models with a custom LoRA. Purple dashed lines indicate distillation supervision on the canonical view (depth or pointmap); green dashed lines denote multi-view pointmap supervision. Note that during fine-tuning, the decoder is frozen and only the LoRA weights are updated.
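To make the parameter-efficient setup concrete, here is a minimal PyTorch sketch of the general LoRA idea applied to a single linear layer: the pretrained weight is frozen and only a low-rank update is trained. The class name `LoRALinear`, the toy layer sizes, and the rank/alpha values are illustrative assumptions, not the paper's actual adapter design.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha/r) * B A x, with W frozen."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init:
        self.scale = alpha / r               # update is zero at the start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Toy stand-in for one projection inside the image encoder.
layer = LoRALinear(nn.Linear(64, 64), r=4)
trainable = sorted(n for n, p in layer.named_parameters() if p.requires_grad)
# Only the low-rank factors A and B receive gradients; base.weight/bias stay frozen.
```

Because `B` is initialised to zero, the wrapped layer exactly reproduces the pretrained layer before fine-tuning begins, so training starts from the original model's behaviour.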

Monocular Depth Estimation

Our fine-tuning method consistently improves the monocular depth estimation quality of various feed-forward 3D reconstruction models, including both two-view and multi-view, relative depth and metric depth models.
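For readers unfamiliar with how relative-depth models are scored, the standard protocol aligns the prediction to ground truth with a least-squares scale and shift before computing error metrics. The sketch below shows this common alignment step; the function name and tensor shapes are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import torch

def align_scale_shift(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Least-squares alignment: find s, t minimising ||s * pred + t - gt||^2,
    then return the aligned prediction (standard for relative-depth evaluation)."""
    p, g = pred.flatten(), gt.flatten()
    A = torch.stack([p, torch.ones_like(p)], dim=1)       # (N, 2) design matrix
    sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution  # [[s], [t]]
    s, t = sol[0, 0], sol[1, 0]
    return s * pred + t

# Example: a prediction off by an affine transform aligns exactly.
gt = torch.rand(32, 32) + 0.5
pred = 0.5 * gt - 1.0
aligned = align_scale_shift(pred, gt)
```

In practice the alignment is computed only over valid (finite, positive-depth) pixels; metric-depth models skip this step and are evaluated directly.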

Multi-view Performance

Our fine-tuning method consistently improves the pose accuracy of various feed-forward 3D reconstruction models, even without pose supervision during training. This suggests that the decoder functions as an implicit feature matcher, leveraging the improved encoder features to enhance performance without requiring explicit pose labels.

Qualitative Comparison

Our fine-tuning method improves the fine details and robustness of baseline methods.

2D Depth Estimation Results


BibTeX

@inproceedings{ren2025fin3r,
  title={Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation},
  author={Ren, Weining and Wang, Hongjun and Tan, Xiao and Han, Kai},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}

Acknowledgements

This work is supported by the Hong Kong Research Grants Council - General Research Fund (Grant No. 17213825). Weining Ren is supported by the Hong Kong PhD Fellowship Scheme (HKPFS).