Single-reference unseen object 6D pose estimation reduces object onboarding effort by estimating the poses of arbitrary novel objects from only one reference view. Recent correspondence-based pipelines have achieved robust performance with vision foundation model (VFM) features. However, they typically treat these features as intra-view descriptors, so dense visual-semantic cues, including appearance, structure, and context, are insufficiently exchanged across views before geometric decoding. Consequently, the decoded point features may lack joint semantic and geometric discriminability, and correspondence estimation remains difficult in challenging cases. Instead of processing the features of each view independently, we build the correspondence pipeline around an early cross-view semantic prior. Specifically, cross-view semantic interaction (CVSI) lets dense query and reference VFM tokens exchange semantic context and form a cross-view prior. However, direct CVSI may disturb the VFM token structure, and the resulting semantic prior still needs 3D representation consistency for rigid correspondence. To make the CVSI prior reliable for 3D correspondence learning, we introduce two complementary training-time constraints: the intra-view structure preservation (IVSP) loss preserves the original intra-view token affinity structure during interaction, while the reference-anchored geometric consistency (RAGC) loss enforces spatial representation consistency of the decoded point features. The final pose is recovered from the learned correspondences through weighted SVD. We further construct a challenging view-pair protocol from the BOP Challenge datasets YCB-V and TUD-L to evaluate robustness in difficult matching scenarios. Extensive experiments on six benchmarks under different view-pair settings show that our method achieves state-of-the-art performance while maintaining comparable inference speed.
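The last step of the pipeline, recovering a rigid pose from weighted correspondences via SVD, can be illustrated with a minimal NumPy sketch of the standard weighted Kabsch solution. This is not the authors' implementation: the function name `weighted_svd_pose`, the argument names, and the synthetic data are assumptions made for illustration, and how the correspondence weights are actually predicted is not covered here.

```python
import numpy as np


def weighted_svd_pose(src, dst, weights):
    """Return (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points (hypothetical inputs).
    weights:  (N,) non-negative correspondence confidences.
    """
    w = weights / weights.sum()                 # normalize weights to sum to 1
    src_c = (w[:, None] * src).sum(axis=0)      # weighted centroid of src
    dst_c = (w[:, None] * dst).sum(axis=0)      # weighted centroid of dst
    # Weighted 3x3 cross-covariance between the centered point sets.
    H = (src - src_c).T @ (w[:, None] * (dst - dst_c))
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Synthetic check: recover a known 30-degree rotation about z plus a translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
a = np.deg2rad(30.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
dst = src @ R_true.T + t_true
weights = rng.uniform(0.1, 1.0, size=50)
R_est, t_est = weighted_svd_pose(src, dst, weights)
```

With noise-free correspondences the estimate matches the ground-truth pose up to numerical precision; with noisy matches, low weights reduce the influence of unreliable pairs on the solution.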
Pose estimation results on LM-O, TUD-L, and YCB-V under the view-pair setting of UNOPose. Object masks are obtained by SAM. We report the BOP average recall (AR) per dataset, its mean over the three datasets, and the average time per image.
| Method | Modality | Reference | LM-O | TUD-L | YCB-V | Mean | Time (s) |
|---|---|---|---|---|---|---|---|
| FPFH+RANSAC | D | Image | 31.0 | 31.0 | 50.0 | 37.3 | 6.38 |
| FPFH+MAC | D | Image | 22.5 | 22.1 | 49.6 | 31.4 | 136.94 |
| PPF | D | Image | 29.7 | 14.8 | 38.3 | 27.6 | 11.79 |
| PPF_3D_ICP | D | Image | 44.7 | 29.1 | 66.8 | 46.9 | 14.27 |
| FCGF+RANSAC | D | Image | 38.9 | 59.0 | 57.6 | 51.8 | 10.96 |
| FCGF+MAC | D | Image | 33.9 | 48.3 | 51.0 | 44.4 | 60.53 |
| UTOPIC | D | Image | 13.7 | 35.4 | 10.5 | 19.9 | 4.00 |
| GeDi | D | Image | 42.8 | 67.3 | 60.6 | 56.9 | 48.89 |
| FreeZe | RGB-D | Image | 45.5 | 68.3 | 65.5 | 59.8 | 52.96 |
| SAM-6D | RGB-D | Posed Image | 54.5 | 29.7 | 68.1 | 50.8 | 4.21 |
| SinRef-6D | RGB-D | Posed Image | 51.1 | 34.0 | 73.9 | 53.0 | 4.231 |
| CoordAR | RGB-D | Image | 46.7 | 52.1 | 67.3 | 55.4 | 6.063 |
| UNOPose | RGB-D | Image | 58.7 | 71.0 | 83.1 | 70.9 | 4.179 |
| COG | RGB-D | Image | 60.8 | 80.0 | 80.5 | 73.8 | 4.324 |
| Ours | RGB-D | Image | 61.2 | 82.0 | 86.7 | 76.6 | 4.232 |
Pose estimation results on Real275 and Toyota-Light under the view-pair protocol of Oryon and Horyon, where ground-truth object masks are used. We compare RGB and RGB-D methods that use only a single reference view.
| Method | Modality | Real275 AR | Real275 ADD(-S) | Toyota-Light AR | Toyota-Light ADD(-S) |
|---|---|---|---|---|---|
| PoseDiffusion | RGB | 9.2 | 0.8 | 7.8 | 1.2 |
| RelPose++ | RGB | 22.8 | 11.9 | 30.9 | 11.6 |
| LatentFusion | RGB | 22.6 | 9.6 | 28.2 | 10.2 |
| SIFT | RGB-D | 34.1 | 16.4 | 30.3 | 14.1 |
| ObjectMatch | RGB-D | 26.0 | 13.4 | 9.8 | 5.4 |
| Oryon | RGB-D | 46.5 | 34.9 | 34.1 | 22.9 |
| One2Any | RGB-D | 54.9 | 41.0 | 42.0 | 34.6 |
| Horyon | RGB-D | 57.9 | 51.6 | 33.0 | 25.1 |
| Any6D | RGB-D | 51.0 | 53.5 | 43.3 | 32.2 |
| UNOPose | RGB-D | 77.9 | 84.4 | 74.9 | 73.2 |
| SinRef-6D | RGB-D | 74.4 | 81.1 | 66.7 | 67.1 |
| ConceptPose | RGB-D | 60.4 | 71.5 | 51.6 | 55.0 |
| CoordAR | RGB-D | 71.0 | 82.2 | 62.5 | 82.6 |
| Ours | RGB-D | 78.1 | 86.2 | 78.4 | 80.0 |
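For readers unfamiliar with the ADD(-S) columns, the following NumPy sketch shows the standard definitions of the two underlying errors: ADD averages the distance between corresponding model points under the ground-truth and estimated poses, while ADD-S (used for symmetric objects) averages the distance to the closest point instead. The function names and the synthetic point cloud are illustrative assumptions, and the recall threshold used for the table (commonly 10% of the object diameter) is not stated in this document.

```python
import numpy as np


def add_error(pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding transformed model points."""
    gt = pts @ R_gt.T + t_gt
    est = pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()


def adds_error(pts, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to its nearest estimated point."""
    gt = pts @ R_gt.T + t_gt
    est = pts @ R_est.T + t_est
    # Pairwise distances (N, N); fine for small N, use a KD-tree for large models.
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return dists.min(axis=1).mean()


# Hypothetical model points and a pose that is off by 1 cm along x.
rng = np.random.default_rng(0)
pts = rng.uniform(-0.05, 0.05, size=(200, 3))   # metres, illustrative only
R_id = np.eye(3)
t_gt = np.zeros(3)
t_est = np.array([0.01, 0.0, 0.0])

add_val = add_error(pts, R_id, t_gt, R_id, t_est)
adds_val = adds_error(pts, R_id, t_gt, R_id, t_est)
```

Because ADD-S takes the minimum over all candidate points, it is never larger than ADD for the same pair of poses.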
Pose estimation results on LINEMOD. The upper block reports multi-reference methods, and the lower block reports single-reference methods. For all single-reference image methods, the first view is used as the reference.
| Method | Modality | Ref. Images | ape | benchvise | cam | can | cat | driller | duck | eggbox | glue | holepuncher | iron | lamp | phone | LINEMOD Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OnePose | RGB | 200 | 11.8 | 92.6 | 88.1 | 77.2 | 47.9 | 74.5 | 34.2 | 71.3 | 37.5 | 54.9 | 89.2 | 87.6 | 60.6 | 63.6 |
| OnePose++ | RGB | 200 | 31.2 | 97.3 | 88.0 | 89.8 | 70.4 | 92.5 | 42.3 | 99.7 | 48.0 | 69.7 | 97.4 | 97.8 | 76.0 | 76.9 |
| LatentFusion | RGB-D | 16 | 88.0 | 92.4 | 74.4 | 88.8 | 94.5 | 91.7 | 68.1 | 96.3 | 49.4 | 82.1 | 74.6 | 94.7 | 91.5 | 83.6 |
| FS6D + ICP | RGB-D | 16 | 78.0 | 88.5 | 91.0 | 89.5 | 97.5 | 92.0 | 75.5 | 99.5 | 99.5 | 96.0 | 87.5 | 97.0 | 97.5 | 91.5 |
| FoundationPose | RGB-D | 1-CAD | 36.5 | 55.5 | 84.2 | 71.7 | 65.3 | 16.3 | 49.8 | 42.6 | 64.8 | 52.7 | 20.7 | 15.8 | 51.7 | 48.3 |
| NOPE | RGB | 1 + GT trans | 2.0 | 4.5 | 2.5 | 2.2 | 0.7 | 4.7 | 0.5 | 100.0 | 79.4 | 2.9 | 4.5 | 4.2 | 3.9 | 16.3 |
| Oryon | RGB-D | 1 | 1.2 | 1.3 | 3.9 | 0.8 | 12.7 | 8.5 | 0.8 | 63.2 | 18.4 | 1.6 | 0.6 | 2.9 | 11.7 | 9.8 |
| One2Any | RGB-D | 1 | 33.1 | 15.7 | 72.7 | 37.0 | 66.2 | 68.2 | 35.8 | 100.0 | 99.9 | 42.0 | 28.2 | 31.9 | 53.2 | 52.6 |
| UNOPose | RGB-D | 1 | 44.6 | 54.9 | 80.2 | 47.1 | 80.7 | 89.4 | 45.2 | 99.2 | 97.2 | 75.3 | 51.8 | 64.0 | 76.6 | 69.7 |
| CoordAR | RGB-D | 1 | 45.6 | 76.9 | 70.7 | 77.3 | 88.1 | 96.5 | 50.2 | 97.0 | 99.8 | 67.5 | 52.7 | 91.4 | 61.2 | 75.0 |
| SinRef-6D | RGB-D | 1 | 49.4 | 82.1 | 63.2 | 58.1 | 88.3 | 76.9 | 53.7 | 99.7 | 83.5 | 46.4 | 86.5 | 99.6 | 85.8 | 74.9 |
| Ours | RGB-D | 1 | 43.0 | 92.6 | 82.8 | 81.4 | 87.5 | 78.9 | 63.2 | 99.7 | 75.4 | 77.4 | 66.1 | 72.6 | 85.9 | 77.4 |
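This ablation shows that the improvement is not simply due to a stronger backbone: with DINOv2, the full method (A1, 86.0 AR) already outperforms the DINOv3 baseline without any proposed component (B0, 84.2 AR). CVSI provides the main gain (B0 → B1: 84.2 → 85.7 AR, compared with 84.7 AR for RAGC alone in B2), while RAGC and IVSP further stabilize and regularize the learned point features, reaching 86.7 AR in the best configuration (B4, marked with *).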
| Row | Backbone | CVSI | RAGC | IVSP | VSD | MSSD | MSPD | AR_BOP |
|---|---|---|---|---|---|---|---|---|
| A0 | DINOv2 | ✗ | ✗ | ✗ | 82.6 | 87.4 | 79.4 | 83.1 |
| A1 | DINOv2 | ✓ | ✓ | ✓ | 83.0 | 90.4 | 84.5 | 86.0 |
| B0 | DINOv3 | ✗ | ✗ | ✗ | 82.4 | 88.4 | 81.7 | 84.2 |
| B1 | DINOv3 | ✓ | ✗ | ✗ | 82.9 | 90.1 | 84.0 | 85.7 |
| B2 | DINOv3 | ✗ | ✓ | ✗ | 83.0 | 89.0 | 82.2 | 84.7 |
| B3 | DINOv3 | ✓ | ✓ | ✓ | 83.5 | 90.7 | 84.5 | 86.2 |
| B4 | DINOv3 | ✓ | ✓ | ✓ | 83.9* | 91.1* | 85.1* | 86.7* |
| B5 | DINOv3 | ✓ | ✓ | ✗ | 82.9 | 90.2 | 84.1 | 85.7 |
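In the BOP benchmark, the overall AR is the mean of the three per-metric recalls (VSD, MSSD, MSPD), and the AR_BOP column above is consistent with that. The short Python check below recomputes it for three rows copied from the table; the helper name `ar_bop` is a hypothetical choice for illustration.

```python
def ar_bop(vsd, mssd, mspd):
    """Overall BOP average recall: mean of the VSD, MSSD, and MSPD recalls."""
    return (vsd + mssd + mspd) / 3.0


# (VSD, MSSD, MSPD) values copied from rows A0, B0, and B4 of the table above.
rows = {
    "A0": (82.6, 87.4, 79.4),
    "B0": (82.4, 88.4, 81.7),
    "B4": (83.9, 91.1, 85.1),
}

recomputed = {name: round(ar_bop(*scores), 1) for name, scores in rows.items()}
for name, value in recomputed.items():
    print(name, value)
```

The recomputed values agree with the reported AR_BOP entries (83.1, 84.2, and 86.7) after rounding to one decimal place.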
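This ablation shows that stronger geometry alone is not enough. The best performance is achieved when the geometric decoder is paired with the cross-view semantic prior, rather than by simply making the decoder deeper. For example, the 3-layer decoder with the CVSI prior (E1, 86.7 AR) outperforms the 6-layer decoder without it (H1, 84.5 AR) while using fewer trainable parameters (28.88M vs. 29.68M) and fewer GFLOPs (249.02 vs. 316.46).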
| Row | Geo. Dec. Layers | CVSI Prior | VSD | MSSD | MSPD | AR_BOP | Train Params (M) | GFLOPs | Runtime (s) |
|---|---|---|---|---|---|---|---|---|---|
| C0 | 1 | ✓ | 81.9 | 89.7 | 82.3 | 84.6 | 23.35 | 193.79 | 0.625 |
| D0 | 2 | ✓ | 82.5 | 89.7 | 83.0 | 85.1 | 26.12 | 221.41 | 0.662 |
| E0 | 3 | ✗ | 82.4 | 88.4 | 81.7 | 84.2 | 21.37 | 233.61 | 0.686 |
| E1 | 3 | ✓ | 83.9 | 91.1 | 85.1 | 86.7 | 28.88 | 249.02 | 0.711 |
| F0 | 4 | ✗ | 82.5 | 88.5 | 82.0 | 84.3 | 24.14 | 261.23 | 0.721 |
| F1 | 4 | ✓ | 83.3 | 90.7 | 84.8 | 86.3 | 31.65 | 276.64 | 0.789 |
| G0 | 5 | ✗ | 82.6 | 88.8 | 82.4 | 84.6 | 26.91 | 288.85 | 0.777 |
| G1 | 5 | ✓ | 83.8 | 90.8 | 84.9 | 86.5 | 34.42 | 304.26 | 0.794 |
| H1 | 6 | ✗ | 82.6 | 88.5 | 82.4 | 84.5 | 29.68 | 316.46 | 0.803 |
This study shows that the proposed cross-view semantic prior is especially helpful when the viewpoint gap becomes large. The gain increases most clearly in the 70°–90° range, where valid overlap is sparse and ambiguous.
If you find our work useful for your research, please consider citing:
@misc{chen2026learningcrossviewsemanticpriors,
title={Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation},
author={Jiahong Chen and Jinghao Wang and Ziwen Wang and Zi Wang and Banglei Guan and Qifeng Yu},
year={2026},
eprint={2606.22076},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.22076}
}