Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation

Jiahong Chen, Jinghao Wang, Ziwen Wang, Zi Wang, Banglei Guan, Qifeng Yu
College of Aerospace Science and Engineering, National University of Defense Technology
Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation

Motivation

Existing single-reference correspondence pipelines usually process query and reference semantic features in parallel, without exchanging information across views, so the decoder relies mainly on sparse intra-view cues before point matching. This leaves correspondence estimation vulnerable to low-overlap views and mask noise. Our goal is to inject a cross-view semantic prior into the decoding and matching process.

Challenge

  • Directly introducing cross-view semantic interaction may disturb the original VFM token structure.
  • The interacted features still require explicit 3D consistency to support rigid correspondence estimation.

Teaser figure: sample with large viewpoint changes
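
The two challenges above can be made concrete with a small NumPy sketch. It is illustrative only and is not the authors' implementation: `cross_view_interaction` is a hypothetical single-head cross-attention with no learned projections, and `affinity_preservation_loss` is one assumed way to penalize changes in the intra-view token affinity (here, cosine similarity between tokens of the same view). The token counts and feature size are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_interaction(query_tokens, ref_tokens):
    """Hypothetical cross-view step: each query token attends to all reference tokens."""
    d = query_tokens.shape[-1]
    attn = softmax(query_tokens @ ref_tokens.T / np.sqrt(d), axis=-1)  # (Nq, Nr)
    # Residual connection: original token plus attended reference context.
    return query_tokens + attn @ ref_tokens, attn

def cosine_affinity(tokens, eps=1e-8):
    # Pairwise cosine similarity between tokens of one view, shape (N, N).
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + eps)
    return unit @ unit.T

def affinity_preservation_loss(tokens_before, tokens_after):
    """Assumed structure-preservation penalty: mean squared affinity change."""
    diff = cosine_affinity(tokens_before) - cosine_affinity(tokens_after)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
query = rng.normal(size=(6, 16))   # 6 query tokens, 16-dim features (arbitrary sizes)
ref = rng.normal(size=(8, 16))     # 8 reference tokens

query_mixed, attn = cross_view_interaction(query, ref)
loss = affinity_preservation_loss(query, query_mixed)
print(attn.shape, loss)
```

The penalty is zero when the interaction leaves the pairwise token similarities unchanged and grows as the interaction reshapes them, which is the kind of disturbance the first bullet describes.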

Abstract

Single-reference unseen object 6D pose estimation reduces object onboarding effort by estimating poses of arbitrary novel objects from only one reference view. Recent correspondence-based pipelines have achieved robust performance with vision foundation model (VFM) features. However, they typically treat these features as intra-view descriptors, so dense visual-semantic cues, including appearance, structure, and context, are insufficiently exchanged across views before geometric decoding. Consequently, the decoded point features may lack joint semantic and geometric discriminability, and correspondence estimation remains difficult in challenging cases. Instead of processing the features of each view independently, we build the correspondence pipeline around an early cross-view semantic prior. Specifically, cross-view semantic interaction (CVSI) lets dense query and reference VFM tokens exchange semantic context and form a cross-view prior. However, direct CVSI may disturb the VFM token structure, and the resulting semantic prior still needs 3D representation consistency for rigid correspondence. To make this CVSI prior reliable for 3D correspondence learning, we introduce two complementary training-time constraints: the intra-view structure preservation (IVSP) loss preserves the original intra-view token affinity structure during interaction, while the reference-anchored geometric consistency (RAGC) loss enforces spatial representation consistency of decoded point features. The final pose is recovered from the learned correspondences through weighted SVD. We further construct a challenging view-pair protocol from the BOP Challenge datasets YCB-V and TUD-L to evaluate robustness in difficult matching scenarios. Extensive experiments on six benchmarks under different view-pair settings show that our method achieves state-of-the-art performance while keeping inference speed comparable to recent single-reference methods.
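
The weighted SVD step mentioned in the abstract can be sketched with the standard weighted Kabsch solution. This is a minimal NumPy sketch under stated assumptions, not the authors' code: the function name `weighted_svd_pose`, the reference-to-query direction, and the synthetic data are all illustrative. It assumes matched 3D points `src[i]` and `dst[i]` with a non-negative confidence `weights[i]` per correspondence.

```python
import numpy as np

def weighted_svd_pose(src, dst, weights):
    """Return R (3x3), t (3,) minimizing sum_i w_i * ||R @ src[i] + t - dst[i]||^2."""
    w = weights / weights.sum()              # normalize confidences
    src_centroid = w @ src                   # weighted centroids
    dst_centroid = w @ dst
    src_c = src - src_centroid
    dst_c = dst - dst_centroid
    H = (src_c * w[:, None]).T @ dst_c       # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t

def rotation_from_axis_angle(axis, angle_rad):
    # Rodrigues' formula, used here only to build a test rotation.
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle_rad) * K + (1.0 - np.cos(angle_rad)) * (K @ K)

rng = np.random.default_rng(0)
R_true = rotation_from_axis_angle([1.0, 2.0, 3.0], np.deg2rad(75.0))
t_true = np.array([0.1, -0.2, 0.5])

ref_pts = rng.normal(size=(50, 3))           # synthetic reference points
query_pts = ref_pts @ R_true.T + t_true      # their positions in the query frame
query_pts[0] += 5.0                          # corrupt one correspondence

weights = np.ones(50)
weights[0] = 0.0                             # a zero weight removes the bad match

R_est, t_est = weighted_svd_pose(ref_pts, query_pts, weights)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

The weights are what let a learned matcher suppress unreliable correspondences: in this toy case the corrupted pair has zero weight, so the remaining noise-free pairs determine the pose.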

Pipeline

Pipeline overview

Qualitative Results

LM-O, TUD-L, and YCB-V

Qualitative comparison on LM-O, TUD-L, and YCB-V

Real275, Toyota-Light, and LINEMOD

Qualitative comparison on Real275, Toyota-Light, and LINEMOD

Challenging View Pair Protocol

Challenging view-pair qualitative comparison

Cross-View Interaction Visualization

Visualization of attention maps in cross-view interaction

Quantitative Results

LM-O, TUD-L, and YCB-V

Pose estimation results on LM-O, TUD-L, and YCB-V under the view-pair setting of UNOPose. Object masks are obtained with SAM. We report the BOP average recall (AR) and the average time per image.

Method Modality Reference LM-O TUD-L YCB-V Mean Time
FPFH+RANSAC D Image 31.0 31.0 50.0 37.3 6.38
FPFH+MAC D Image 22.5 22.1 49.6 31.4 136.94
PPF D Image 29.7 14.8 38.3 27.6 11.79
PPF_3D_ICP D Image 44.7 29.1 66.8 46.9 14.27
FCGF+RANSAC D Image 38.9 59.0 57.6 51.8 10.96
FCGF+MAC D Image 33.9 48.3 51.0 44.4 60.53
UTOPIC D Image 13.7 35.4 10.5 19.9 4.00
GeDi D Image 42.8 67.3 60.6 56.9 48.89
FreeZe RGB-D Image 45.5 68.3 65.5 59.8 52.96
SAM-6D RGB-D Posed Image 54.5 29.7 68.1 50.8 4.21
SinRef-6D RGB-D Posed Image 51.1 34.0 73.9 53.0 4.231
CoordAR RGB-D Image 46.7 52.1 67.3 55.4 6.063
UNOPose RGB-D Image 58.7 71.0 83.1 70.9 4.179
COG RGB-D Image 60.8 80.0 80.5 73.8 4.324
Ours RGB-D Image 61.2 82.0 86.7 76.6 4.232

Real275 and Toyota-Light

Pose estimation results on Real275 and Toyota-Light under the view-pair protocol of Oryon and Horyon, where ground truth object masks are used. We compare RGB and RGB-D methods with only a single reference view.

Method Modality Real275 AR Real275 ADD(-S) Toyota-Light AR Toyota-Light ADD(-S)
PoseDiffusion RGB 9.2 0.8 7.8 1.2
RelPose++ RGB 22.8 11.9 30.9 11.6
LatentFusion RGB 22.6 9.6 28.2 10.2
SIFT RGB-D 34.1 16.4 30.3 14.1
ObjectMatch RGB-D 26.0 13.4 9.8 5.4
Oryon RGB-D 46.5 34.9 34.1 22.9
One2Any RGB-D 54.9 41.0 42.0 34.6
Horyon RGB-D 57.9 51.6 33.0 25.1
Any6D RGB-D 51.0 53.5 43.3 32.2
UNOPose RGB-D 77.9 84.4 74.9 73.2
SinRef-6D RGB-D 74.4 81.1 66.7 67.1
ConceptPose RGB-D 60.4 71.5 51.6 55.0
CoordAR RGB-D 71.0 82.2 62.5 82.6
Ours RGB-D 78.1 86.2 78.4 80.0
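
For readers unfamiliar with the ADD(-S) columns, the underlying per-pose errors can be sketched as follows. This is a minimal NumPy sketch of the usual definitions: ADD averages distances between corresponding model points under the ground-truth and estimated poses, while ADD-S uses the nearest estimated point instead and is intended for symmetric objects. The function names and the toy model are assumptions. A common convention counts a pose as correct when the error is below 10% of the object diameter; the exact threshold behind the table is not stated on this page.

```python
import numpy as np

def add_error(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding transformed model points."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    return float(np.linalg.norm(gt - est, axis=1).mean())

def adds_error(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each GT point to its nearest estimated point."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    # Brute-force pairwise distances, fine for small point sets.
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

# Toy model that is symmetric under a 180-degree rotation about the z-axis.
model = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
identity = np.eye(3)
rot_z_180 = np.diag([-1.0, -1.0, 1.0])
zero_t = np.zeros(3)

add_sym = add_error(model, identity, zero_t, rot_z_180, zero_t)
adds_sym = adds_error(model, identity, zero_t, rot_z_180, zero_t)
print(add_sym, adds_sym)
```

In the toy case the flipped pose is visually indistinguishable from the ground truth, so ADD-S reports no error while ADD penalizes it heavily.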

LINEMOD

Pose estimation results on LINEMOD. The upper block reports multi-reference methods, and the lower block reports single-reference methods. For all single-reference image methods, the first view is used as the reference.

Method Modality Ref. Images ape benchvise cam can cat driller duck eggbox glue holepuncher iron lamp phone LINEMOD Mean
OnePose RGB 200 11.8 92.6 88.1 77.2 47.9 74.5 34.2 71.3 37.5 54.9 89.2 87.6 60.6 63.6
OnePose++ RGB 200 31.2 97.3 88.0 89.8 70.4 92.5 42.3 99.7 48.0 69.7 97.4 97.8 76.0 76.9
LatentFusion RGB-D 16 88.0 92.4 74.4 88.8 94.5 91.7 68.1 96.3 49.4 82.1 74.6 94.7 91.5 83.6
FS6D + ICP RGB-D 16 78.0 88.5 91.0 89.5 97.5 92.0 75.5 99.5 99.5 96.0 87.5 97.0 97.5 91.5
FoundationPose RGB-D 1-CAD 36.5 55.5 84.2 71.7 65.3 16.3 49.8 42.6 64.8 52.7 20.7 15.8 51.7 48.3
NOPE RGB 1 + GT trans 2.0 4.5 2.5 2.2 0.7 4.7 0.5 100.0 79.4 2.9 4.5 4.2 3.9 16.3
Oryon RGB-D 1 1.2 1.3 3.9 0.8 12.7 8.5 0.8 63.2 18.4 1.6 0.6 2.9 11.7 9.8
One2Any RGB-D 1 33.1 15.7 72.7 37.0 66.2 68.2 35.8 100.0 99.9 42.0 28.2 31.9 53.2 52.6
UNOPose RGB-D 1 44.6 54.9 80.2 47.1 80.7 89.4 45.2 99.2 97.2 75.3 51.8 64.0 76.6 69.7
CoordAR RGB-D 1 45.6 76.9 70.7 77.3 88.1 96.5 50.2 97.0 99.8 67.5 52.7 91.4 61.2 75.0
SinRef-6D RGB-D 1 49.4 82.1 63.2 58.1 88.3 76.9 53.7 99.7 83.5 46.4 86.5 99.6 85.8 74.9
Ours RGB-D 1 43.0 92.6 82.8 81.4 87.5 78.9 63.2 99.7 75.4 77.4 66.1 72.6 85.9 77.4

Ablation Studies

Main Components on YCB-V

This study shows that the improvement is not simply due to a stronger backbone. CVSI provides the main gain, while RAGC and IVSP further stabilize and regularize the learned point features.

Row Backbone CVSI RAGC IVSP VSD MSSD MSPD AR_BOP
A0 DINOv2 ✗ ✗ ✗ 82.6 87.4 79.4 83.1
A1 DINOv2 ✓ ✓ ✓ 83.0 90.4 84.5 86.0
B0 DINOv3 ✗ ✗ ✗ 82.4 88.4 81.7 84.2
B1 DINOv3 ✓ ✗ ✗ 82.9 90.1 84.0 85.7
B2 DINOv3 ✗ ✓ ✗ 83.0 89.0 82.2 84.7
B3 DINOv3 ✓ ✓ ✓ 83.5 90.7 84.5 86.2
B4 DINOv3 ✓ ✓ ✓ 83.9* 91.1* 85.1* 86.7*
B5 DINOv3 ✓ ✓ ✗ 82.9 90.2 84.1 85.7
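
In the BOP evaluation protocol, the overall average recall is the mean of the recalls under the three pose-error functions VSD, MSSD, and MSPD. The short Python check below applies that rule to a few rows of the table; the helper name `bop_average_recall` is illustrative, and the values are copied from the table.

```python
def bop_average_recall(vsd, mssd, mspd):
    """BOP AR: mean of the per-metric average recalls."""
    return (vsd + mssd + mspd) / 3.0

# (VSD, MSSD, MSPD, reported AR_BOP) copied from rows A0, B0, and B4.
rows = {
    "A0": (82.6, 87.4, 79.4, 83.1),
    "B0": (82.4, 88.4, 81.7, 84.2),
    "B4": (83.9, 91.1, 85.1, 86.7),
}

recomputed = {name: round(bop_average_recall(v, s, p), 1)
              for name, (v, s, p, _) in rows.items()}
for name, (_, _, _, reported) in rows.items():
    print(name, recomputed[name], reported)
```

The recomputed values agree with the reported AR_BOP column to one decimal place.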

Geometric Decoder Depth

This study shows that stronger geometry alone is not enough. The best performance is achieved when the geometric decoder is paired with the cross-view semantic prior, rather than simply making the decoder deeper.

Row Geo. Dec. Layers CVSI Prior VSD MSSD MSPD AR_BOP Train Params (M) GFLOPs Runtime (s)
C0 1 ✓ 81.9 89.7 82.3 84.6 23.35 193.79 0.625
D0 2 ✓ 82.5 89.7 83.0 85.1 26.12 221.41 0.662
E0 3 ✗ 82.4 88.4 81.7 84.2 21.37 233.61 0.686
E1 3 ✓ 83.9 91.1 85.1 86.7 28.88 249.02 0.711
F0 4 ✗ 82.5 88.5 82.0 84.3 24.14 261.23 0.721
F1 4 ✓ 83.3 90.7 84.8 86.3 31.65 276.64 0.789
G0 5 ✗ 82.6 88.8 82.4 84.6 26.91 288.85 0.777
G1 5 ✓ 83.8 90.8 84.9 86.5 34.42 304.26 0.794
H1 6 ✗ 82.6 88.5 82.4 84.5 29.68 316.46 0.803

Reference Viewpoint Gap

This study shows that the proposed cross-view semantic prior is especially helpful when the viewpoint gap becomes large. The gain increases most clearly in the 70°–90° range, where valid overlap is sparse and ambiguous.

Effect of reference viewpoint gap on pose estimation
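
One way to quantify a viewpoint gap between the reference and query views is the geodesic angle of their relative rotation. The sketch below assumes that definition; the page does not state how the gap is measured, so the function `rotation_gap_deg` and the choice of metric are assumptions for illustration.

```python
import numpy as np

def rotation_gap_deg(R_ref, R_query):
    """Geodesic angle (degrees) between two rotation matrices."""
    R_rel = R_ref.T @ R_query
    # trace(R) = 1 + 2*cos(angle); clip guards against rounding outside [-1, 1].
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def rot_z(angle_deg):
    # Rotation about the z-axis, used only to build example poses.
    a = np.radians(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

gap = rotation_gap_deg(rot_z(10.0), rot_z(90.0))
in_hard_range = 70.0 <= gap <= 90.0   # the range highlighted in the text above
print(gap, in_hard_range)
```

Binning view pairs by such an angle is one plausible way to produce a gap-versus-accuracy curve like the one in the figure.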

Citation

If you find our work useful for your research, please consider citing:

@misc{chen2026learningcrossviewsemanticpriors,
  title={Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation},
  author={Jiahong Chen and Jinghao Wang and Ziwen Wang and Zi Wang and Banglei Guan and Qifeng Yu},
  year={2026},
  eprint={2606.22076},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.22076}
}