Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation

Abstract

Single-reference unseen object 6D pose estimation reduces the object onboarding effort by estimating the poses of arbitrary novel objects from only one reference view. Recent correspondence-based pipelines achieve robust performance with vision foundation model (VFM) features. However, they typically treat these features as intra-view descriptors, so dense visual-semantic cues, including appearance, structure, and context, are insufficiently exchanged across views before geometric decoding. Consequently, the decoded point features may lack joint semantic and geometric discriminability, and correspondence estimation remains difficult in challenging cases. Instead of processing the features of each view independently, we build the correspondence pipeline around an early cross-view semantic prior. Specifically, cross-view semantic interaction (CVSI) lets dense query and reference VFM tokens exchange semantic context and form a cross-view prior. However, direct CVSI may disturb the VFM token structure, and the resulting semantic prior still needs 3D representation consistency for rigid correspondence. To make the CVSI prior reliable for 3D correspondence learning, we introduce two complementary training-time constraints: the intra-view structure preservation (IVSP) loss preserves the original intra-view token affinity structure during interaction, while the reference-anchored geometric consistency (RAGC) loss enforces spatial representation consistency of the decoded point features. The final pose is recovered from the learned correspondences through weighted SVD. We further construct a challenging view-pair protocol from the BOP Challenge datasets YCB-V and TUD-L to evaluate robustness in difficult matching scenarios. Extensive experiments on six benchmarks under different view-pair settings show that our method achieves state-of-the-art performance while maintaining comparable inference speed.
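The abstract states that the final pose is recovered from learned correspondences through weighted SVD. The sketch below shows the standard weighted Kabsch solution for that step in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name `weighted_svd_pose`, the synthetic points, and the random weights are all hypothetical, and the real pipeline would take its correspondences and confidence weights from the network.

```python
import numpy as np


def weighted_svd_pose(src, dst, weights):
    """Return (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    weights:  (N,) non-negative correspondence confidences (assumed input).
    """
    w = weights / weights.sum()              # normalize weights to sum to 1
    src_centroid = w @ src                   # weighted centroids, shape (3,)
    dst_centroid = w @ dst
    src_c = src - src_centroid               # centered point sets
    dst_c = dst - dst_centroid
    H = (src_c * w[:, None]).T @ dst_c       # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t


# Synthetic check: build a known pose and recover it from noiseless matches.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:                     # make Q a proper rotation
    Q[:, 0] *= -1
R_true, t_true = Q, np.array([0.1, -0.2, 0.5])

src = rng.normal(size=(50, 3))
dst = src @ R_true.T + t_true
weights = rng.uniform(0.1, 1.0, size=50)     # hypothetical confidences

R_est, t_est = weighted_svd_pose(src, dst, weights)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

With noiseless correspondences the weights do not change the solution; they matter once some matches are unreliable, because low-weight pairs then contribute less to the centroids and to the cross-covariance.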

Pipeline
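As a rough illustration of the two ideas named in the abstract, the sketch below lets query tokens attend to reference tokens (one possible reading of CVSI) and then measures how much the intra-view token affinity changed (one possible reading of the IVSP loss). Everything here is an assumption made for illustration: the single-head attention without learned projections, the cosine-similarity affinity, the mean-squared penalty, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_attention(tokens_a, tokens_b):
    """Tokens of view A attend to tokens of view B, with a residual update.

    Simplification: no learned query/key/value projections are used.
    """
    dim = tokens_a.shape[-1]
    attn = softmax(tokens_a @ tokens_b.T / np.sqrt(dim), axis=-1)  # (Na, Nb)
    return tokens_a + attn @ tokens_b, attn


def affinity(tokens):
    """Cosine-similarity matrix between all tokens of a single view."""
    normed = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
    return normed @ normed.T


def structure_preservation_loss(tokens_before, tokens_after):
    """Mean squared change of the intra-view affinity (assumed loss form)."""
    diff = affinity(tokens_before) - affinity(tokens_after)
    return float(np.mean(diff ** 2))


# Random stand-ins for dense VFM tokens of the query and reference views.
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(196, 64))
reference_tokens = rng.normal(size=(196, 64))

query_updated, attn = cross_view_attention(query_tokens, reference_tokens)
loss = structure_preservation_loss(query_tokens, query_updated)
print(attn.shape, loss)
```

The penalty is zero when the interaction leaves the pairwise token similarities of a view unchanged and grows as the interaction distorts them, which is the kind of behavior the abstract attributes to the IVSP constraint.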

Qualitative Results

LM-O, TUD-L, and YCB-V

Real275, Toyota-Light, and LINEMOD

Challenging View Pair Protocol

Challenging view-pair qualitative comparison

Cross-View Interaction Visualization

Visualization of attention maps in cross-view interaction

Quantitative Results

LM-O, TUD-L, and YCB-V

Pose estimation results on LM-O, TUD-L, and YCB-V under the view-pair setting of UNOPose. Object masks are obtained by SAM. The AR of the BOP metric and the average time per image are reported.

Method	Modality	Reference	LM-O	TUD-L	YCB-V	Mean	Time
FPFH+RANSAC	D	Image	31.0	31.0	50.0	37.3	6.38
FPFH+MAC	D	Image	22.5	22.1	49.6	31.4	136.94
PPF	D	Image	29.7	14.8	38.3	27.6	11.79
PPF_3D_ICP	D	Image	44.7	29.1	66.8	46.9	14.27
FCGF+RANSAC	D	Image	38.9	59.0	57.6	51.8	10.96
FCGF+MAC	D	Image	33.9	48.3	51.0	44.4	60.53
UTOPIC	D	Image	13.7	35.4	10.5	19.9	4.00
GeDi	D	Image	42.8	67.3	60.6	56.9	48.89
FreeZe	RGB-D	Image	45.5	68.3	65.5	59.8	52.96
SAM-6D	RGB-D	Posed Image	54.5	29.7	68.1	50.8	4.21
SinRef-6D	RGB-D	Posed Image	51.1	34.0	73.9	53.0	4.231
CoordAR	RGB-D	Image	46.7	52.1	67.3	55.4	6.063
UNOPose	RGB-D	Image	58.7	71.0	83.1	70.9	4.179
COG	RGB-D	Image	60.8	80.0	80.5	73.8	4.324
Ours	RGB-D	Image	61.2	82.0	86.7	76.6	4.232

Real275 and Toyota-Light

Pose estimation results on Real275 and Toyota-Light under the view-pair protocol of Oryon and Horyon, where ground truth object masks are used. We compare RGB and RGB-D methods with only a single reference view.

Method	Modality	Real275 AR	Real275 ADD(-S)	Toyota-Light AR	Toyota-Light ADD(-S)
PoseDiffusion	RGB	9.2	0.8	7.8	1.2
RelPose++	RGB	22.8	11.9	30.9	11.6
LatentFusion	RGB	22.6	9.6	28.2	10.2
SIFT	RGB-D	34.1	16.4	30.3	14.1
ObjectMatch	RGB-D	26.0	13.4	9.8	5.4
Oryon	RGB-D	46.5	34.9	34.1	22.9
One2Any	RGB-D	54.9	41.0	42.0	34.6
Horyon	RGB-D	57.9	51.6	33.0	25.1
Any6D	RGB-D	51.0	53.5	43.3	32.2
UNOPose	RGB-D	77.9	84.4	74.9	73.2
SinRef-6D	RGB-D	74.4	81.1	66.7	67.1
ConceptPose	RGB-D	60.4	71.5	51.6	55.0
CoordAR	RGB-D	71.0	82.2	62.5	82.6
Ours	RGB-D	78.1	86.2	78.4	80.0

LINEMOD

Pose estimation results on LINEMOD. The upper block reports multi-reference methods, and the lower block reports single-reference methods. For all single-reference image methods, the first view is used as the reference.

Method	Modality	Ref. Images	ape	benchvise	cam	can	cat	driller	duck	eggbox	glue	holepuncher	iron	lamp	phone	LINEMOD Mean
OnePose	RGB	200	11.8	92.6	88.1	77.2	47.9	74.5	34.2	71.3	37.5	54.9	89.2	87.6	60.6	63.6
OnePose++	RGB	200	31.2	97.3	88.0	89.8	70.4	92.5	42.3	99.7	48.0	69.7	97.4	97.8	76.0	76.9
LatentFusion	RGB-D	16	88.0	92.4	74.4	88.8	94.5	91.7	68.1	96.3	49.4	82.1	74.6	94.7	91.5	83.6
FS6D + ICP	RGB-D	16	78.0	88.5	91.0	89.5	97.5	92.0	75.5	99.5	99.5	96.0	87.5	97.0	97.5	91.5
FoundationPose	RGB-D	1-CAD	36.5	55.5	84.2	71.7	65.3	16.3	49.8	42.6	64.8	52.7	20.7	15.8	51.7	48.3
NOPE	RGB	1 + GT trans	2.0	4.5	2.5	2.2	0.7	4.7	0.5	100.0	79.4	2.9	4.5	4.2	3.9	16.3
Oryon	RGB-D	1	1.2	1.3	3.9	0.8	12.7	8.5	0.8	63.2	18.4	1.6	0.6	2.9	11.7	9.8
One2Any	RGB-D	1	33.1	15.7	72.7	37.0	66.2	68.2	35.8	100.0	99.9	42.0	28.2	31.9	53.2	52.6
UNOPose	RGB-D	1	44.6	54.9	80.2	47.1	80.7	89.4	45.2	99.2	97.2	75.3	51.8	64.0	76.6	69.7
CoordAR	RGB-D	1	45.6	76.9	70.7	77.3	88.1	96.5	50.2	97.0	99.8	67.5	52.7	91.4	61.2	75.0
SinRef-6D	RGB-D	1	49.4	82.1	63.2	58.1	88.3	76.9	53.7	99.7	83.5	46.4	86.5	99.6	85.8	74.9
Ours	RGB-D	1	43.0	92.6	82.8	81.4	87.5	78.9	63.2	99.7	75.4	77.4	66.1	72.6	85.9	77.4

Ablation Studies

Main Components on YCB-V

This study shows that the improvement is not simply due to a stronger backbone. CVSI provides the main gain, while RAGC and IVSP further stabilize and regularize the learned point features.

Row	Backbone	CVSI	RAGC	IVSP	VSD	MSSD	MSPD	AR_BOP
A0	DINOv2	✗	✗	✗	82.6	87.4	79.4	83.1
A1	DINOv2	✓	✓	✓	83.0	90.4	84.5	86.0
B0	DINOv3	✗	✗	✗	82.4	88.4	81.7	84.2
B1	DINOv3	✓	✗	✗	82.9	90.1	84.0	85.7
B2	DINOv3	✗	✓	✗	83.0	89.0	82.2	84.7
B3	DINOv3	✓	✓	✓	83.5	90.7	84.5	86.2
B4	DINOv3	✓	✓	✓	83.9*	91.1*	85.1*	86.7*
B5	DINOv3	✓	✓	✗	82.9	90.2	84.1	85.7
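The AR_BOP column appears to follow the usual BOP convention of averaging the three per-metric recalls (VSD, MSSD, MSPD). The short check below recomputes that average from the table values and compares it with the reported column; the dictionary layout and the helper name `ar_bop` are only illustrative.

```python
# Row: (VSD, MSSD, MSPD, reported AR_BOP), copied from the table above.
rows = {
    "A0": (82.6, 87.4, 79.4, 83.1),
    "A1": (83.0, 90.4, 84.5, 86.0),
    "B0": (82.4, 88.4, 81.7, 84.2),
    "B1": (82.9, 90.1, 84.0, 85.7),
    "B2": (83.0, 89.0, 82.2, 84.7),
    "B3": (83.5, 90.7, 84.5, 86.2),
    "B4": (83.9, 91.1, 85.1, 86.7),
    "B5": (82.9, 90.2, 84.1, 85.7),
}


def ar_bop(vsd, mssd, mspd):
    """Average recall as the mean of the three BOP pose-error recalls."""
    return (vsd + mssd + mspd) / 3.0


# Largest gap between the recomputed mean and the reported value.
max_gap = max(abs(ar_bop(v, s, p) - reported)
              for v, s, p, reported in rows.values())

for name, (v, s, p, reported) in rows.items():
    print(f"{name}: recomputed {ar_bop(v, s, p):.2f}, reported {reported}")
```

Every recomputed mean agrees with the reported AR_BOP to within one-decimal rounding.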

Geometric Decoder Depth

This study shows that stronger geometry alone is not enough. The best performance is achieved when the geometric decoder is paired with the cross-view semantic prior, rather than simply making the decoder deeper.

Row	Geo. Dec. Layers	CVSI Prior	VSD	MSSD	MSPD	AR_BOP	Train Params (M)	GFLOPs	Runtime (s)
C0	1	✓	81.9	89.7	82.3	84.6	23.35	193.79	0.625
D0	2	✓	82.5	89.7	83.0	85.1	26.12	221.41	0.662
E0	3	✗	82.4	88.4	81.7	84.2	21.37	233.61	0.686
E1	3	✓	83.9	91.1	85.1	86.7	28.88	249.02	0.711
F0	4	✗	82.5	88.5	82.0	84.3	24.14	261.23	0.721
F1	4	✓	83.3	90.7	84.8	86.3	31.65	276.64	0.789
G0	5	✗	82.6	88.8	82.4	84.6	26.91	288.85	0.777
G1	5	✓	83.8	90.8	84.9	86.5	34.42	304.26	0.794
H1	6	✗	82.6	88.5	82.4	84.5	29.68	316.46	0.803
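To make the comparison concrete, the snippet below pairs the rows that share a decoder depth (3, 4, and 5 layers) and computes what the CVSI prior adds in accuracy and cost. The numbers are copied from the table; the tuple layout and variable names are only illustrative.

```python
# Row: (decoder layers, CVSI prior, AR_BOP, params in M, GFLOPs, runtime in s)
rows = {
    "E0": (3, False, 84.2, 21.37, 233.61, 0.686),
    "E1": (3, True, 86.7, 28.88, 249.02, 0.711),
    "F0": (4, False, 84.3, 24.14, 261.23, 0.721),
    "F1": (4, True, 86.3, 31.65, 276.64, 0.789),
    "G0": (5, False, 84.6, 26.91, 288.85, 0.777),
    "G1": (5, True, 86.5, 34.42, 304.26, 0.794),
    "H1": (6, False, 84.5, 29.68, 316.46, 0.803),
}

# Difference (with prior minus without prior) at each matched depth.
deltas = {}
for without, with_prior in [("E0", "E1"), ("F0", "F1"), ("G0", "G1")]:
    layers = rows[without][0]
    deltas[layers] = tuple(
        b - a for a, b in zip(rows[without][2:], rows[with_prior][2:])
    )

for layers, (d_ar, d_params, d_gflops, d_runtime) in deltas.items():
    print(f"{layers} layers: AR {d_ar:+.1f}, params {d_params:+.2f} M, "
          f"GFLOPs {d_gflops:+.2f}, runtime {d_runtime:+.3f} s")

# Three layers with the prior versus six layers without it.
ar_gap_vs_deepest = rows["E1"][2] - rows["H1"][2]
```

According to the table, the prior adds about 7.51 M trainable parameters and 15.41 GFLOPs at each matched depth, while the three-layer decoder with the prior (E1) scores higher than the six-layer decoder without it (H1) and uses fewer GFLOPs.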

Reference Viewpoint Gap

This study shows that the proposed cross-view semantic prior is especially helpful when the viewpoint gap becomes large. The gain increases most clearly in the 70°–90° range, where valid overlap is sparse and ambiguous.
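The page does not state how the viewpoint gap is measured. One common choice, assumed here, is the geodesic angle of the relative rotation between the reference and query object poses, grouped into fixed-width bins. The function names, the 10° bin width, and the example rotation below are hypothetical.

```python
import numpy as np


def rotation_angle_deg(R_ref, R_query):
    """Geodesic angle (degrees) of the relative rotation between two poses."""
    R_rel = R_query @ R_ref.T
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


def gap_bin_start(angle_deg, width=10.0):
    """Lower edge of the bin that contains the angle (assumed bin width)."""
    return float(np.floor(angle_deg / width) * width)


def rot_y(deg):
    """Rotation about the y-axis by the given angle in degrees."""
    a = np.radians(deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])


# Example: a query view rotated 75 degrees away from the reference view.
R_ref = np.eye(3)
R_query = rot_y(75.0)
angle = rotation_angle_deg(R_ref, R_query)
bin_start = gap_bin_start(angle)
print(f"gap = {angle:.1f} deg, bin starts at {bin_start:.0f} deg")
```

Under this assumed definition, in-plane rotation also counts toward the gap; a protocol that only cares about out-of-plane viewpoint change would need a different measure, such as the angle between the two viewing directions.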

Citation

If you find our work useful for your research, please consider citing:

@misc{chen2026learningcrossviewsemanticpriors,
  title={Learning Cross-View Semantic Priors for Single-Reference Unseen Object Pose Estimation},
  author={Jiahong Chen and Jinghao Wang and Ziwen Wang and Zi Wang and Banglei Guan and Qifeng Yu},
  year={2026},
  eprint={2606.22076},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.22076}
}
