[Paper Notes] SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation

SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation

| Method | Type | Training Input | Inference Input | Output | Pipeline |
| --- | --- | --- | --- | --- | --- |
| SecondPose | Category-level | RGBD | RGBD | Absolute \(\mathbf{R}, \mathbf{t}, \mathbf{s}\) | 2025.03.01: Uses DINOv2 to extract RGB features and HP-PPF to extract hierarchical point-cloud features, and also uses the raw RGB image. In addition, the method trains two networks: one predicts the rotation \(\mathbf{R}\), the other predicts \(\mathbf{t}\) and \(\mathbf{s}\). During training, the rotation network takes the ground-truth \(\mathbf{t}\) and \(\mathbf{s}\) from the dataset as input and predicts \(\mathbf{R}\); the \(\mathbf{t}\)/\(\mathbf{s}\) network takes the ground-truth \(\mathbf{R}\) as input and predicts \(\mathbf{t}\) and \(\mathbf{s}\). |

Abstract

1. Introduction

Figure 1. Categorical SE(3)-consistent features. We visualize our fused features by PCA. Colored points highlight the most corresponding parts, where our proposed feature achieves consistent alignment across instances (left vs. middle) and maintains consistency on the same instance under different poses (middle vs. right).
Figure 2. Illustration of SecondPose. Semantic features are extracted using the DINOv2 model (A), and the HP-PPF feature is computed on the point cloud (B). These features, combined with RGB values, are fused into our SECOND feature F_f (C) using stream-specific modules L_s, L_g, L_c, and a shared module L_f for concatenated features. The resulting fused features, in conjunction with the point cloud, are utilized for pose estimation (D).

To summarize, our main contributions are threefold:

  1. We present SecondPose, the first method to directly fuse object-specific hierarchical geometric features with semantic DINOv2 features for category-level pose estimation.
  2. Our SE(3)-consistent dual-stream feature fusion strategy yields a unified object representation that is robust under SE(3) transformations, better suited for downstream pose estimation.
  3. Extensive evaluation proves that our SE(3)-consistent fusion strategy significantly boosts pose estimation performance even under severe occlusion and clutter, enabling real-world applications.

3. Method

SecondPose aims to estimate the 9DoF pose of an object from a single RGB-D image. Concretely, given an RGB-D image capturing a target object from a set of known categories, the goal is to recover its full 9DoF object pose, consisting of the rotation \(R \in SO(3)\), the 3D translation \(t \in \mathbb{R}^3\), and the 3D metric size \(s \in \mathbb{R}^3\).

3.1. Overview

As shown in Figure 2, SecondPose consists of three main modules that predict the object pose from a single RGB-D input image:

  1. Extraction of the relevant geometric features \(F_g\) and semantic features \(F_s\);
  2. Dual-stream feature fusion to build an object representation \(F_f\) consistent with the special Euclidean group SE(3);
  3. Final pose regression from the extracted representation.

3.2. Semantic Category Prior From DINOv2

DINOv2 is an implicit rotation learner

We use DINOv2 as our image feature extractor. As shown in "A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence", DINOv2 extracts semantics-aware information from RGB images that can be used to establish zero-shot semantic correspondences, which makes it an excellent choice for extracting rich semantic information.

For estimating the 3D rotation, this additional semantics-aware information can significantly boost performance. As an example, imagine that in model space the \(z\)-axis usually points to the top of the object, the \(y\)-axis always points to its front, and the \(x\)-axis always points to its left. With the semantic information provided by DINOv2, the model can more easily identify the top, front, and left of the object, turning rotation estimation into a much simpler task. Moreover, DINOv2 features also carry global information about the object, including its category and pose, so such information serves as a good global prior for our method.

Deeper DINOv2 features

We use the features of the token facet from the last (11th) layer as our extracted semantic features. Essentially, "A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence" has shown that deeper features exhibit the best semantic-matching performance and thus provide higher consistency for semantic correspondences across different objects. In addition, deeper features carry more comprehensive semantic information. A visualization is shown in Figure 2.A.
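
As a concrete reference, the snippet below sketches how such last-layer patch features can be pulled from a published DINOv2 backbone. It is a minimal sketch, assuming the torch-hub entry points of the facebookresearch/dinov2 repository; the ViT-S/14 backbone, 224x224 input, and preprocessing are illustrative choices, not necessarily the paper's exact configuration.

```python
# Minimal sketch: last-layer DINOv2 patch features for an object crop.
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = T.Compose([
    T.Resize((224, 224)),  # must be a multiple of the 14-px patch size
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("crop.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    out = model.forward_features(img)

# (1, 256, 384) for ViT-S/14 at 224x224: one 384-d token per 14x14 patch.
# These per-patch tokens serve as the semantic features F_s, later sampled
# at the 2D projections of the back-projected points.
patch_tokens = out["x_norm_patchtokens"]
```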

Direct pose estimation from DINOv2

As mentioned above, an ad hoc fusion of DINOv2 features with the back-projected points has some drawbacks. First, DINOv2 extracts information only from the RGB image, so the geometric information it contains is limited. Second, since we use deeper DINOv2 features to obtain a more comprehensive representation, local detail is blurred to some extent. To complement the DINOv2 features in these respects, we therefore need to combine them with geometric features that carry local information, obtaining stronger descriptive power.

3.3. Hierarchical Geometric Features

Figure 3. Hierarchical panel-based geometric features. The inner panel contains points that are close to the point of interest, and outer panels contain points far from the point of interest.

This stream of the pipeline is shown in Figure 2.B. In this stream, our geometric embedding is based on computing pairwise, SE(3)-equivariant point pair features (PPFs). We construct an SE(3)-invariant coordinate representation by aggregating the PPFs between a point of interest and its neighboring points within multiple panels centered at that point. We hierarchically concatenate the SE(3)-invariant coordinate representations of the individual panels to strengthen the representational power of our geometric feature, HP-PPF. Figure 3.c shows a visualization of HP-PPF.

Point Pair Features (PPFs)

Figure 3.a shows a comprehensive example. Given an object point cloud denoted \(P\), we consider every pair of points \((p_i, p_j)\) with \(p_i, p_j \in P\). For each point, local normals \(n_i\) and \(n_j\) are computed at \(p_i\) and \(p_j\), respectively. The final pairwise feature between \(p_i\) and \(p_j\) is defined as:

\[ \begin{equation}\label{eq1} f_{i, j} = [d_{i, j}, \alpha_{i, j}, \beta_{i, j}, \theta_{i, j}], \end{equation} \]

where \(d_{i, j} = \Vert p_j - p_i\Vert\) is the Euclidean distance between \(p_i\) and \(p_j\); \(\alpha_{i, j} = \angle(n_i, p_j - p_i)\) is the angular deviation between the normal \(n_i\) at \(p_i\) and the vector from \(p_i\) to \(p_j\); \(\beta_{i, j} = \angle(n_j, p_j - p_i)\) is the angle between the normal \(n_j\) at \(p_j\) and the same vector from \(p_i\) to \(p_j\); and \(\theta_{i, j} = \angle(n_j, n_i)\) is the angle between the normals \(n_j\) and \(n_i\) at \(p_j\) and \(p_i\). Note that, owing to its locality, this descriptor is invariant under transformations of the special Euclidean group SE(3).
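
The descriptor of Eq. \(\eqref{eq1}\) is small enough to write out directly. Below is a minimal NumPy sketch; the `angle` helper and the input names are ours for illustration, and the normals are assumed to be precomputed (e.g., estimated from the depth map).

```python
# Minimal NumPy sketch of the PPF of Eq. (1).
import numpy as np

def angle(u, v, eps=1e-8):
    """Unsigned angle between two 3D vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def ppf(p_i, n_i, p_j, n_j):
    """f_ij = [d_ij, alpha_ij, beta_ij, theta_ij] as in Eq. (1)."""
    d = p_j - p_i
    return np.array([
        np.linalg.norm(d),  # d_ij: Euclidean distance between p_i and p_j
        angle(n_i, d),      # alpha_ij: angle between n_i and p_j - p_i
        angle(n_j, d),      # beta_ij:  angle between n_j and p_j - p_i
        angle(n_j, n_i),    # theta_ij: angle between the two normals
    ])
```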

Geometric Feature Panel

Building on PPFs, we propose panel-based PPFs to construct our geometric representation, which enlarges the perceptual range while retaining the advantage of locality. For each point \(p_i\) in the point cloud \(P\), there is a supporting panel \(\mathcal{S}^i \subseteq P\) with \(s_i = |\mathcal{S}^i|\) elements. For all \(p_j \in \mathcal{S}^i\), we compute the PPF \(f_{i, j}\) between \(p_i\) and \(p_j\) and obtain the local coordinate representation \(f_l^i\) of \(p_i\) by averaging:

\[ \begin{equation}\label{eq2} f_l^i = \frac{1}{s_i}(\sum_j d_{i, j}, \sum_j \alpha_{i, j}, \sum_j \beta_{i, j}, \sum_j \theta_{i, j}). \end{equation} \]

From Single to Hierarchical Panels

Although mean aggregation within a panel takes neighboring points into account, the inherently local representation limits its expressiveness, since the features contributed by the normals \(n_i\) and \(n_j\) are noisy when the perceptual range is restricted. Inspired by convolutional neural networks (CNNs), which extract hierarchical features from local to global, we hierarchically sample multiple panels from local to global, as shown in Figure 3.b. Specifically, for a point set \(P\) with \(|P|\) elements and integers \((k_0, k_1, k_2, \cdots, k_l)\) satisfying \(0 = k_0 < k_1 < k_2 < \cdots < k_l = |P| - 1\), for each \(p_i \in P\) we first sort its distances to all other points of \(P\) in ascending order:

\[ \begin{equation}\label{eq3} r_{i, j} = \mathrm{sort}(d_{i, j}) \end{equation} \]

and construct the supporting panels:

\[ \begin{equation}\label{eq4} \mathcal{S}^{i, m} = \{p_j \in P \,|\, k_{m - 1} < r_{i, j} \leq k_m\}, 1 \leq m \leq l, \end{equation} \]

where \(l\) is the number of panels used. We then compute, for each panel \(\mathcal{S}^{i, m}\), the corresponding pose-invariant coordinate representation \(f_l^{i, m}\) and concatenate them to obtain the point-wise geometric feature:

\[ \begin{equation}\label{eq5} f_g^i = f_l^{i, 1} \oplus f_l^{i, 2} \oplus \cdots \oplus f_l^{i, l}. \end{equation} \]

Hence, for small \(k\) the supporting panel consists of points close to the point of interest, while for large \(k\) it consists of points far from it. By concatenating features computed on panels of different scales, we exploit geometric features in a way that balances local geometric detail against global instance-level shape information. We show experimentally in Section 4 that our design outperforms common single-panel descriptors.
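
Putting Eqs. \(\eqref{eq2}\)–\(\eqref{eq5}\) together, a sketch of HP-PPF could look as follows. It reuses the `ppf` helper above; the panel boundaries `ks` are illustrative placeholders rather than the paper's values (they require `len(points) > ks[-1]`), and the \(O(n^2)\) loops favor clarity over speed.

```python
# Sketch of HP-PPF, Eqs. (2)-(5), reusing ppf() from the snippet above.
import numpy as np

def hp_ppf(points, normals, ks=(0, 8, 32, 127)):
    """Per-point hierarchical feature f_g^i, shape (N, 4 * (len(ks) - 1))."""
    n = len(points)
    feats = np.zeros((n, 4 * (len(ks) - 1)))
    for i in range(n):
        dists = np.linalg.norm(points - points[i], axis=1)
        order = np.argsort(dists)  # order[0] is p_i itself (rank 0)
        for m in range(1, len(ks)):
            # Panel S^{i,m} of Eq. (4): neighbors with rank in (k_{m-1}, k_m]
            idx = order[ks[m - 1] + 1 : ks[m] + 1]
            panel = np.stack([ppf(points[i], normals[i], points[j], normals[j])
                              for j in idx])
            # Eq. (2): mean-aggregate the PPFs inside the panel ...
            feats[i, 4 * (m - 1) : 4 * m] = panel.mean(axis=0)
    return feats  # ... and Eq. (5): panel features concatenated per point
```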

3.4. SE(3)-Consistent Feature Fusion

Fusion Strategy

As shown in Figure 2.C, we fuse the DINOv2 features, the geometric features, and the RGB values. Concretely, taking VI-Net as the example pose estimator, we first project each feature into its feature stream \(\mathcal{F}\) and project the 3D point cloud \(P = \{p_i\}\) onto a spherical feature map \(F\). To this end, following VI-Net, we uniformly partition the sphere into \(W \times H\) regions along the azimuth and elevation axes. We assign to each region the feature of the point that is farthest away; if a region contains no point, its value is set to 0. For each feature map \(F_i \in \{F_g, F_s, F_c\}\), representing the geometric features, the DINOv2 features, and the corresponding RGB values, we use a separate ResNet model \(\mathcal{L}_i\) as a feature extractor. The outputs of these individual extractors are then concatenated and fed into another ResNet for feature fusion, yielding the feature \(F_f\), also denoted SECOND:

\[ \begin{equation}\label{eq6} F_f = \mathcal{L}_f(\mathcal{L}_g(F_g) \oplus \mathcal{L}_s(F_s) \oplus \mathcal{L}_c(F_c)). \end{equation} \]
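
A minimal PyTorch sketch of this stage is given below, assuming the farthest-point-per-bin rasterization described above; the encoders are small convolutional stand-ins for the ResNets \(\mathcal{L}_g, \mathcal{L}_s, \mathcal{L}_c, \mathcal{L}_f\), and all channel sizes (e.g., 12 HP-PPF channels for three panels, 384 DINOv2 channels) are illustrative assumptions.

```python
# Sketch of the SECOND fusion of Eq. (6). spherical_map() rasterizes per-point
# features onto an H x W azimuth/elevation grid (farthest point wins, empty
# bins stay zero); SecondFusion implements F_f = L_f(L_g ⊕ L_s ⊕ L_c).
import torch
import torch.nn as nn

def spherical_map(points, feats, H=64, W=64):
    x, y, z = points.unbind(-1)
    r = points.norm(dim=-1).clamp(min=1e-8)
    az = (torch.atan2(y, x) + torch.pi) / (2 * torch.pi)           # in (0, 1]
    el = (torch.asin((z / r).clamp(-1, 1)) + torch.pi / 2) / torch.pi
    u = (az * W).long().clamp(max=W - 1)
    v = (el * H).long().clamp(max=H - 1)
    grid = torch.zeros(feats.shape[-1], H, W)
    for i in torch.argsort(r):  # visit near-to-far, so the farthest point wins
        grid[:, v[i], u[i]] = feats[i]
    return grid

class SecondFusion(nn.Module):
    def __init__(self, c_g=12, c_s=384, c_c=3, c_out=128):
        super().__init__()
        def enc(c_in):  # stand-in for a stream-specific ResNet L_i
            return nn.Sequential(
                nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.L_g, self.L_s, self.L_c = enc(c_g), enc(c_s), enc(c_c)
        self.L_f = nn.Sequential(nn.Conv2d(3 * 64, c_out, 3, padding=1),
                                 nn.ReLU())

    def forward(self, F_g, F_s, F_c):
        # Eq. (6): F_f = L_f(L_g(F_g) ⊕ L_s(F_s) ⊕ L_c(F_c))
        return self.L_f(torch.cat(
            [self.L_g(F_g), self.L_s(F_s), self.L_c(F_c)], dim=1))
```

Here \(F_g\) would carry the concatenated HP-PPF channels, \(F_s\) the DINOv2 patch features sampled at each point, and \(F_c\) the raw RGB values, each rasterized by `spherical_map` before fusion.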

Advantages of SE(3)-Consistent Fusion

Designing an SE(3)-consistent fusion is a key ingredient for the quality of our method. For 3D rotation, we are learning a mapping from a point cloud and its feature space \((P, F) \in \mathbb{R}^{n \times 3} \times \mathbb{R}^{n \times C}\) to the space of 3D rotations \(R \in SO(3)\):

\[ \begin{equation}\label{eq7} \Phi: \mathbb{R}^{n \times 3} \times \mathbb{R}^{n \times C} \to SO(3). \end{equation} \]

This mapping \(\Phi\) should be rotation-equivariant, i.e.,

\[ \begin{equation}\label{eq8} \Phi(R_x P, \psi_{R_x}(F)) = R_x\Phi(P, F), \forall R_x \in SO(3), \end{equation} \]

where \(\psi_{R_x}\) is the transformation applied to the features when the point cloud is rotated by \(R_x\). This rotation-equivariance relation is crucial for the learned model to generalize well to unseen data. If such equivariance is not embedded in the model architecture, these relations have to be learned from large amounts of data, which is limited by the data scale. Our SE(3)-consistent features are approximately rotation-invariant, so that

\[ \begin{equation}\label{eq9} \psi_{R_x}(F) \approx F, \forall R_x \in SO(3), \end{equation} \]

which removes the influence of \(\psi_{R_x}\) in Eq. \(\eqref{eq8}\) and thus makes the rotation-equivariance relation easier to learn.
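
Because HP-PPF is built only from distances and angles between normals, its rotation invariance can be verified numerically; a quick check (reusing `hp_ppf` from the sketch in Sec. 3.3, with toy normals) is shown below. The invariance is exact here only because the normals are rotated analytically together with the points; in practice, normals re-estimated from a rotated depth map make the invariance approximate, as Eq. \(\eqref{eq9}\) states.

```python
# Numerical check of Eq. (9) for HP-PPF, reusing hp_ppf() from Sec. 3.3.
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))
nrm = pts / np.linalg.norm(pts, axis=1, keepdims=True)  # toy unit normals

R = Rotation.random(random_state=0).as_matrix()         # random R_x in SO(3)
f0 = hp_ppf(pts, nrm)
f1 = hp_ppf(pts @ R.T, nrm @ R.T)                       # rotate cloud+normals
print(np.abs(f0 - f1).max())  # ~1e-12: distances and angles are preserved
```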

3.5. SecondPose Training and Inference

Following VI-Net, we adopt a lightweight PointNet++ as the translation and size estimation head. Given an RGB-D image, similar to VI-Net, we first segment the object of interest using Mask-RCNN. We then randomly select \(N\) points from the back-projected 3D point cloud \(P \in \mathbb{R}^{n \times 3}\) with RGB features \(F_c\) and use them to estimate the translation and size, as shown in Figure 2.D.

The core of our method is therefore developed for the more challenging task of 3D rotation estimation. In essence, we train a translation-size network and a rotation network separately. For the translation-size network, we use an L1 loss on both size and translation:

\[ \begin{equation}\label{eq10} L_{ts} = \lambda_t|t_{pred} - t_{gt}| + \lambda_s|s_{pred} - s_{gt}|. \end{equation} \]

For the 3D rotation, we directly predict a 9-dimensional rotation matrix and optimize it with an L1 loss:

\[ \begin{equation}\label{eq11} L_R = |R_{pred} - R_{gt}|. \end{equation} \]

During training, the point cloud is centered and normalized using the ground-truth translation and size before rotation estimation; during inference, the predicted size and translation are used for this normalization instead.
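
For reference, the two objectives of Eqs. \(\eqref{eq10}\) and \(\eqref{eq11}\) are a few lines of PyTorch; the batch reduction and default weights \(\lambda_t, \lambda_s\) below are our illustrative choices.

```python
# Sketch of the training losses, Eqs. (10) and (11).
import torch

def loss_ts(t_pred, t_gt, s_pred, s_gt, lambda_t=1.0, lambda_s=1.0):
    """Eq. (10): L1 on translation (B, 3) and size (B, 3)."""
    return (lambda_t * (t_pred - t_gt).abs().sum(-1).mean()
            + lambda_s * (s_pred - s_gt).abs().sum(-1).mean())

def loss_rot(R_pred, R_gt):
    """Eq. (11): element-wise L1 on the predicted rotation matrix (B, 3, 3)."""
    return (R_pred - R_gt).abs().sum(dim=(-2, -1)).mean()
```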

4. Experiment

4.1. Experimental Setup

4.2. Comparison with State-of-the-Art Methods

Table 1. Quantitative comparisons of different methods for category-level 6D object pose estimation on REAL275 [44]. '†' denotes the CATRE [29] IoU metrics. The best results are in bold, and the second best results are underlined.

| Method | Priors | IoU75† | 5° 2 cm | 5° 5 cm | 10° 2 cm | 10° 5 cm |
| --- | --- | --- | --- | --- | --- | --- |
| SPD [39] | Shape | 27.0 | 19.3 | 21.4 | 43.2 | 54.1 |
| CR-Net [45] | Shape | 33.2 | 27.8 | 34.3 | 47.2 | 60.8 |
| CenterSnap-R [17] | Shape | - | - | 29.1 | - | 64.3 |
| ACR-Pose [11] | Shape | - | 31.6 | 36.9 | 54.8 | 65.9 |
| SAR-Net [22] | Shape | - | 31.6 | 42.3 | 50.3 | 68.3 |
| SSP-Pose [58] | Shape | - | 34.7 | 44.6 | - | 77.8 |
| SGPA [3] | Shape | 37.1 | 35.9 | 39.6 | 61.3 | 70.7 |
| RBP-Pose [57] | Shape | - | 38.2 | 48.1 | 63.1 | 79.2 |
| SPD + CATRE [29] | Shape | 43.6 | 45.8 | 54.4 | 61.4 | 73.1 |
| DPDN [25] | Shape | - | 46.0 | 50.7 | 70.4 | 78.4 |
| FS-Net [5] | - | - | - | 28.2 | - | 60.8 |
| DualPoseNet [26] | - | 30.8 | 29.3 | 35.9 | 50.0 | 66.8 |
| GPV-Pose [8] | - | - | 32.0 | 42.9 | - | 73.3 |
| SS-ConvNet [23] | - | - | 36.6 | 43.4 | 52.6 | 63.5 |
| HS-Pose [59] | - | - | 46.5 | 55.2 | 68.6 | 82.7 |
| IST-Net [28] | - | - | 47.5 | 53.4 | 72.1 | 80.5 |
| VI-Net [27] | - | 48.3 | 50.0 | 57.6 | 70.8 | 82.1 |
| SecondPose (Ours) | Semantic | 49.7 | 56.2 | 63.6 | 74.7 | 86.0 |
Figure 4. Qualitative comparison on REAL275 [44]. We compare our prediction with ground truth and the prediction of our baseline, VI-Net [27]. Our approach achieves significantly higher precision in rotation estimation.
Figure 5. Qualitative comparison on HouseCat6D [18]. We compare our prediction with ground truth and the prediction of our baseline, VI-Net [27].
Table 2. Overall and class-wise evaluation of 3D IoU (at 25% / 50%) on the HouseCat6D dataset [18]. Each cell reports IoU25 / IoU50. The best results are in bold.

| Approach | All | Bottle | Box | Can | Cup | Remote | Teapot | Cutlery | Glass | Tube | Shoe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NOCS [44] | 50.0 / 21.2 | 41.9 / 5.0 | 43.3 / 6.5 | 81.9 / 62.4 | 68.8 / 2.0 | 81.8 / 59.8 | 24.3 / 0.1 | 14.7 / 6.0 | 95.4 / 49.6 | 21.0 / 4.6 | 26.4 / 16.5 |
| FS-Net [5] | 74.9 / 48.0 | 65.3 / 45.0 | 31.7 / 1.2 | 98.3 / 73.8 | 96.4 / 68.1 | 65.6 / 46.8 | 69.9 / 59.8 | 71.0 / 51.6 | 99.4 / 32.4 | 79.7 / 46.0 | 71.4 / 55.4 |
| GPV-Pose [8] | 74.9 / 50.7 | 66.8 / 45.6 | 31.4 / 1.1 | 98.6 / 75.2 | 96.7 / 69.0 | 65.7 / 46.9 | 75.4 / 61.6 | 70.9 / 52.0 | 99.6 / 62.7 | 76.9 / 42.4 | 67.4 / 50.2 |
| VI-Net [27] | 80.7 / 56.4 | 90.6 / 79.6 | 44.8 / 12.7 | 99.0 / 67.0 | 96.7 / 72.1 | 54.9 / 17.1 | 52.6 / 47.3 | 89.2 / 76.4 | 99.1 / 93.7 | 94.9 / 36.0 | 85.2 / 62.4 |
| SecondPose (Ours) | 83.7 / 66.1 | 94.5 / 79.8 | 54.5 / 23.7 | 98.5 / 93.2 | 99.8 / 82.9 | 53.6 / 35.4 | 81.0 / 71.0 | 93.5 / 74.4 | 99.3 / 92.5 | 75.6 / 35.6 | 86.9 / 73.0 |

4.3. Limitations

4.4. Ablation Studies

Table 3. Ablation study on REAL275 [44]. '†' denotes the CATRE [29] IoU metrics.

| Row | Method | IoU75† | 5° 2 cm | 5° 5 cm | 10° 2 cm | 10° 5 cm |
| --- | --- | --- | --- | --- | --- | --- |
| A0 | SecondPose (baseline) | 49.7 | 56.2 | 63.6 | 74.7 | 86.0 |
| B0 | w/o semantic | 48.0 | 51.1 | 58.9 | 71.6 | 82.4 |
| B1 | w/o geometric | 49.5 | 55.1 | 62.3 | 73.7 | 84.8 |
| B2 | w/o semantic + geometric | 48.5 | 49.9 | 57.4 | 70.4 | 80.8 |
| C0 | w/o \(d\) in Eq. \(\eqref{eq1}\) | 49.1 | 55.1 | 63.1 | 73.7 | 85.0 |
| C1 | w/o \(\alpha\) in Eq. \(\eqref{eq1}\) | 49.3 | 54.7 | 62.8 | 73.1 | 84.7 |
| C2 | w/o \(\beta\) in Eq. \(\eqref{eq1}\) | 49.6 | 54.8 | 62.7 | 74.6 | 86.7 |
| C3 | w/o \(\theta\) in Eq. \(\eqref{eq1}\) | 49.5 | 55.1 | 63.1 | 74.2 | 85.6 |
| D0 | KNN panel (10 nearest neighbors) | 49.4 | 55.4 | 63.1 | 73.7 | 85.5 |
| E0 | random rotation 5° | 49.7 | 56.1 | 63.4 | 74.6 | 85.9 |
| E1 | random rotation 10° | 49.4 | 55.8 | 63.5 | 74.4 | 85.8 |
| E2 | random rotation 15° | 48.5 | 55.4 | 63.0 | 73.9 | 85.4 |
| E3 | random rotation 20° | 47.9 | 54.5 | 62.4 | 73.2 | 85.1 |
| F0 | manual occlusion n = 16 | 49.7 | 56.0 | 63.6 | 74.8 | 86.2 |
| F1 | manual occlusion n = 8 | 49.5 | 55.7 | 63.2 | 74.3 | 85.6 |
| F2 | manual occlusion n = 4 | 46.7 | 52.5 | 60.9 | 71.5 | 84.6 |
| G0 | random perturbation s = 0.002 | 49.7 | 56.1 | 63.6 | 74.6 | 85.8 |
| G1 | random perturbation s = 0.005 | 49.6 | 55.8 | 63.4 | 74.4 | 86.0 |
| G2 | random perturbation s = 0.01 | 45.9 | 53.7 | 62.6 | 73.4 | 86.1 |

5. Conclusion

Supplementary Material

A. Implementation Details

Table 4. Parameter count.

| Method | Trainable | Frozen |
| --- | --- | --- |
| VI-Net | 27,311,368 | 0 |
| Ours | 33,639,561 | 22,056,576 |

B. Further Explanations of the Pipeline

Figure 6. Feature Fusion. We illustrate the fusion process, annotating the approximately equivariant and approximately invariant features.
Figure 7. Feature Maps. We fuse the features of RGB, DINOv2, and HP-PPF into a 2D feature map that is approximately SE(3)-equivariant.

C. More Experimental Results on HouseCat6D

Table 5. Quantitative comparisons of different methods for category-level 6D object pose estimation on HouseCat6D [18].

| Method | IoU75 | 5° 2 cm | 5° 5 cm | 10° 2 cm | 10° 5 cm |
| --- | --- | --- | --- | --- | --- |
| FS-Net [5] | 14.8 | 3.3 | 4.2 | 17.1 | 21.6 |
| GPV-Pose [8] | 15.2 | 3.5 | 4.6 | 17.8 | 22.7 |
| VI-Net [27] | 20.4 | 8.4 | 10.3 | 20.5 | 29.1 |
| SecondPose (Ours) | 24.9 | 11.0 | 13.4 | 25.3 | 35.7 |
Figure 10. Categorical results on HouseCat6D. We visualize the comparison of our IoU25 and IoU50 results on HouseCat6D with those of VI-Net.

D. Failure Cases and Limitations

Figure 8. Failure cases in HouseCat6D. We illustrate common failure scenarios on HouseCat6D. (A) depicts transparent items; (B) showcases items with pronounced self-occlusion; (C) shows the tube, representing items with high reflectivity; (D) illustrates failures attributed to atypical shapes.
Figure 9. Failure cases in REAL275. We illustrate common failure scenarios on REAL275. (A) shows a failure due to wrong instance segmentation; (B)-(D) illustrate failures due to wrong prediction of the y-axis.