[Paper Notes] CRISP: Object Pose and Shape Estimation with Test-Time Adaptation
| Method | Type | Training input | Inference input | Output | Pipeline |
|---|---|---|---|---|---|
| CRISP | category-agnostic | RGB-D | RGB-D | \(\mathbf{R}, \mathbf{t}\) + object model | |
- 2025.04.06:
Abstract
1. Introduction
Towards addressing these issues, this paper presents three main contributions:
- We introduce CRISP, an object pose and shape estimation pipeline. CRISP combines a pre-trained vision transformer (ViT) backbone with a dense prediction transformer (DPT) and feature-wise linear modulation (FiLM) conditioning to estimate the 6D pose and 3D shape of an object from a single RGB-D image [31, 33]. CRISP is category-agnostic (i.e., it does not require knowledge of the object category at test time).
- We propose an optimization-based pose and shape corrector that can correct estimation errors. The corrector is formulated as a bi-level optimization problem, which we solve with block coordinate descent. We approximate the shape decoder in CRISP by an active shape model, and show that (i) this is a reasonable approximation, and (ii) doing so turns the inner problem into a constrained linear least squares problem, which can be solved efficiently with interior-point methods and yields shapes comparable in quality to those of the trained decoder.
- We adapt a correct-and-certify approach to self-train CRISP and bridge large domain gaps. During self-training, we use the corrector to fix pose and shape estimation errors. Then, we assess the quality of the corrector's output using an observable correctness certificate inspired by [36], and create pseudo-labels from the estimates that pass the certificate check. Finally, we train the model on these pseudo-labels with standard stochastic gradient descent. Contrary to [22, 30], we do not need access to synthetic data during self-training.
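To make the FiLM conditioning in the first contribution concrete, here is a minimal NumPy sketch (not the paper's implementation): FiLM applies a per-channel affine transform, with the scale and shift predicted from a conditioning input.

```python
import numpy as np

def film(features, gamma, beta):
    # FiLM: feature-wise linear modulation. gamma/beta carry one value per
    # channel and are broadcast over the spatial dimensions of the feature map.
    return gamma[:, None, None] * features + beta[:, None, None]

feats = np.ones((8, 4, 4))   # (channels, H, W) feature map, all ones for clarity
gamma = np.full(8, 2.0)      # per-channel scale (in CRISP, from a conditioning net)
beta = np.full(8, -1.0)      # per-channel shift
out = film(feats, gamma, beta)
print(out[0, 0, 0])          # 2.0 * 1.0 - 1.0 = 1.0
```

In CRISP, the gamma/beta would come from a learned conditioning network rather than constants; the sketch only shows the modulation itself.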
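The second contribution's key step, replacing the shape decoder with an active shape model, can be sketched with synthetic data. Under the active shape model, a shape is the mean shape plus a linear combination of basis shapes, so fitting the coefficients to an observed point set becomes a bound-constrained linear least squares problem (here solved with SciPy's `lsq_linear`; all sizes and bounds are illustrative, not the paper's):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)

# Toy active shape model: shape(c) = mean_shape + sum_k c[k] * basis[k].
n_points, n_basis = 50, 4
mean_shape = rng.normal(size=(n_points, 3))
basis = rng.normal(size=(n_basis, n_points, 3))

# Ground-truth coefficients and a noisy "observed" shape.
c_true = np.array([0.8, -0.3, 0.5, 0.1])
observed = mean_shape + np.tensordot(c_true, basis, axes=1)
observed += 0.01 * rng.normal(size=observed.shape)

# Recovering c is linear in c: minimize ||A c - b||^2 s.t. bounds on c.
A = basis.reshape(n_basis, -1).T        # (3 * n_points, n_basis)
b = (observed - mean_shape).ravel()
res = lsq_linear(A, b, bounds=(-1.0, 1.0))  # bound-constrained linear LS
print(res.x)                                # close to c_true
```

This linearity is what makes the inner shape problem cheap to solve inside the corrector's block coordinate descent loop.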
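The correct-and-certify self-training loop in the last contribution can be sketched as below. Every function here is a toy stand-in (a 1-D "estimate" with a clamping corrector and a distance-based certificate), chosen only to show the control flow: correct, certify, keep certified estimates as pseudo-labels.

```python
def corrector(estimate):
    # Stand-in for the bi-level pose/shape corrector: clamp to a valid range.
    return max(0.0, min(1.0, estimate))

def certificate(estimate, observation, tol=0.1):
    # Stand-in for the observable correctness certificate:
    # accept only estimates consistent with the observation.
    return abs(estimate - observation) <= tol

def self_train_step(model_predict, observations):
    pseudo_labels = []
    for obs in observations:
        est = corrector(model_predict(obs))   # 1. correct the raw prediction
        if certificate(est, obs):             # 2. certify the corrected estimate
            pseudo_labels.append((obs, est))  # 3. keep as a pseudo-label
    return pseudo_labels                      # 4. then train on these with SGD

biased_model = lambda x: x + 0.05             # model with a small domain-gap bias
labels = self_train_step(biased_model, [0.2, 0.5, 0.9])
print(labels)                                 # all three observations pass
```

In CRISP the pseudo-labels are certified pose and shape estimates on real target-domain images, and step 4 is ordinary SGD on those labels, which is why no synthetic data is needed during self-training.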