咕咕嘎嘎-NonPrehensile-1-HAMNET

咕咕嘎嘎 2026-03-30

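I'm starting a 咕咕嘎嘎 column to record things I read, pick up, and learn along the way in research.

My earlier paper-reading notes felt too rigid in format; this should be more flexible. I also won't restrict myself to summarizing in English as before: Chinese when I feel like Chinese, English when I feel like English.

It also doubles as a place to talk to myself.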

UNICORN: https://unicorn-hamnet.github.io/static/pdf/paper.pdf

This paper introduces what looks like the strongest nonprehensile manipulation architecture I've seen so far, one that generalizes across objects, actions, and environments.

To be fair, the introduction of this paper is extremely detailed. It says that, inspired by the modularity of motor control in the human brain, the authors propose a hierarchical, modular model (HAMNET) in which a modulation network decides which modules to activate. This is roughly the MoE idea, though not quite the same; details below. It also brings up the geometry representation: because environments differ, the model needs a representation that captures contact between two arbitrary objects (which seems to be quite important and useful). Finally, on environment design, they build environments out of many primitives to get generalization over environments.

On the training side, the rough summary is: leverage deep RL with parallel simulation to train a modular policy, using a pre-trained geometric representation, on procedurally generated environments. There is also a teacher-student distillation step at the end.

The whole policy design is shown in Fig. 7 and Fig. 8 of the paper. What follows is just a restatement.

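Going from back to front. First is the base network, the part that finally outputs the action. It first applies cross attention over (O, object + hand + goal), concatenates the result with the previous action and phys, and passes that through an MLP to get the actor / critic. This MLP is a bit special: each layer consists of \(m\) modules \(\theta_i\), i.e. \(m\) different sets of weights, and in the forward pass the modulation network combines these modules to produce the layer's actual weights \(\theta=f(\theta_i)\).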

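Next is the modulation network. It first applies cross attention over (G, joint + object state) and (L, joint + object state), concatenates the results with the previous action, phys, and O, and passes that through an MLP to get \(z\) (the modulation embedding). \(z\) is then mapped to a module-wise activation factor \(w\) (after a softmax) and a feature-wise activation factor \(g\) (a gate). With \(w\) and \(g\), the weights of a layer are computed as \(\theta_j=g_j\odot \sum_i w_{i,j}\theta_{i,j}\), where, as far as I can tell, \(i\) indexes the modules and \(j\) the layer. The actor and critic each have their own \(g\), \(w\), and \(\theta\).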
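To make the formula \(\theta_j=g_j\odot \sum_i w_{i,j}\theta_{i,j}\) concrete for myself, here is a minimal sketch of composing one layer's weights. It is not the paper's implementation: the sigmoid gate, applying \(g\) per output feature (per row of the weight matrix), and the names `compose_layer`, `module_logits`, and `gate_logits` are all my assumptions. The two logit vectors stand in for whatever learned maps turn \(z\) into \(w\) and \(g\). Plain Python lists keep it dependency-free.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sigmoid(x):
    # Stable sigmoid; used here as the gate (an assumption, the notes only say "a gate").
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def compose_layer(modules, module_logits, gate_logits):
    """Compose one layer: theta = g * sum_i w_i * theta_i.

    modules:       list of m weight matrices, each out_dim x in_dim (nested lists)
    module_logits: m raw scores, turned into module-wise weights w by softmax
    gate_logits:   out_dim raw scores, turned into the feature-wise gate g
    """
    w = softmax(module_logits)
    g = [sigmoid(x) for x in gate_logits]
    out_dim, in_dim = len(modules[0]), len(modules[0][0])
    theta = [
        [g[r] * sum(w[i] * modules[i][r][c] for i in range(len(modules)))
         for c in range(in_dim)]
        for r in range(out_dim)
    ]
    return theta, w, g

def linear(theta, x):
    # Apply the composed weight matrix to an input vector.
    return [sum(t * xi for t, xi in zip(row, x)) for row in theta]

# Toy sizes, chosen only for illustration.
rng = random.Random(0)
m, out_dim, in_dim = 4, 3, 5
modules = [[[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
           for _ in range(m)]
module_logits = [rng.gauss(0, 1) for _ in range(m)]
gate_logits = [rng.gauss(0, 1) for _ in range(out_dim)]

theta, w, g = compose_layer(modules, module_logits, gate_logits)
y = linear(theta, [1.0] * in_dim)
```

The difference from a typical MoE shows up here: the modules are blended in weight space before the forward pass, rather than routing the input to a few experts and mixing their outputs.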

Finally, the geometric representation (UNICORN). It is frozen during the subsequent policy training. Pre-training works like this: take two objects \(A,B\), place them in a near-contact configuration, and predict whether the \(i\)-th patch of \(A\) is in contact with \(B\). As for the model, each point cloud goes through a patch-based, PointNet++-like encoder that yields a global feature and per-patch features; the contact decoder then takes the patch features of \(A\) together with the global feature of \(B\).
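A rough sketch of how I picture the contact decoder's input and output, assuming the encoder features are already computed. Everything concrete here is hypothetical: a single linear layer stands in for the real decoder, binary cross-entropy is my guess at the loss for a per-patch yes/no prediction, and the names `contact_probs` and `bce` as well as the feature sizes are made up for illustration.

```python
import math
import random

def sigmoid(x):
    # Stable sigmoid to turn a logit into a probability.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def contact_probs(patch_feats_a, global_feat_b, weight, bias):
    """Return one contact probability per patch of A.

    Each decoder input is [patch feature of A ; global feature of B].
    A single linear layer replaces the real decoder (assumption).
    """
    probs = []
    for patch in patch_feats_a:
        x = patch + global_feat_b  # list concatenation
        logit = sum(wi * xi for wi, xi in zip(weight, x)) + bias
        probs.append(sigmoid(logit))
    return probs

def bce(probs, labels, eps=1e-7):
    # Mean binary cross-entropy; the loss choice is an assumption.
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

# Toy features standing in for the frozen encoder's outputs.
rng = random.Random(0)
n_patches, d_patch, d_global = 6, 4, 4
patch_feats_a = [[rng.gauss(0, 1) for _ in range(d_patch)] for _ in range(n_patches)]
global_feat_b = [rng.gauss(0, 1) for _ in range(d_global)]
weight = [rng.gauss(0, 0.1) for _ in range(d_patch + d_global)]
labels = [1, 0, 0, 1, 0, 0]  # made-up contact labels for the six patches of A

probs = contact_probs(patch_feats_a, global_feat_b, weight, bias=0.0)
loss = bce(probs, labels)
```

The asymmetry is the point: only \(A\) is resolved at patch level, while \(B\) enters as a single summary vector.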

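Some notes on the RL training: it uses curriculum learning. At the start the arm is initialized right next to the object and the ceiling is fairly high; later the share of episodes with random initialization and a tight ceiling is increased. Nothing else stands out; they just brute-forced it with 1024 environments.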
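The curriculum could be sketched like this. The notes only say that the share of random initialization and tight ceilings grows over time, so the linear ramp, sampling the two settings independently, and the names `hard_fraction` and `sample_episode_configs` are assumptions of mine, not the paper's schedule.

```python
import random

NUM_ENVS = 1024  # number of environments mentioned in the notes

def hard_fraction(progress, start=0.0, end=1.0):
    """Share of episodes using the hard setting at a given training progress in [0, 1].

    A linear ramp is assumed; the actual schedule is not given in these notes.
    """
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress

def sample_episode_configs(progress, rng, num_envs=NUM_ENVS):
    # Draw one config per environment; the two hard settings are sampled independently.
    frac = hard_fraction(progress)
    configs = []
    for _ in range(num_envs):
        configs.append({
            "arm_init": "random" if rng.random() < frac else "near_object",
            "ceiling": "tight" if rng.random() < frac else "high",
        })
    return configs

rng = random.Random(0)
early = sample_episode_configs(0.0, rng)  # all easy: arm near object, high ceiling
late = sample_episode_configs(1.0, rng)   # all hard: random init, tight ceiling
```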

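That's about it for this paper. The next day I read two more papers on improving large-scale RL training (SAPG and CPO), but only skimmed the surface. If I need to dig deeper, maybe they'll go into a later 咕咕嘎嘎 post?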