LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation

¹Agile Robots SE
²Amigos Robots
³Technical University of Munich, Germany
*Equal Contribution

Accepted by IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

Abstract

Learning dexterous manipulation from few-shot demonstrations is a significant yet challenging problem for advanced, human-like robotic systems. Dense distilled feature fields have addressed this challenge by distilling rich semantic features from 2D visual foundation models into the 3D domain. However, their reliance on neural rendering models such as Neural Radiance Fields (NeRF) or Gaussian Splatting results in high computational costs. In contrast, previous approaches based on sparse feature fields either suffer from inefficiencies due to multi-view dependencies and extensive training or lack sufficient grasp dexterity. To overcome these limitations, we propose Language-ENhanced Sparse Distilled Feature Field (LensDFF), which efficiently distills view-consistent 2D features onto 3D points using our novel language-enhanced feature fusion strategy, thereby enabling single-view few-shot generalization. Based on LensDFF, we further introduce a few-shot dexterous manipulation framework that integrates grasp primitives into the demonstrations to generate stable and highly dexterous grasps. Moreover, we present a real2sim grasp evaluation pipeline for efficient grasp assessment and hyperparameter tuning. Through extensive simulation experiments based on the real2sim pipeline and real-world experiments, our approach achieves competitive grasping performance, outperforming state-of-the-art approaches.

Summary

- LensDFF is a lightweight framework for few-shot dexterous robotic manipulation.

- It combines sparse 3D feature representations with language-guided semantic cues.

- Unlike dense feature fields, which rely on neural rendering models such as NeRF or Gaussian Splatting, it works from a single view.

- It uses vision-language models to distill rich semantic features efficiently.

- It integrates grasp primitives into the demonstrations to improve grasp stability and dexterity.

- It includes a real-to-simulation (real2sim) pipeline for efficient grasp evaluation and hyperparameter tuning.

- It demonstrates strong performance in both simulation and real-world experiments.

- It achieves competitive grasping performance at lower computational cost, enabling practical low-data robotic learning.
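
The idea of distilling 2D features onto 3D points from a single view can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain pinhole camera, nearest-pixel lookup, and a per-pixel feature map that some 2D foundation model has already produced. The function name `distill_features_to_points` and all parameter names are hypothetical. The language-enhanced feature fusion step is not shown, since this page does not describe its details.

```python
import numpy as np


def distill_features_to_points(points_cam, feature_map, K):
    """Assign each 3D point the 2D feature found at its pinhole projection.

    Hypothetical helper for illustration only (not the LensDFF code).
    points_cam:  (N, 3) points expressed in the camera frame.
    feature_map: (H, W, C) per-pixel features from a 2D model (assumed given).
    K:           (3, 3) intrinsics, scaled to the feature map resolution.
    Returns (features of shape (N, C), boolean validity mask of shape (N,)).
    """
    H, W, C = feature_map.shape
    z = points_cam[:, 2]
    in_front = z > 1e-6
    # Avoid dividing by zero or negative depth; such points are masked out below.
    safe_z = np.where(in_front, z, 1.0)
    u = K[0, 0] * points_cam[:, 0] / safe_z + K[0, 2]
    v = K[1, 1] * points_cam[:, 1] / safe_z + K[1, 2]
    col = np.rint(u).astype(int)
    row = np.rint(v).astype(int)
    valid = in_front & (col >= 0) & (col < W) & (row >= 0) & (row < H)
    # Points that fall outside the image or behind the camera keep a zero feature.
    features = np.zeros((points_cam.shape[0], C), dtype=feature_map.dtype)
    features[valid] = feature_map[row[valid], col[valid]]
    return features, valid


# Toy data: a 4x4 feature map with 2 channels and three 3D points.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 2.0],
              [0.0, 0.0, 1.0]])
feature_map = np.arange(4 * 4 * 2, dtype=np.float32).reshape(4, 4, 2)
points = np.array([[0.0, 0.0, 1.0],    # projects to pixel (row 2, col 1)
                   [5.0, 0.0, 1.0],    # projects outside the image
                   [0.0, 0.0, -1.0]])  # lies behind the camera
features, valid = distill_features_to_points(points, feature_map, K)
```

Because each point only needs one lookup into a single image, a sketch like this involves no neural rendering or multi-view optimization, which is the efficiency argument made for sparse feature fields in the list above.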

BibTeX

@misc{feng2025lensdfflanguageenhancedsparsefeature,
  title={LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation},
  author={Qian Feng and David S. Martinez Lema and Jianxiang Feng and Zhaopeng Chen and Alois Knoll},
  year={2025},
  eprint={2503.03890},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2503.03890},
}


