Learning Actions from Human Demonstration Video for Robotic Manipulation

September 10, 2019 · Declared Dead · 🏛 IEEE/RSJ International Conference on Intelligent Robots and Systems

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Shuo Yang, Wei Zhang, Weizhi Lu, Hesheng Wang, Yibin Li
arXiv ID: 1909.04312
Category: cs.CV (Computer Vision)
Cross-listed: cs.RO
Citations: 27
Venue: IEEE/RSJ International Conference on Intelligent Robots and Systems
Last Checked: 4 months ago

Abstract

Learning actions from human demonstration is an emerging trend in the design of intelligent robotic systems, a task that can be referred to as video-to-command. The performance of such an approach relies heavily on the quality of video captioning. However, general video captioning methods focus on understanding the full frame and give little consideration to the specific objects of interest in robotic manipulation. We propose a novel deep model to learn actions from human demonstration video for robotic manipulation. It consists of two deep networks: a grasp detection network (GNet) and a video captioning network (CNet). GNet performs two functions: providing grasp solutions and extracting local features for the objects of interest in robotic manipulation. CNet outputs the captioning results by fusing the features of both full frames and local objects. Experimental results on a UR5 robotic arm show that our method produces more accurate commands from video demonstrations than state-of-the-art work, thereby leading to more robust grasping performance.
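The abstract says CNet fuses full-frame features with local object features but does not say how. As a rough illustration only, the sketch below assumes the simplest possible rule, per-frame concatenation. The function names (`fuse_frame`, `fuse_video`) and the toy feature values are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def fuse_frame(global_feat: Vector, local_feat: Vector) -> Vector:
    """Fuse one frame's features by concatenation (an assumed fusion rule)."""
    return list(global_feat) + list(local_feat)


def fuse_video(global_feats: List[Vector], local_feats: List[Vector]) -> List[Vector]:
    """Fuse per-frame features for a clip; both lists need one entry per frame."""
    if len(global_feats) != len(local_feats):
        raise ValueError("expected one global and one local feature per frame")
    return [fuse_frame(g, l) for g, l in zip(global_feats, local_feats)]


# Toy clip (made-up numbers): 2 frames, 3-d full-frame features,
# 2-d features for the object of interest.
global_feats = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
local_feats = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse_video(global_feats, local_feats)
```

In a pipeline of the kind the abstract describes, the fused per-frame vectors would then feed a sequence decoder that emits the command caption. The actual fusion used in CNet may differ, for example a learned weighting rather than plain concatenation.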