Deep Perception

(for manipulation)

Part 1

MIT 6.421

Robotic Manipulation

Fall 2023, Lecture 16

Follow live at https://slides.com/d/dy2gx20/live

(or later at https://slides.com/russtedrake/fall23-lec16)

Limitations of using geometry only

  • No understanding of what an object is.
    • "Double picks"
    • Might pick up a heavy object from one corner
  • Partial views
  • Depth returns don't work for transparent objects
  • ...
  • Some tasks require object recognition, e.g. "pick the mustard bottles"

ImageNet: 14 Million labeled images

Released in 2009

A sample annotated image from the COCO dataset

Traditional computer vision tasks

What object categories/labels are in COCO?

Transfer learning

Something we couldn't have expected...

 

(Pre-)Training on ImageNet/COCO makes it easier to "learn" to recognize other objects

Fine tuning

source: https://d2l.ai/chapter_computer-vision/fine-tuning.html
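A minimal sketch of the simplest fine-tuning variant: keep the pre-trained feature extractor frozen and train only a new output layer on a handful of labeled examples. Everything here is an illustrative assumption. `pretrained_features` is a hypothetical stand-in for a network pre-trained on ImageNet/COCO, and the four data points are made up. Many practical recipes also update the backbone with a small learning rate, which this sketch does not do.

```python
import math


def pretrained_features(x):
    """Hypothetical frozen backbone: a fixed feature map that is never updated."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def head_output(x, weights, bias):
    """New task-specific head: logistic regression on the frozen features."""
    features = pretrained_features(x)
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def fine_tune_head(data, labels, lr=0.5, epochs=50):
    """Train only the head with stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            features = pretrained_features(x)
            error = head_output(x, weights, bias) - y  # d(loss)/d(logit)
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Made-up "new object" dataset: two positives, two negatives.
data = [[1.5, -0.5], [2.0, -1.0], [-0.5, 1.5], [-1.0, 2.0]]
labels = [1, 1, 0, 0]

weights, bias = fine_tune_head(data, labels)
predictions = [int(head_output(x, weights, bias) > 0.5) for x in data]
```

Because the backbone already produces useful features, the new head has only three parameters to learn, which is why a few labeled examples can be enough.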

image from https://arxiv.org/abs/2012.02055

Object classification \(\Rightarrow\) detection (sliding window)

source: https://towardsdatascience.com/understanding-regions-with-cnn-features-r-cnn-ec69c15f8ea7
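A minimal sketch of how a classifier becomes a detector with a sliding window: score every crop and keep the ones above a threshold. The `classify_crop` function is a hypothetical stand-in (it just averages pixel intensities), not a trained network, and the toy image is made up.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (row, col, height, width)


def crop(image: Image, box: Box) -> Image:
    r, c, h, w = box
    return [row[c:c + w] for row in image[r:r + h]]


def classify_crop(patch: Image) -> float:
    """Hypothetical stand-in for a trained classifier: mean intensity in [0, 1]."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


def sliding_window_detect(
    image: Image,
    window: Tuple[int, int],
    stride: int,
    score_fn: Callable[[Image], float],
    threshold: float,
) -> List[Tuple[Box, float]]:
    """Run the classifier on every window position and keep confident boxes."""
    h, w = window
    rows, cols = len(image), len(image[0])
    detections = []
    for r in range(0, rows - h + 1, stride):
        for c in range(0, cols - w + 1, stride):
            box = (r, c, h, w)
            score = score_fn(crop(image, box))
            if score >= threshold:
                detections.append((box, score))
    return detections


# Toy 6x6 image with a bright 2x2 "object" at rows 2-3, columns 2-3.
image = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 1.0

detections = sliding_window_detect(image, (2, 2), 1, classify_crop, 0.99)
```

Every window position (and, in practice, every window size) costs one classifier call, which is the expense that region proposals are meant to reduce.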

Faster R-CNN adds a "region proposal network"

source: https://www.analyticsvidhya.com/blog/2018/07/building-mask-r-cnn-model-detecting-damage-cars-python/

Pick up the mustard bottles...

  1. Segmentation + ICP => model-based grasp selection
  2. Segmentation => antipodal grasp selection
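Both options start by using the segmentation mask to select the depth pixels that belong to the object. A minimal sketch, assuming a pinhole camera model and a boolean instance mask aligned with the depth image. The intrinsics `fx`, `fy`, `cx`, `cy` and the tiny depth image are illustrative values, not taken from the lecture.

```python
def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project the masked depth pixels into 3D camera-frame points."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if keep and z > 0.0:  # skip pixels with no depth return
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


# Made-up 2x3 depth image (meters); 0.0 marks a missing return.
depth = [[1.0, 1.0, 0.0],
         [1.0, 2.0, 2.0]]
# Made-up segmentation mask for one object instance.
mask = [[False, True, True],
        [False, True, False]]

points = masked_point_cloud(depth, mask, fx=1.0, fy=1.0, cx=1.0, cy=0.0)
```

The resulting points could then be registered against a known model with ICP (option 1) or passed directly to an antipodal grasp sampler (option 2).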

"Self-supervised" learning

Example: Text completion

No extra "labeling" of the data required!
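A minimal sketch of why no labeling is needed: the training targets come from the text itself. The function name, the whitespace tokenization, and the example sentence are illustrative assumptions; real language models use learned tokenizers and much longer contexts.

```python
def next_word_pairs(text, context_size):
    """Turn raw text into (context, next word) training pairs."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        pairs.append((context, words[i]))  # the "label" is just the next word
    return pairs


pairs = next_word_pairs("the robot picks up the mustard bottle", context_size=2)
for context, target in pairs:
    print(" ".join(context), "->", target)
```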

GPT-4 is "just" doing next-word prediction

Example: SimCLR

https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html

"Contrastive visual representation learning"
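A minimal sketch of a contrastive objective in the style of SimCLR's NT-Xent loss, evaluated for a single anchor. Two augmented views of the same image form the positive pair; views of other images act as negatives. The embeddings below are made-up 2D vectors and the temperature of 0.5 is an assumed value. The full loss averages this term over every positive pair in a batch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent_loss(anchor, positive, negatives, temperature=0.5):
    """Negative log-probability of picking the positive among all candidates."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(
        math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives
    )
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]

# Good representation: the other view of the same image lands nearby.
loss_aligned = nt_xent_loss(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])

# Poor representation: a different image is closer than the positive view.
loss_misaligned = nt_xent_loss(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])

print(loss_aligned < loss_misaligned)
```

Minimizing this loss pulls the two views of the same image together and pushes views of different images apart, using no human-provided labels.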

Masked Auto-Encoders (MAE)
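A minimal sketch of the masking step, assuming the image has already been split into a sequence of patches. The function name and index-based patch representation are illustrative. The 75% mask ratio is the default reported for MAE on images. The encoder would see only the visible patches, and the decoder would be trained to reconstruct the pixels of the masked ones.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Randomly split patch indices into visible and masked sets."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# A 4x4 grid of patches with three quarters hidden from the encoder.
visible, masked = random_patch_mask(num_patches=16, mask_ratio=0.75, seed=0)
print("encoder input patches:", visible)
print("reconstruction targets:", masked)
```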

Example: Monocular Depth Estimation

 

Foundation models

quick experiments using CLIP "out of the box" by Kevin Zakka
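A minimal sketch of the zero-shot classification step that makes CLIP usable "out of the box": embed the image and a set of candidate text prompts, then apply a softmax over the scaled cosine similarities. The embeddings below are made-up 3D vectors standing in for the outputs of CLIP's image and text encoders, and the logit scale of 100 is an assumed value. No real CLIP API is called here.

```python
import math


def normalize(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities between image and prompts."""
    image = normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = normalize(text_embedding)
        logits.append(logit_scale * sum(a * b for a, b in zip(image, text)))
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - largest) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a mustard bottle", "a photo of a soup can"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]  # hypothetical values
image_embedding = [0.8, 0.2, 0.1]                     # hypothetical values

probs = zero_shot_probs(image_embedding, text_embeddings)
best_label = labels[probs.index(max(probs))]
print(best_label)
```

Changing the set of recognizable objects only requires changing the list of prompts; no retraining is involved.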

Segment Anything (Meta)

Segment Anything

The open-source release doesn't accept text prompts. You need a wrapper, e.g. Grounded Segment Anything.

6D Object Pose Estimation Challenge

  • Until 2019, geometric pose estimation was still winning*.
  • In 2020, CosyPose (Mask R-CNN + deep pose estimation + geometric pose refinement) was best.

* - partly due to low render quality?

Lecture 16: Deep Perception (part 1)

By russtedrake

MIT Robotic Manipulation Fall 2023 http://manipulation.csail.mit.edu