This thesis presents a modular embedded perception-to-control pipeline for vision-based robotic manipulation and evaluates two approaches to a shape-sorting task with a Franka Emika FR3. The system runs on an NVIDIA Jetson AGX Orin with an eye-in-hand Intel RealSense D435i. A custom YOLOv11 detector recognizes four toy-object classes, and RGB–D centroids are back-projected to 3D and transformed into the robot base frame using Tsai–Lenz hand–eye calibration.
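The back-projection and frame-transform step can be sketched as follows. This is a minimal illustration, not the thesis implementation: the intrinsics and the two transforms are placeholder values standing in for the D435i calibration, the Tsai–Lenz hand–eye result, and the FR3 forward kinematics.

```python
import numpy as np

# Assumed pinhole intrinsics (placeholders for the D435i calibration).
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0

def backproject(u, v, z):
    """Back-project pixel (u, v) with depth z [m] into the camera frame
    (homogeneous coordinates)."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z, 1.0])

# T_ee_cam: camera pose in the end-effector frame (hand-eye calibration result).
# T_base_ee: end-effector pose in the robot base frame (forward kinematics).
# Identity placeholders here; real values come from calibration and robot state.
T_ee_cam = np.eye(4)
T_base_ee = np.eye(4)

p_cam = backproject(320, 240, 0.5)     # pixel at the principal point, 0.5 m deep
p_base = T_base_ee @ T_ee_cam @ p_cam  # detection centroid in the base frame
```

Chaining the two rigid transforms in this order is what makes the eye-in-hand setup work: the camera-frame point is first expressed in the moving end-effector frame, then in the fixed base frame.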
Two strategies are compared under identical perception: (i) an analytical state machine of motion primitives executed via Cartesian impedance control, and (ii) a reinforcement-learning policy trained in RLBench and deployed zero-shot on real hardware. The analytical method achieves robust real-world sorting and repeatable insertion-like placement within perception limits. The RL policy transfers for reaching and coarse placement, but does not achieve reliable peg insertion due to yaw-estimation and localisation errors.
The thesis concludes with recommendations to improve sim-to-real insertion, including stronger 6D pose estimation, tighter calibration and depth filtering, and better-aligned control interfaces.