Jarvis

December 2023
OpenCV, Python, Mediapipe, Flask, Next.js, TypeScript, Three.js
Featured on OpenCV Live

Overview

We built Jarvis for our design project course at the University of Waterloo. Inspired by Iron Man, it combines computer vision and an optical illusion into a gesture-controlled 3D hologram. Building it was a fun exercise in bringing sci-fi to reality, and the finished product was genuinely cool to interact with.

3D Models
Render 3D models with a holographic view.
Interactivity
Interactive and gesture-controlled model manipulation.
Voice Commands
Voice commands for ease of use and immersion.

Building Jarvis

Architecture

We used a Flask backend that runs a WebSocket server and an OpenCV processing thread in parallel, connected to a Next.js frontend powered by Three.js. Here's the architecture diagram:

Project Architecture

Backend

Our goal was to achieve accurate 3D hand tracking and gesture recognition. We originally planned to use a Leap Motion Controller which uses infrared sensors and cameras for highly accurate hand tracking. However, due to project constraints, we opted for a simpler, more manual approach.

Using the Mediapipe library, we could track hand position in two dimensions fairly easily by reading the coordinates of a specific hand landmark within the camera's field of view. Depth was trickier. We measured the distance between two hand landmarks (0 and 5, the wrist and the base of the index finger) whose separation stays roughly constant across gestures; since that span's real-world length is fixed, its apparent size in the image tracks how far the hand is from the camera. Scaling those values linearly gave us surprisingly accurate real-world depth measurements.
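As a rough sketch of how this works with Mediapipe's hands solution (the calibration constants below are placeholders you would fit yourself by measuring the landmark span at a couple of known hand-to-camera distances):

```python
import math
import cv2
import mediapipe as mp

# Placeholder calibration constants (assumed): fit these by recording the
# landmark span at two known distances and solving the linear mapping.
DEPTH_SLOPE = -120.0   # cm per unit of normalized span
DEPTH_OFFSET = 55.0    # cm

def estimate_hand_position(frame, hands):
    """Return (x, y, depth_cm) for the first detected hand, or None."""
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    lm = results.multi_hand_landmarks[0].landmark
    wrist, index_base = lm[0], lm[5]          # landmarks 0 and 5

    # x/y come straight from a landmark's normalized image coordinates.
    x, y = index_base.x, index_base.y

    # The wrist-to-index-base span barely changes across gestures, so its
    # apparent size in the frame is a usable proxy for distance to the camera.
    span = math.hypot(wrist.x - index_base.x, wrist.y - index_base.y)
    depth_cm = DEPTH_SLOPE * span + DEPTH_OFFSET
    return x, y, depth_cm

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    with mp.solutions.hands.Hands(max_num_hands=1,
                                  min_detection_confidence=0.7) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            pos = estimate_hand_position(frame, hands)
            if pos:
                print("x=%.2f y=%.2f depth=%.1f cm" % pos)
```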

Handtracking

To ensure accuracy, we corrected camera distortion using OpenCV functions, including findChessboardCorners, calibrateCamera, getOptimalNewCameraMatrix, and undistort.
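Concretely, that calibration step looks something like the following (the board dimensions and image folder are placeholders for whatever calibration target you photograph):

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)   # inner corners of the printed chessboard (assumed size)

# Known 3D positions of the chessboard corners on a flat board.
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calibration/*.jpg"):   # hypothetical folder of board photos
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN, None)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Solve for the camera matrix and lens-distortion coefficients.
_, mtx, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

def undistort(frame):
    """Remove lens distortion from a frame before running hand tracking on it."""
    h, w = frame.shape[:2]
    new_mtx, roi = cv2.getOptimalNewCameraMatrix(mtx, dist, (w, h), 1, (w, h))
    corrected = cv2.undistort(frame, mtx, dist, None, new_mtx)
    x, y, rw, rh = roi
    return corrected[y:y + rh, x:x + rw]
```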

The backend pipeline continuously processed camera frames, extracting 3D coordinates and gesture data from Mediapipe models in real time. This data was sent via WebSockets to our frontend, all running on a multithreaded Flask server.
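The exact server wiring isn't shown here, and the original may have used a different WebSocket stack, but a stripped-down version of that loop could look like this with Flask-SocketIO (the event name, port, and the estimate_hand_position helper from the earlier sketch are assumptions):

```python
import json
import cv2
import mediapipe as mp
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*")

def tracking_loop():
    """Background task: read frames, run hand tracking, push results to clients."""
    cap = cv2.VideoCapture(0)
    with mp.solutions.hands.Hands(max_num_hands=1) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                continue
            # estimate_hand_position is the hypothetical helper sketched above.
            pos = estimate_hand_position(frame, hands)
            if pos:
                x, y, depth = pos
                # The Next.js client subscribes to this event and updates the
                # Three.js scene with the new hand position.
                socketio.emit("hand", json.dumps({"x": x, "y": y, "depth": depth}))
            socketio.sleep(0)   # yield so the server can keep serving clients

if __name__ == "__main__":
    socketio.start_background_task(tracking_loop)
    socketio.run(app, host="0.0.0.0", port=5000)
```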

Frontend

The Next.js app displayed and manipulated interactive 3D models based on the position and gesture data provided by the backend. We used Three.js alongside various pmndrs libraries and assets from Sketchfab to perform model manipulation.

We integrated voice commands and text-to-speech through the Web Speech API and ElevenLabs. The Web Speech API let users control Jarvis with natural voice commands: after saying "Hey Jarvis," they could issue commands entirely by voice, which made the experience more immersive. Text-to-speech feedback added an extra layer of interactivity, confirming commands in real time.

Projecting these visuals onto the hologram hardware worked because the scene sat on a pure black background: the black regions show nothing on the hologram, so only the model is visible, creating the illusion of floating 3D elements.

Hardware

To achieve the holographic effect, we leveraged a Pepper's Ghost illusion. By reflecting the image from a downward-facing monitor off a piece of acrylic angled at 45 degrees, we created the effect of a floating holographic image!

Our setup included a custom wooden frame to support the monitor and acrylic, painted entirely black to strengthen the illusion. For the system's eyes, we placed a webcam facing upward beneath the user's hands so that nothing obstructed the view of the display.

Hardware Setup

Results

Watch our full project demonstration or check out the demo video:

The final working hologram was incredible to witness in person. Gesture-controlled interactions, accurate 3D tracking, and voice commands all contributed to a polished and engaging experience.

The project also caught the attention of OpenCV's CEO, and we were invited to present it on OpenCV Live.

Akshar Barot · 12/20/2023