Put What Where?

Like many people, my husband and I have been increasing our home’s intelligence as of late. We can vacuum the house while we’re gone, program the lights to simulate our presence, and turn the heat up at 3am without leaving our bed. Santa brought our home a cerebral cortex to coordinate all these smart gadgets. I won’t divulge which digital assistant he brought, but I can say it’s not Jarvis.

As I think about what we might automate and voice-control next, I can’t help but think about what I can’t automate: “Alexa, put that there.” Alexa is going to choke on that one. I’m sure this doesn’t bother most people, but as a developer who has spent eight years working with an SDK that brings the physical and digital worlds into the same space you inhabit, it bothers me. In an advertisement for Google Home, a little boy asks Google to show him a star system “on the TV.” Knowing modern families, they likely have more than one TV, and in reality Google would have responded with the dreaded “which TV?” and then, slowly and methodically, listed off every TV in the house. I want that little boy to point at the TV and say “that one.”

The software behind my home’s automation is still missing two critical pieces of information that we as humans learn before we can talk: space and time. Let’s talk about space first. A child, before she can say “car,” can point to one when asked. Conversely, without understanding the concept of a question, she can point to something in the room and expect an adult to tell her what it is. My husband is excited for a digital assistant so he can ask it questions about the world with our son, but it won’t be able to answer the very first question my son is capable of asking with only a gesture: “What is that?”

Pointing is an innate and primary form of learning. Mattie’s son Grayson points to objects to learn about the world around him.

Oblong’s g-speak SDK builds real-world location, size, and orientation into the core of every object, which allows us to calculate the “where.” Looking back at the Google Home example, if the system knows where the boy is, the direction he’s pointing, and where all the TVs are located in the house, then it can compute which television is intersected by the ray extending from his pointing finger. In g-speak, we call this intersection calculation a “WhackCheck.” Voilà, we know where “there” is.
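
To make that concrete, here’s a minimal sketch of that kind of intersection test, written in Python rather than g-speak’s native C++, and not using the actual g-speak API. It treats each TV as an axis-aligned box in room coordinates and runs the standard slab-method ray/box test; the names and coordinates below are invented purely for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Ray:
    origin: tuple      # (x, y, z) of the pointing hand
    direction: tuple   # unit vector along the pointing finger

@dataclass
class Box:
    name: str
    lo: tuple          # minimum corner (x, y, z) of the object's bounds
    hi: tuple          # maximum corner (x, y, z)

def whack_check(ray, boxes):
    """Return the nearest box the ray hits, or None.

    A plain slab-method ray/AABB test -- a stand-in for g-speak's own
    intersection machinery, not its actual API.
    """
    best, best_t = None, math.inf
    for box in boxes:
        t_near, t_far, hit = -math.inf, math.inf, True
        for o, d, lo, hi in zip(ray.origin, ray.direction, box.lo, box.hi):
            if abs(d) < 1e-9:                 # ray parallel to this slab...
                if not (lo <= o <= hi):       # ...and outside it: a miss
                    hit = False
                    break
                continue
            t1, t2 = (lo - o) / d, (hi - o) / d
            t_near = max(t_near, min(t1, t2))
            t_far = min(t_far, max(t1, t2))
        if hit and t_near <= t_far and t_far >= 0:
            t = t_near if t_near >= 0 else t_far   # ray may start inside a box
            if t < best_t:
                best, best_t = box, t
    return best

# The boy stands near the origin and points down the +x axis toward one TV.
tvs = [
    Box("living-room TV", lo=(2.9, 1.0, -0.5), hi=(3.0, 1.6, 0.5)),
    Box("bedroom TV",     lo=(-0.5, 1.0, 3.9), hi=(0.5, 1.6, 4.0)),
]
pointing = Ray(origin=(0.0, 1.2, 0.0), direction=(1.0, 0.0, 0.0))
hit = whack_check(pointing, tvs)
print(hit.name if hit else "no TV that way")   # -> living-room TV
```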

But what about “that”? That’s where time comes in. Yes, you can ask Siri® to set an alarm for a given time or remind you to take the turkey out of the oven in 4 hours, but there’s a subtle issue with time in the request to “put that there.” Siri® needs to know the mapping between time and space to determine where you were pointing when you said “that” versus where you were pointing when you said “there.”
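
One way to picture the time half of the problem: keep a short buffer of time-stamped pointing rays, and once you know when each word was spoken, look up the ray recorded nearest that moment. The sketch below assumes the gesture stream and the speech recognizer share a clock; PointingHistory and ray_at are made-up names for illustration, and the comments lean on the toy whack_check above.

```python
import bisect

class PointingHistory:
    """A short buffer of time-stamped pointing rays.

    Timestamps are seconds on a clock shared with the speech recognizer.
    PointingHistory and ray_at are illustrative names, not a real SDK API.
    """
    def __init__(self):
        self._times = []   # stays sorted: samples arrive in time order
        self._rays = []

    def record(self, t, ray):
        """Store the pointing ray observed at time t."""
        self._times.append(t)
        self._rays.append(ray)

    def ray_at(self, t):
        """Return the recorded ray closest in time to t."""
        i = bisect.bisect_left(self._times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self._times)]
        best = min(candidates, key=lambda j: abs(self._times[j] - t))
        return self._rays[best]

# If the recognizer reports that "that" was spoken at t = 2.4 s and "there"
# at t = 3.8 s, each word resolves against the gesture stream:
#   source = whack_check(history.ray_at(2.4), movable_objects)
#   target = whack_check(history.ray_at(3.8), surfaces)
```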

Our version of Put That There using g-speak, the Google Cloud Speech API, and a Leap Motion device.

This past summer, we lovingly threw together a re-enactment of Chris Schmandt’s 1979 “Put That There” demo using g-speak, the Google Cloud Speech API, and a Leap Motion device. Ignoring issues with false hand- and voice-recognition data, the only challenge we needed to “hack” around was that most of the current cloud-based speech-to-text services fail to provide timing information for specific words. You stream or upload voice audio and receive back the text, but there is no direct way to determine where the user was pointing when they said “that” without knowing when they said it. Time takes a bit of a back seat in today’s public SDKs. In g-speak, we place time on equal footing with space, so every event, action, and reaction is associated with a time and can be played back.
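
To give a flavor of the kind of workaround that gap invites (not necessarily the one we used), here’s a toy sketch: if the service hands back only a transcript but you know how long the clip is, you can guess each word’s timestamp from its position in the text, assuming a roughly even speaking pace. The function name and numbers are hypothetical.

```python
def estimate_word_times(transcript, audio_duration_s):
    """Guess when each word was spoken in a clip of known length.

    Assumes a roughly even speaking pace and assigns each word a timestamp
    proportional to the character offset of its midpoint. A deliberately
    crude illustration of one possible workaround.
    """
    times = {}
    offset = 0
    for word in transcript.split():
        start = transcript.index(word, offset)
        midpoint = start + len(word) / 2
        times.setdefault(word, []).append(
            audio_duration_s * midpoint / len(transcript))
        offset = start + len(word)
    return times

# times = estimate_word_times("put that there", 1.5)
# times["that"][0]   # ~0.6 s into the utterance
# times["there"][0]  # ~1.2 s
# ...then look up the pointing ray recorded nearest each estimate.
```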

Maybe building my own Jarvis isn’t such a bad idea. We have devices that can scan a room and create a 3D model. We have increasingly better voice-recognition software and AI to understand what our questions mean. We have hand-tracking devices that work decently well in indoor lighting. And we have g-speak to bring it all together. It’s on, Zuckerberg.