g-speak in Slices
A capsule history of modern computing might look something like this:
- the batch processing age.
- development of interactive systems: teletypes; time-sharing; terminals. the rise of the command line.
- transition to graphical user interfaces. character-mode and bitmapped applications. the personal computer. the spread and standardization of mouse-driven, two-dimensional window systems.
- networks. tcp/ip and udp in use almost everywhere.
That's a history from the perspective of adoption, organized according to the broad availability of a technology (rather than its first appearance in a lab). Each of the developments on the list is enabled by two things: new capabilities of computing hardware; and large amounts of software written to take advantage of those capabilities.
Both the spread of new hardware and the construction of a software ecosystem take time. Technological transitions are moderated by the installed base of hardware and software and mediated by users' expectations and programmers' assumptions.
The Macintosh, in 1984, introduced to the mass market a windows-and-mouse interface standard. This interface has evolved incrementally, but it is fair to say that a version of it continues to ship on every personal computer sold today. Internet connectivity has vastly expanded the capabilities of programs running within the GUI environment, but the modes of interaction are pretty much the same as those provided by the original Macintosh.
Today's computers, however, are capable of a much broader interactivity.
We've constructed a platform—hardware and software—for building and using applications that combine
- finely calibrated, free-hand gestural input and output
- multi-core, multi-screen and multi-machine application programming libraries and patterns
- real-space representation of all input objects and on-screen constructs
We call our platform g-speak. It is, in many ways, a new thing. Reasoning about new things is often easier if they are categorized, and if the category is named. So we call g-speak a spatial operating environment.
A spatial operating environment provides abstractions that help programmers to write applications that use gestural input, that function well on large screens and for simultaneous users, that work across multiple computers and screens, and that can be built from loosely-coupled small programs.
We built g-speak initially for ourselves. The ideas and various partial implementations date back a decade and a half. Early work took place at the MIT Media Lab. We formed Oblong Industries in 2006, with the goal of building g-speak out as a broadly useful platform.
Some of the big itches we have tried to scratch, over the years and up to the present, include
- Multi-user applications are hard to build. GUI toolkits assume a single pointing device.
- Similarly, interfaces that include more than one input device, tangible objects, gestural input, network control streams and other non-pointer input are difficult to build.
- Navigation through three-dimensional data spaces is always a kludge. People who play a lot of first-person video games get good at it. So do CAD aficionados. But even the most practiced users of mice and game-pads are always working against the indirection of their input devices.
- The mouse is a fairly constrained (and constraining) physical object. It has to sit on a surface (or be a surface, in the case of a trackpad). Heavy mouse use often leads to repetitive strain injuries. The mapping between the two axes of mouse motion and the two axes of on-screen pointer motion is indirect. It would be nice to bypass the mouse entirely and control pointer motion and position directly.
- On-screen layout in a traditional GUI environment is impoverished; the more pixels you have available, the more you feel this. Windows occlude other windows and there are almost no idioms for dynamic resize, rearrangement, decoration and redisplay of windows. GUI toolkits don't help you write multi-screen applications.
- On-screen feedback in a traditional GUI environment is sparse, too. As graphics cards have become more powerful, platforms have layered on animation and window decorations. But there are few provisions for consistent dynamic feedback about input state and options. The mouse mostly doesn't need this (cursor glyph changes and "tool tips" suffice). But more complicated input devices do.
- GUI applications aren't very network friendly. Even the venerable X Window System protocol has become less useful as toolkits have become more complicated. It's not easy to share input and control streams between computers in the same room. Protocols for coordinating fine-grained application state across multiple machines are not very mature (and no GUI toolkit addresses multi-machine operation).
- GUI applications aren't very composable. Toolkits support only monolithic styles of development: an application is a single process with one or a small number of graphics windows. Only one process can draw into a graphics window. It's almost impossible to write a set of small graphical applications that work together to perform complex tasks.
At a glance, this might seem like a grab-bag of complaints. But we've come to believe that a common thread runs through all of them.
We're dissatisfied with the contemporary application development stack because the tools and toolkits we use today fail to take advantage of the graphical capabilities, horsepower and interconnectedness of modern machines. Addressing any of our complaints fully will require addressing all of them.
Perhaps the best argument for this is to look for an analogy in the last major interface shift, the move from character-mode computing to the mouse-driven GUI.
Making the most of the mouse, more affordable RAM, and improved displays of the early 1980s required rebuilding the personal computer from the ground up. The Macintosh team designed new hardware, knowing from the beginning that they were building a graphical operating system. The hardware work, os-level development, application toolkits and user interface design all happened as part of a single, unified effort. That effort produced the GUI that defined the next 25 years of computing. Other projects (at both Apple and other companies) took a more piecemeal approach and produced less impressive systems.
It's time for another redesign. We probably don't have to build our own motherboards this time around. And the unix-descended operating system kernels we use are pretty good. But there's a lot of work to do in the software stack that sits on top of the kernel. We want to redress the conversational imbalance that we have with our computers—great graphical output but very limited user input. We want all our applications to make use of all our computers. We want a common interface for all the screens we use every day: laptop and desktop computers, televisions, the nav systems in our car dashboards.
We started out building a framework for gestural I/O. (Though started, here, is a slippery idea. Our work on all of this goes back fifteen years or so.)
From the beginning we knew that brand new input and output languages would have to be designed at the same time; a richer control interface requires a new on-screen graphical vocabulary.
Again, this is most obvious when looking back at previous work. The PARC and Apple folks building new software around the mouse developed scrollbars, overlapping windows, pull-down menus, title bars, desktop icons and most of the other UI features we take for granted today.
To an applications programmer, the shift to gestural input is as big as the shift to the mouse was twenty-five years ago. It's both exciting and a little daunting. The g-speak input framework allows direct, either-handed, multi-person manipulation of any pixel on any screen you can see. Pointing is millimeter-accurate from across the room. Hand pose, angle, and position in space are all available at 100 hertz, with no perceptible latency and to sub-millimeter precision.
With all of this articulation available, it makes little sense to use GUI components that were built for the two-axis, three-button mouse.
The g-speak platform relies on a new system of on-screen primitives. There's more work to do, of course: the breadth of possible gestural interactions exceeds what we could do with earlier input devices.
A fully gestural interface also requires a new geometric grounding. Contemporary GUI toolkits all bottom out in pixels. But gestural input is spatial. Pointing is literal and three-dimensional. Programs draw on multiple screens.
So we don't think in pixels when writing g-speak applications. We think in millimeters, and in z as well as x and y. This takes some getting used to, but it's liberating. Even two-dimensional program interfaces—and most of the interfaces we write are still fundamentally two-dimensional—live in real-space with their users.
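To make the idea concrete, here is a minimal sketch (not the g-speak API; all names are illustrative) of what it means to describe a screen and its pixels in real-space millimeters, with z as well as x and y:

```python
# Illustrative sketch only: describing a display in room coordinates
# (millimeters) so that any pixel has a literal position in space.
from dataclasses import dataclass

@dataclass
class Screen:
    origin_mm: tuple      # center of the screen in room coordinates (x, y, z)
    width_mm: float       # physical width of the display surface
    height_mm: float      # physical height of the display surface
    px_wide: int
    px_high: int

    def pixel_to_room(self, px, py):
        """Map a pixel coordinate to a millimeter position in the room."""
        x = self.origin_mm[0] + (px / self.px_wide - 0.5) * self.width_mm
        y = self.origin_mm[1] + (0.5 - py / self.px_high) * self.height_mm
        return (x, y, self.origin_mm[2])

# A 1920x1080 display, 1200 mm wide, centered 1500 mm above the floor
# and 3000 mm from the room origin along z.
wall = Screen(origin_mm=(0.0, 1500.0, 3000.0),
              width_mm=1200.0, height_mm=675.0,
              px_wide=1920, px_high=1080)
center = wall.pixel_to_room(960, 540)   # → (0.0, 1500.0, 3000.0)
```

Once every screen carries a transform like this, "pointing at a pixel" and "pointing at a spot in the room" become the same computation.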
Gesture, graphics and spatial representation form a tight triad. And we pulled in two more building blocks from our earlier work.
First, we needed fine-grained and pervasive coordination between machines.
One of the really nifty things about gestural input is that from the user's perspective there is no input device at all. Hands work more or less as expected when pointing at pixels, both proprioceptively and socially. In particular, it's possible to point at and interact with many screens at once, which implies interaction with many computers at once.
There are practical limitations to how many screens can be driven by a single computer. And some screens—laptops, televisions—come bundled with a computer of their own. So we rebuilt the g-speak input pipeline around a network layer that makes it easy for lots of computers to join up and use the gestural event stream. This approach evolved into a style of programming that links processes together tightly but flexibly. An application, from a user's perspective, is just a collection of coherent programmatic interactions. There's no good reason to restrict the collection of interactions to a single computer. Computations—at the very least the interactive components of computation—need to move around the network.
To describe things using a venerable and useful vocabulary, we now routinely move models, views, and controllers around between machines. We are able to write applications that assume they will run on a collection of computers, rather than a single box.
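A rough sketch may help here. This is not Oblong's actual pool implementation; it borrows the protein vocabulary of descrips (what kind of event) and ingests (the payload) to show the shape of the idea: processes deposit events into a named, append-only stream, and any number of readers, on any machine, can follow it.

```python
# Illustrative sketch of a "pool": an append-only event stream that
# many processes can deposit into and read from. Real pools are
# networked and persistent; this toy version is in-process only.
import threading

class Pool:
    def __init__(self):
        self._events = []
        self._lock = threading.Lock()

    def deposit(self, descrips, ingests):
        # descrips: labels describing the event; ingests: its data payload
        with self._lock:
            self._events.append({"descrips": descrips, "ingests": ingests})

    def read_from(self, index):
        # Readers keep their own index, so each consumes at its own pace.
        with self._lock:
            return self._events[index:]

# One process deposits gestural events...
pool = Pool()
pool.deposit(["pointing"],
             {"hand": "right", "aim_mm": (120.0, 1480.0, 2990.0)})
# ...and any other interested process can consume the stream.
events = pool.read_from(0)
```

The essential property is that producers and consumers never address each other directly; they share only the stream, which is what lets models, views, and controllers migrate between machines.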
The unix environment is a conceptual touchstone for us, here. Unix provides two fundamental abstractions that together offer a lot of leverage: every resource looks like a file and every process has standard input, output and error file descriptors. These two ideas enable programmers to design small, flexible utilities that combine to perform big, complicated tasks. Contemporary GUI toolkits aren't built on any such fundamental abstractions and don't enable small pieces loosely joined (to borrow and repurpose David Weinberger's nice phrase). As unix-steeped hackers, we've always felt this to be a great loss. And we're keen to go back to the future, to create a new pipes-and-files foundation for the graphical world, to develop late-binding, duck-typing, defadvice-like idioms for applications programming.
We design on top of a clean and increasingly scalable multi-participant data interchange mechanism. And we can cooperatively render from multiple processes (including across the network). We'll write more about all of this in future posts.
It's worth noting, before moving on, that there are benefits to our network-soluble programming approach even when only one machine is used at a time. Many of our cross-machine parallelisms are equally useful in a multi-core context.
The final fundamental component we turned out to need was time. Of course, you get time for free from the real world (unless you're programming in a purely functional language), but not in a particularly useful form.
Most of today's GUI applications are relatively static. They respond to user input, which is an intermittent and low-frequency event stream. They probably set up a few loop or idle timers, too, and do some light processing when those fire.
Our g-speak applications, for a variety of reasons, tend to have more going on.
Making effective use of a lot of pixels—pixels spread across multiple screens—requires careful application of motion, transition, animation and scaling. And because gestural input opens up new possibilities for visualizing complex data sets, we tend to have a great deal of information on screen at a time, so we're always juggling graphics card and processor resources. The gestural input pipeline carries more data than, for example, mouse and keyboard event streams. Finally, because g-speak applications are often built out of multiple coordinated processes (possibly running on different machines), there's a lot of data flying around asynchronously.
We have two major mechanisms for dealing with all of this: soft values and buffered data stores.
The soft values framework provides a standard way to set up and use time-dependent member variables. We use softs extensively in graphical classes. Standard soft varieties wrap functions that converge in linear and asymptotic fashions, cyclical repeaters like sine functions, and softs that are built from or facilitate combinations of other softs. Floats, colors and vects can all be soft. New soft classes are easily derivable.
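A toy version of the asymptotic variety, in a few lines (the class name and API here are assumptions for illustration, not g-speak's actual interface): on each update the value closes a fixed fraction of the gap to its target, so changes ease in smoothly instead of jumping.

```python
# Illustrative sketch of an asymptotically converging "soft" value:
# a member variable that approaches its target over time.
class SoftFloat:
    def __init__(self, value, rate=0.5):
        self.value = value     # current, displayed value
        self.target = value    # where we're headed
        self.rate = rate       # fraction of the remaining gap closed per update

    def set(self, target):
        self.target = target

    def update(self):
        # Asymptotic convergence: each tick covers half (by default)
        # of whatever distance remains.
        self.value += (self.target - self.value) * self.rate
        return self.value

opacity = SoftFloat(0.0)
opacity.set(1.0)
for _ in range(4):
    opacity.update()   # value passes through 0.5, 0.75, 0.875, 0.9375
```

A linear soft would move by a fixed step per tick instead; a cyclical soft would wrap a sine function. The point of the framework is that graphical code reads and assigns these like ordinary members and gets the motion for free.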
The network framework described above is constructed on top of a standard, flexibly sized buffer implementation. All data flowing through a normal g-speak application is rewindable. Most of the time we design applications so that every data pool is big enough to buffer a few seconds of throughput. It's also common to use much bigger pools. Video and audio, for example, are obvious candidates for more buffering.
These rewindable buffers take up disk space. And depending on access patterns and the quality of the kernel's virtual memory system they can eat RAM, too. But having temporally ordered, backwards-accessible data pervasively available across processes makes it possible to build new kinds of applications.
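The core of a rewindable store can be sketched very simply (again, illustrative names, not the actual g-speak implementation): a fixed-capacity buffer in which deposits stay temporally ordered and backwards-accessible until capacity evicts the oldest ones.

```python
# Illustrative sketch of a rewindable, fixed-capacity data store.
from collections import deque

class RewindableBuffer:
    def __init__(self, capacity):
        # Oldest entries fall off automatically once capacity is reached.
        self._buf = deque(maxlen=capacity)

    def append(self, item):
        self._buf.append(item)

    def rewind(self, n):
        """Return the most recent n items, oldest first."""
        n = min(n, len(self._buf))
        return list(self._buf)[len(self._buf) - n:]

buf = RewindableBuffer(capacity=3)
for frame in ["f1", "f2", "f3", "f4"]:
    buf.append(frame)
recent = buf.rewind(2)   # → ["f3", "f4"]  ("f1" has been evicted)
```

Size the capacity to a few seconds of throughput and any process can reach back into the recent past of a stream without the producer having planned for it.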
Programming is in large part defined by the exercise of identifying, extracting, polishing and creating clean abstractions for patterns that start to appear over and over. At Oblong, we found ourselves writing a lot of code that was really just book-keeping for on-screen graphics and for plumbing-level data caching. We've pushed much of that down into the platform layer.
Building g-speak is a design exercise at three levels. Most obviously, there is a new graphical computing environment—a new look and feel, in our industry's argot. Those graphics are inseparable from an architecture that motivates and produces them. Finally, we design and use applications that run on top of this foundation.
Working with Watson
The goal of each Watson Experience Center—located in New York, San Francisco, and Cambridge—is to demystify AI and challenge visitors' expectations through more tangible demonstrations of Watson technology. Visitors are guided through a series of narratives and data interfaces, each grounded in IBM's current capabilities in machine learning and AI. These sit alongside a host of Mezzanine rooms where participants collaborate to build solutions together.
The process for creating each experience begins with dynamic, collaborative research. Subject matter experts take members of the design and engineering teams through real-world scenarios—disaster response, financial crimes investigation, oil and gas management, product research, world news analysis—where we identify and test applicable data sets. From there, we move our ideas quickly to scale.
Accessibility to the immersive pixel canvas for everyone involved is key to the process. Designers must be able to see their ideas outside the confines of 15″ laptops and prescriptive software. Using tools tuned for rapid iteration at scale, our team of designers, data artists, and engineers works side-by-side to envision and define each experience. The result is more than a polished marketing narrative; it's an active interface that allows the exploration of data with accurate demonstrations of Watson's capabilities—one that customers can see themselves in.
Under the Hood
Underlying the digital canvas is a robust spatial operating environment, g‑speak, which allows our team to position real data in a true spatial context. Every data point within the system, and even the UI itself, is defined in real world coordinates (measured in millimeters, not pixels). Gestures, directional pointing, and proximity to screens help us create interfaces that more closely understand user intent and more effectively humanize the UI.
This award-nominated collaboration with IBM is prototyped and developed at scale at Oblong’s headquarters in Los Angeles as well as IBM’s Immersive AI Lab in Austin. While these spaces are typically invite-only, IBM is increasingly open to sharing the content and the unique design ideas that drive its success with the public. This November, during Austin Design Week, IBM will host a tour of their Watson Immersive AI Lab, including live demonstrations of the work and a Q&A session with leaders from the creative team.
Can't make it to Austin? Contact our Solutions team for a glimpse of our vision of the future at our headquarters in the Arts District in Los Angeles.