Microsoft Research developing machine intelligence that can evaluate images and provide answers

Mark Coppock

Microsoft Research is all about machine intelligence. There’s Cortana, of course, which is powered by Bing. There’s Project Oxford, which delves into all sorts of image and sound identification possibilities. And now Microsoft Research is working with Carnegie Mellon University to teach machines to analyze images more deeply and respond more like humans do.
The current work builds on previous efforts to automatically caption images, which involved a system that can recognize the elements in a scene and provide meaning by captioning an image in the same way a human might. That’s only a first step, however, because a caption describes a scene without explaining how to act on what it contains.

There’s a lot going on in this scene, from the dog in the basket to the obstacles in front of and around a rider.

The new system goes further, combining computer vision, deep learning, and language understanding to identify the elements of a scene and understand relationships between them. One example given is a system mounted on a bicycle that continuously evaluates the surroundings and asks questions that could be pertinent to a human rider.
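To make the idea concrete, here is a minimal, hypothetical sketch of a visual question answering pipeline along these lines, written in PyTorch. The architecture, dimensions, and names are illustrative assumptions, not Microsoft’s actual system: a convolutional image encoder, an LSTM question encoder, and a classifier over a fixed answer vocabulary.

```python
# Hypothetical VQA sketch: combine image and question features,
# then score a fixed set of candidate answers.
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=1000, dim=256):
        super().__init__()
        # Image encoder: a small CNN standing in for a pretrained network.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
        # Question encoder: word embeddings fed through an LSTM.
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        # Classifier over candidate answers.
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image)                        # (B, dim)
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q_feat = h[-1]                                    # (B, dim)
        fused = img_feat * q_feat                         # element-wise fusion
        return self.classifier(fused)                     # answer logits

model = SimpleVQA()
image = torch.randn(1, 3, 224, 224)          # one RGB image
question = torch.randint(0, 10000, (1, 8))   # eight token IDs
logits = model(image, question)
print(logits.argmax(dim=-1))                 # index of the predicted answer
```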

The system could power all kinds of applications, such as a warning system for bicyclists. With a mounted camera continuously taking in the environment around the cyclist, the system would keep asking itself questions such as, “What is on the left side behind me?” or “Are any other bikes going to pass me from the left?” or “Are there any runners close to me that I might not see?”
The answers could then be translated automatically into suggestions for the cyclist, such as directional recommendations to avoid accidents, and played back via a speech synthesizer.
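As an illustration only, a warning loop of this kind might be wired together as below. The answer_question function, speak function, and capture_frame callable are stand-ins for a VQA model, a speech synthesizer, and a camera feed; none of these names come from Microsoft’s system.

```python
# Hypothetical cyclist-warning loop: repeatedly ask a VQA model a fixed
# set of safety questions about the latest camera frame, and speak any
# actionable answers aloud.
import time

SAFETY_QUESTIONS = [
    "What is on the left side behind me?",
    "Are any other bikes going to pass me from the left?",
    "Are there any runners close to me that I might not see?",
]

def answer_question(frame, question):
    """Stand-in for the VQA model: returns a natural-language answer."""
    return "no"  # placeholder answer

def speak(text):
    """Stand-in for a speech synthesizer."""
    print(f"[spoken] {text}")

def warning_loop(capture_frame, interval=1.0):
    """Run forever, surfacing only answers that warrant a warning."""
    while True:
        frame = capture_frame()
        for question in SAFETY_QUESTIONS:
            answer = answer_question(frame, question)
            if answer != "no":  # only surface actionable answers
                speak(f"Heads up: {answer}")
        time.sleep(interval)
```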

Microsoft researchers Xiaodong He, Li Deng, and Jianfeng Gao, members of the team at the company’s Deep Learning Technology Center, have been working with Carnegie Mellon researchers Zichao Yang and Alex Smola to develop the technology. The challenge with the technology is duplicating the human brain’s ability to use language to describe visual information and act upon it.
The potential applications are nearly limitless, from medical diagnosis to driverless cars to drones. Any task where a computer system can evaluate a situation, describe it in language, and then answer questions in order to act on it logically is a candidate.

“We’re using deep learning in different stages: to extract visual information, to represent the meaning of the question in natural language, and to focus the attention onto narrower regions of the image in two separate steps in order to seek the precise answer,” says Deng.
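A rough sketch of that two-step attention idea follows; it is an interpretation of the quote, not the team’s published code, and the dimensions and names are assumptions. A query vector derived from the question attends over image region features twice, narrowing focus before the answer is predicted.

```python
# Hypothetical two-step attention: the question-derived query repeatedly
# attends over image regions, refining itself at each step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStepAttention(nn.Module):
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.attn = nn.ModuleList([nn.Linear(dim * 2, 1) for _ in range(2)])
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, regions, question):
        # regions: (B, R, dim) features for R image regions
        # question: (B, dim) encoding of the question
        query = question
        for layer in self.attn:
            # Score each region against the current query.
            expanded = query.unsqueeze(1).expand_as(regions)
            scores = layer(torch.cat([regions, expanded], dim=-1))  # (B, R, 1)
            weights = F.softmax(scores, dim=1)
            # The attended summary refines the query for the next step.
            query = query + (weights * regions).sum(dim=1)
        return self.classifier(query)

model = TwoStepAttention()
regions = torch.randn(2, 49, 256)   # e.g., a 7x7 grid of region features
question = torch.randn(2, 256)
logits = model(regions, question)   # (2, 1000) answer scores
```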

The system can answer questions like, “What stands between the two blue lounge chairs on an empty beach?”

The team has published a research paper that delves into the system, and it’s a fascinating read for anyone interested in machine learning and natural language processing. The technology is complex, relying on the kind of deep neural networks that likely power many of Microsoft’s machine intelligence projects.
Going forward, this is the kind of research that will result in significant advancements in medicine, engineering, driverless cars, and the like. At the same time, it’s also what will eventually give rise to the singularity and the subjugation of mankind to our artificially intelligent overlords.