Automatic Image Captioning system created by Microsoft Research interns

Sean Cameron

Microsoft Research Automatic Captioning technique

The field of machine intelligence is notoriously difficult to either research or quantify. What one scientific discipline defines as intelligence can differ significantly from another's definition, so what counts as measurable progress varies enormously. For machine or ‘artificial’ intelligence there are numerous tests available, each of which purports to be the most accurate measure of ‘intelligence’.

These range from the famous Turing test, which asks whether a machine can hold a natural-language conversation well enough that a human judge cannot tell it apart from a person, to the less well known BLEU and METEOR metrics, which score machine-generated text by comparing it with human-written reference descriptions. Each has significant weaknesses, however: a machine that projects a convincing enough illusion of intelligence can pass the Turing test, while BLEU and METEOR are accused of being too narrow in their focus.
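To make the BLEU criticism concrete, here is a minimal sketch of a simplified sentence-level BLEU score, assuming the textbook formulation (clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty) with no smoothing. It is not the exact scorer used by the Microsoft team or by any official evaluation server, and the function names `ngrams` and `sentence_bleu` are hypothetical. It shows that BLEU only rewards word-sequence overlap with reference captions; it has no notion of whether a description is detailed or insightful.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU (hypothetical helper, no smoothing).

    candidate:  list of tokens produced by the captioning system
    references: list of token lists written by humans
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate too short to contain any n-gram of this order
        # Clip each candidate n-gram count by its highest count in any single reference,
        # so repeating a matching word cannot inflate the score.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: compare against the reference length closest to the candidate's.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical reference caption, for illustration only.
    refs = ["a man riding a wave on top of a surfboard".split()]
    print(sentence_bleu("a man riding a wave on top of a surfboard".split(), refs))
    print(sentence_bleu("a man riding a wave".split(), refs))
```

In this sketch, a caption identical to the reference scores 1.0, while the shorter "a man riding a wave" keeps perfect n-gram precision and loses points only through the brevity penalty (about 0.37). Nothing in the formula measures how much of the scene a caption actually describes.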

Despite these niggles and quibbles, the field of machine intelligence has made great strides in the last decade, and this seems set to continue. Recent work by a group of Microsoft Research interns is of particular interest: working together over the summer, the team of twelve interns and researchers created an automatic image captioning system.


This achievement is all the more remarkable given the field in which it was made. Teaching a machine to recognize what is in an image is one thing; creating a program that can interpret an image as raw binary data and then ‘translate’ what it finds into text a human can read, let alone make sense of, is something else altogether. That is exactly what the team managed, and it marks another important stepping stone along the road to ‘true’ machine intelligence.

That isn’t to say the program is without its quirks. When pitted against a human at describing an image, the program frequently provided less detail than the human, even though it outscored the human on the BLEU metric and achieved a similar score on METEOR.

Deputy Managing Director John Platt notes:

This type of collective progress is just awesome to see. Image captioning is a fascinating and important problem, and I would like to better understand the strengths and weaknesses of these approaches. (I note that several people used recurrent neural networks and/or LSTM models). As a field, if we can agree on standardized test sets (such as COCO), and standard metrics, we’ll continue to move closer to that goal creating a system that can automatically generate descriptive captions of an image as well as a human. The results from our work this summer and from others suggests we’re moving in the right direction.

With similar advances in facial recognition technology being made by Facebook, the field of ‘deep’ learning looks set to be an area of major growth for at least the next decade. What other advances will follow remains to be seen; what is certain is that, though this step is small, it is significant.