Google’s latest auto-captioning experiment and its deep fascination with artificial intelligence

Picture from Google's blog
Describing an image well is hard even for humans, so Google’s latest venture into auto-captioning is ground-breaking in its technical execution. But it also speaks to a deeper ethos within the company and its curiosity about artificial intelligence and robotics.

Try to caption an image of some pizza. How would you describe it? Where it is, what it is on, how many pizzas there are? There are many different ways to caption such an image, and Google is attempting to find a way to do it reliably using machines.

On Monday this week (November 17th), Google’s research blog described an experimental process for increasing the accuracy with which a computer translates an image into text.

It’s a process that Google says could “eventually help visually impaired people understand photos, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images”.

Though highly technical, the auto-captioning system takes advantage of recent developments in machine learning (identifying cat faces, of all things) and text association in speech recognition.

It’s a step forward in the creation of artificial intelligence. Human communication, in terms of language, operates on a vastly different level from that of machines. While language acts as a representation of reality for most of humanity (at least according to Austrian philosopher Ludwig Wittgenstein), software language is literally reality for machines.

Bridging that gap means translating an image from something a computer can understand into something people can understand, and the process Google has developed is scoring well in quantitative evaluations.

Machine learning

Yet while the technology is a breakthrough in itself, it’s the implications of what it can be used for that are the more concerning part. Earlier this year, in October, Google acquired two Oxford University spin-off companies that specialise in machine learning and computer vision.

See how accurately the computer labels the photos without human help

The two companies are now part of Google’s DeepMind, which seeks to build a working AI system similar to those portrayed in films – a system that works like a human brain, making decisions free of human interaction.

This means the auto-captioning technology has potential beyond tagging and identifying the numerous photos uploaded onto Google’s Image Search database. It could also be expanded into part of a recognition system for a potential real-life version of HAL (the computer from the film 2001: A Space Odyssey).

And while Google’s intentions may not be inherently nefarious, à la “Don’t Be Evil”, an article in the Guardian from late 2013 sheds light on Google’s potential:

“What drives the Google founders is an acute understanding of the possibilities that long-term developments in information technology have deposited in mankind's lap. Computing power has been doubling every 18 months since 1956. Bandwidth has been tripling and electronic storage capacity has been quadrupling every year. Put those trends together and the only reasonable inference is that our assumptions about what networked machines can and cannot do need urgently to be updated.”