The Turing Test is obsolete. It’s time to build a new barometer for AI

By Rohit Prasad

December 28, 2020

 

This year marks 70 years since Alan Turing published his paper introducing the concept of the Turing Test in response to the question, “Can machines think?” The test’s goal was to determine if a machine can exhibit conversational behavior indistinguishable from a human. Turing predicted that by the year 2000, an average human would have less than a 70% chance of distinguishing an AI from a human in an imitation game where who is responding—a human or an AI—is hidden from the evaluator.

Why haven’t we as an industry been able to achieve that goal, 20 years past that mark? I believe the goal put forth by Turing is not a useful one for AI scientists like myself to work toward. The Turing Test is fraught with limitations, some of which Turing himself debated in his seminal paper. With AI now ubiquitously integrated into our phones, cars, and homes, it’s become increasingly obvious that people care much more that their interactions with machines be useful, seamless and transparent—and that the concept of machines being indistinguishable from a human is out of touch. Therefore, it is time to retire the lore that has served as an inspiration for seven decades, and set a new challenge that inspires researchers and practitioners equally.

The Turing Test and the popular imagination

In the years that followed its introduction, the Turing Test served as the AI north star for academia. The earliest chatbots of the ’60s and ’70s, ELIZA and PARRY, were centered around passing the test. As recently as 2014, the chatbot Eugene Goostman was declared to have passed the Turing Test by convincing 33% of the judges that it was human. However, as others have pointed out, the bar of fooling 30% of judges is arbitrary, and even then the victory felt outdated to some.

Still, the Turing Test continues to drive popular imagination. OpenAI’s Generative Pre-trained Transformer 3 (GPT-3) language model has set off headlines about its potential to beat the Turing Test. Similarly, I’m still asked by journalists, business leaders, and other observers, “When will Alexa pass the Turing Test?” Certainly, the Turing Test is one way to measure Alexa’s intelligence—but is it consequential and relevant to measure Alexa’s intelligence that way?

To answer that question, let’s go back to when Turing first laid out his thesis. In 1950, the first commercial computer had yet to be sold, groundwork for fiber-optic cables wouldn’t be published for another four years, and the field of AI hadn’t been formally established—that would come in 1956. We now have 100,000 times more computing power on our phones than Apollo 11, and together with cloud computing and high-bandwidth connectivity, AIs can now make decisions based on huge amounts of data within seconds.

While Turing’s original vision continues to be inspiring, interpreting his test as the ultimate mark of AI’s progress is limited by the era when it was introduced. For one, the Turing Test all but discounts AI’s machine-like attributes of fast computation and information lookup, which are among modern AI’s most effective features. The emphasis on tricking humans means that for an AI to pass Turing’s test, it has to inject pauses in responses to questions like, “do you know what is the cube root of 3434756?” or, “how far is Seattle from Boston?” In reality, AI knows these answers instantaneously, and pausing to make its answers sound more human isn’t the best use of its skills. Moreover, the Turing Test doesn’t take into account AI’s increasing ability to use sensors to hear, see, and feel the outside world. Instead, it’s limited simply to text.

To make AI more useful today, these systems need to accomplish our everyday tasks efficiently. If you’re asking your AI assistant to turn off your garage lights, you aren’t looking to have a dialogue. Instead, you’d want it to fulfill that request and notify you with a simple acknowledgment, “ok” or “done.” Even when you engage in an extensive dialogue with an AI assistant on a trending topic or have a story read to your child, you’d still like to know it is an AI and not a human. In fact, “fooling” users by pretending to be human poses a real risk. Imagine the dystopian possibilities, as we’ve already begun to see with bots seeding misinformation and the emergence of deep fakes.

New meaningful challenges for AI

Instead of obsessing about making AIs indistinguishable from humans, we should aim to build AIs that augment human intelligence and improve our daily lives in a way that is equitable and inclusive. A worthy underlying goal is for AIs to exhibit human-like attributes of intelligence, including common sense, self-supervision, and language proficiency, and to combine them with machine-like efficiency such as fast searches, memory recall, and accomplishing tasks on your behalf. The end result is learning and completing a variety of tasks and adapting to novel situations, far beyond what a regular person can do.

This focus informs current research into areas of AI that truly matter: sensory understanding, conversing, broad and deep knowledge, efficient learning, reasoning for decision-making, and eliminating any inappropriate bias or prejudice (i.e. fairness). Progress in these areas can be measured in a variety of ways. One approach is to break a challenge into constituent tasks. For example, Kaggle’s “Abstraction and Reasoning Challenge” focuses on solving reasoning tasks the AI hasn’t seen before. Another approach is to design a large-scale real-world challenge for human-computer interaction such as the Alexa Prize Socialbot Grand Challenge, a competition focused on conversational AI for university students.

In fact, when we launched the Alexa Prize in 2016, we had intense debate on how the competing “socialbots” should be evaluated. Are we trying to convince people that the socialbot is a human, deploying a version of the Turing Test? Or, are we trying to make the AI worthy of conversing naturally to advance learning, provide entertainment, or just a welcome distraction?

We landed on a rubric that asks socialbots to converse coherently and engagingly for 20 minutes with humans on a wide range of popular topics including entertainment, sports, politics, and technology. During the development phases leading up to the finals, customers score the bots on whether they’d like to converse with the bots again. In the finals, independent human judges assess for coherency and naturalness and assign a score on a 5-point scale. If any of the socialbots converses for an average duration of 20 minutes and scores 4.0 or higher, then it will meet the grand challenge. While the grand challenge hasn’t been met yet, this methodology is guiding the development of AIs with human-like conversational abilities powered by deep learning-based neural methods. It prioritizes methods that allow AIs to exhibit humor and empathy where appropriate, all without pretending to be a human.
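
For readers who think in code, the grand challenge criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Alexa Prize's actual scoring code: the `FinalsConversation` record, the `meets_grand_challenge` function, and the choice to average each conversation's judge scores before averaging across conversations are all hypothetical. Only the two thresholds (20 minutes, 4.0 on a 5-point scale) come from the rubric.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

# Thresholds taken from the rubric described in the text.
TARGET_MINUTES = 20.0
TARGET_SCORE = 4.0


@dataclass
class FinalsConversation:
    """One judged finals conversation (hypothetical record layout)."""
    duration_minutes: float
    judge_scores: List[float]  # ratings on the 5-point scale, assumed non-empty


def meets_grand_challenge(conversations: List[FinalsConversation]) -> bool:
    """Check both averages against the targets.

    Assumption: scores are averaged per conversation first, then across
    conversations. The article does not specify the aggregation.
    """
    if not conversations:
        return False
    avg_duration = mean(c.duration_minutes for c in conversations)
    avg_score = mean(mean(c.judge_scores) for c in conversations)
    return avg_duration >= TARGET_MINUTES and avg_score >= TARGET_SCORE
```

Under these assumptions, a socialbot must clear both bars at once: long conversations with middling scores, or highly rated but short ones, would not meet the challenge.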

The broad adoption of AI like Alexa in our daily lives is another incredible opportunity to measure progress in AI. While these AI services depend on human-like conversational skills to complete both simple transactions (e.g. setting an alarm) and complex tasks (e.g. planning a weekend), to maximize utility they are going beyond conversational AI to “Ambient AI,” where the AI answers your requests when you need it, anticipates your needs, and fades into the background when you don’t. For example, Alexa can detect the sound of glass breaking, and alert you to take action. If you set an alarm while going to bed, it suggests turning off a connected light downstairs that’s been left on. Another aspect of such AIs is that they need to be experts in a large, ever-increasing number of tasks, which is only possible with more generalized learning capability instead of task-specific intelligence. Therefore, for the next decade and beyond, the utility of AI services, with their conversational and proactive assistance abilities on ambient devices, is a worthy test.

None of this is to denigrate Turing’s original vision. Turing’s “imitation game” was designed as a thought experiment, not as the ultimate test for useful AI. However, now is the time to retire the Turing Test as a benchmark and draw inspiration from Alan Turing’s bold vision to accelerate progress in building AIs that are designed to help humans.

 

Rohit Prasad is vice president and head scientist of Alexa at Amazon.
