Understanding Language is the Next Frontier for AI

16 Feb 2021

In recent years we have seen a lot of breakthroughs in AI. We now have deep learning algorithms beating the best of the best in games like chess and Go. In computer vision these algorithms now recognize faces with the same accuracy as humans. Except that is not the whole story: they can do it for millions of faces, while humans struggle to recognize more than a few hundred people.

Board games are artificial tasks with only a few types of pieces and rules that are simple enough to teach to young children. It is no surprise that researchers chose them as a starting point in the early days of AI research.

In 1955 Arthur Samuel was able to build the first ‘reinforcement’ algorithm that could teach itself to play checkers. Simply by playing against itself, it was able to go from the level of a 3-year-old, moving pieces randomly on a board, to the point where even Samuel himself could no longer win.

The most recent breakthrough in this domain happened in 2016 when Google’s AlphaGo was able to beat Go world champion Lee Sedol. If Google had taken the same approach for Go as IBM did in the 90s for chess, it would have had to wait another 10 years before this was possible. IBM’s Deep Blue ran on a supercomputer that relied on brute force, searching through enormous numbers of moves and countermoves on the 8 by 8 chess board. With Moore’s law in mind, it would have taken until at least 2025 before we would have a supercomputer that could do the same for the 19 by 19 grid on the Go board.

What Google did instead was go back to reinforcement algorithms that could learn by playing against different versions of themselves. So even though AlphaGo could not compute every possible move on the board, it was able to see patterns that Lee Sedol couldn’t.

Because these board games have such simple rules, it was really easy for Google to set up a simulation environment where these algorithms could play against slightly different versions of themselves. In less than an hour they could gather more data to learn from than any professional player was able to acquire in a lifetime. A few weeks of self-play delivered AlphaGo more data to learn from than all Go players in history combined. It is this high volume of high-quality data that enabled AlphaGo to beat the world champion.
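To make the idea of self-play concrete, here is a minimal sketch in Python. It is not AlphaGo’s method, which combined deep neural networks with tree search. It is plain tabular Q-learning on a toy game picked purely for illustration: a pile of stones, each player takes 1 to 3 per turn, and whoever takes the last stone wins. The game, the function names and all parameter values are assumptions made for this example.

```python
import random

def train_self_play(pile=10, episodes=5000, alpha=0.5, epsilon=0.3, seed=0):
    """Learn a toy stone-taking game by letting one value table play both sides."""
    rng = random.Random(seed)
    # q[s][a]: estimated result (+1 win, -1 loss) for the player who faces
    # s stones and takes a of them.
    q = {s: {a: 0.0 for a in range(1, min(3, s) + 1)} for s in range(1, pile + 1)}
    for _ in range(episodes):
        s = rng.randint(1, pile)  # start each game from a random pile size
        while s > 0:
            moves = list(q[s])
            if rng.random() < epsilon:
                a = rng.choice(moves)             # explore: try a random move
            else:
                a = max(moves, key=q[s].get)      # exploit: best move so far
            nxt = s - a
            # Taking the last stone wins. Otherwise the opponent moves next,
            # so our value is the negative of their best value.
            target = 1.0 if nxt == 0 else -max(q[nxt].values())
            q[s][a] += alpha * (target - q[s][a])
            s = nxt
    return q

def best_move(q, stones):
    """Return the move with the highest learned value for this pile size."""
    return max(q[stones], key=q[stones].get)

q = train_self_play()
for stones in (5, 6, 7):
    print(stones, "stones -> take", best_move(q, stones))
```

Nobody tells the table how to win. After a few thousand games against itself it should rediscover the known trick for this game, which is to leave your opponent a multiple of four stones. The same loop would be hopeless for Go, where the number of positions is far too large for a table, which is why learned networks take its place.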

The reason we do not see Google’s algorithms controlling factories is that they require insane amounts of data to learn from. We cannot really afford to have our machines try out many different combinations of parameters to see which configurations give us the highest yield and revenue.

So, if we want AI to be better than us, we need, at the very least, vast amounts of high-quality labelled data. But there is more to it than that once we move to more complex tasks.

It took until 2015 before we were able to build an algorithm that could recognize faces with an accuracy that is comparable to humans. Facebook’s DeepFace is 97.4% accurate, just shy of the 97.5% human performance. For reference, the FBI algorithm used in Hollywood only reaches 85% accuracy, which means it is still wrong in more than 1 out of every 7 cases.

The FBI algorithm was handcrafted by a team of engineers. Each feature, like the size of a nose or the relative placement of the eyes, was manually programmed. The Facebook algorithm works with learned features instead. They used a special deep learning architecture called a convolutional neural network, which mimics the different layers in our visual cortex. Because we don’t know exactly how we see, the connections between these layers are learned by the algorithm.

The breakthrough Facebook pulled off here is actually twofold. Facebook not only built an algorithm that learns good features, it also gathered high-quality labelled data from the millions of users who were kind enough to tag their friends in the photos they shared.

Vision is a problem that evolution has solved in millions of different species ranging from the smallest insects to the weirdest sea creatures. If you look at it that way, language seems to be much more complex. As far as we know, we are currently the only species that communicates with a complex language.

So, for AI to learn human-level tasks, you need large quantities of high-quality labelled data and a smart architecture.

Less than a decade ago, AI algorithms only counted how often certain words occurred in a body of text to understand what it was about. But this approach clearly ignores the fact that words have synonyms and only mean anything within context.
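A rough sketch of that word-counting approach, often called bag-of-words, shows the problem. The two example sentences and the function name are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word occurs; word order is thrown away."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

a = bag_of_words("The film was great")
b = bag_of_words("The movie was excellent")

# Words the two sentences have in common according to the counts.
shared = set(a) & set(b)
print(sorted(shared))
```

To a human the two sentences say the same thing. To the counter they only share the filler words "the" and "was", because "film" and "movie" are simply two unrelated strings.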

It took until 2013 for Tomas Mikolov and his team at Google to discover how to create an architecture that can learn the meaning of words. Their word2vec algorithm mapped synonyms close to each other. It was able to model meaning like size, gender, speed, … and even learn functional relations like countries and their capitals.
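The sketch below illustrates what "modelling meaning" looks like once words are vectors. The vectors here are hand-made and three-dimensional, invented for this example; real word2vec vectors are learned from text and have hundreds of dimensions. Only the arithmetic is the point.

```python
import math

# Hypothetical toy vectors. The axes loosely stand for (royalty, male, female).
VECTORS = {
    "king":  (0.9, 0.9, 0.1),
    "queen": (0.9, 0.1, 0.9),
    "man":   (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.0, 0.5, 0.5),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))
```

Subtracting "man" from "king" and adding "woman" lands closest to "queen" in this toy space. The same kind of offset is what lets learned vectors relate a country to its capital.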

The missing piece, however, was context. It was 2018 before we saw a real breakthrough in this field when Google introduced the BERT model. Jacob Devlin and team were able to recycle an architecture typically used for machine translation so that it could learn the meaning of words in relation to their context in a sentence.

By teaching the model to fill in missing words they were able to embed language structure in the BERT model. With only limited data they could fine-tune BERT for a multitude of tasks ranging from finding the right answer to a question to really understanding what a sentence is about.
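The fill-in-the-blank idea needs no human labelling, because the text itself supplies the answer. The sketch below shows how such training examples can be built. It is a simplification: it hides one whole word at a time, whereas BERT hides a random selection of roughly 15% of sub-word tokens. The function name and the example sentence are assumptions for illustration.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Build (masked sentence, hidden word) pairs, hiding one word at a time."""
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples

examples = masked_examples("the cat sat on the mat")
for masked, answer in examples:
    print(masked, "->", answer)
```

To guess the hidden word, the model has to use the words on both sides of the gap. That is how context ends up inside the model.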

In 2019 researchers at Facebook were able to take this even further. They trained a BERT-like model on more than 100 languages simultaneously. It was able to learn a task in one language, for example English, and apply it to the same task in any of the other languages like Arabic, Chinese, Hindi and so on. This language-agnostic model has the same performance as BERT on the language it is trained on, and there is only a limited impact going from one language to another.

All these techniques are really impressive in their own right, but it took until early 2020 for researchers at Google to beat human performance on select academic tasks. Google pushed the BERT architecture to its limits by training a much larger network on even more data. This so-called T5 model now performs better than humans in labelling sentences and finding the right answers to a question. The language-agnostic mT5 model released in October 2020 is almost as good as bilingual humans at switching from one language to another, but on 100+ languages at once.

Now what is the catch here? Why aren’t we seeing these algorithms everywhere? Training the T5 model costs around $1.3 million in cloud compute. And although Google was kind enough to share these models, they are so big that they take ages to fine-tune for your own problem.

Since we are still more or less on track with Moore’s law, compute power will continue to double every 2 years. We will soon see human-level language understanding in real applications and in more than 100 different languages. But don’t expect to see these models in low-latency tasks in the near future.
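The doubling rule is easy to turn into back-of-the-envelope arithmetic. The sketch below assumes the 2-year doubling period used in this article; the function names are made up, and the 1000x figure in the last line is a hypothetical target, not a measurement.

```python
import math

DOUBLING_PERIOD_YEARS = 2  # doubling time assumed in the text

def growth_factor(years):
    """How many times more compute is available after the given number of years."""
    return 2 ** (years / DOUBLING_PERIOD_YEARS)

def years_to_reach(factor):
    """How many years of doubling it takes to multiply compute by the given factor."""
    return DOUBLING_PERIOD_YEARS * math.log2(factor)

print(growth_factor(10))     # five doublings in ten years
print(years_to_reach(1000))  # wait time for a hypothetical 1000x speed-up
```

Ten years buys five doublings, which is a factor of 32. A model that needs a thousand times more compute than you can afford today is therefore roughly two decades away if you simply wait for the hardware.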

If you want to try out the current state-of-the-art of language agnostic chatbots, why don’t you give by Sinch a try. On this bot platform you can build bots in your native tongue and they will understand any language you (or Google Translate) can.
