About a year ago, when we were very much in stealth, we did a small user study of an early build of our app. Our goal was to test our capabilities in finding movies, which meant that the build only had movies. Segue to me sitting with a user who asked their very first query: “What is your name?” Lo and behold, the app responded:
“Your Name is a great anime from 2016”.
Which was not what they expected. While the movie "Your Name" may have been what they wanted if they had just said "Your Name," what they really expected was for our assistant to respond with its own name. Even if you tell users the scope of domains supported, they will assume that anything they ask outside that scope will be "recognized" for what it is: unsupported. Much as I would like to think otherwise, this assumption is not only natural, but fair. And so began the long quest of solving Conversational AI’s dirty secret: How does your model know when it shouldn’t respond?
One of the sad truths about modern AI (a misnomer, perhaps, as demonstrated by this article) is that most deep models are essentially what I like to call Generalizing Parrots. You can take a look at any recent paper where a tech giant spends a few million dollars training a model on huge amounts of data with an order of magnitude more parameters, and gets better results. At their core, these language models are shown a lot of (text) data with the goal of somehow teaching them to understand language. Intuitively, it would make sense that the larger the model and the more data it is shown, the more it will understand about language. But things may be getting out of hand. Take a look at this fun graph from our field from December 2019, showing the size of language models (where, in general, the larger the model, the better the performance).
NVIDIA really showed the new limits of scaling there, right? And yet just a few months later, in February 2020, Microsoft required everyone to rescale the axis of the chart:
This new, new upper limit on model size now looks minuscule compared to what OpenAI released with GPT-3, which has 10x the number of parameters.
While I won’t get into just what these models enable (GPT-3 in particular is fascinating), it still remains a very open question (in my opinion) whether these models are actually learning language or are just really good at cheating. This quote from Scott Alexander sums up my thoughts on GPT-3 rather well:
Speaking of which – can anything based on GPT-like principles ever produce superintelligent output? How would this happen? If it’s trying to mimic what a human can write, then no matter how intelligent it is “under the hood”, all that intelligence will only get applied to becoming better and better at predicting what kind of dumb stuff a normal-intelligence human would say. In a sense, solving the Theory of Everything would be a failure at its primary task. No human writer would end the sentence “the Theory of Everything is…” with anything other than “currently unknown and very hard to figure out”.
Which brings me back to my original point: the model is always going to respond to its prompt, even if it makes no sense, and Conversational AI is no different. A naively trained model, even if it gets 100% on all of your metrics, will completely fail to handle a query outside of its domain of expertise. This conundrum is known as the “Out Of Domain” problem.
Introducing the OOD Problem
Informally stated, the Out Of Domain Problem (OOD) is the question of determining whether input is In Domain (can be handled somehow) or Out Of Domain (in which case the model should respond that it cannot respond). For a human this may seem easy: if you spoke to me in Aramaic, I would tell you I have no idea what you are saying, no matter what you said. A parrot would instead think that it sounds an awful lot like X in English, and would consequently respond Y in English. Just how non-trivial this problem is can be seen in the fact that Google only officially rolled out this feature in April of 2020.
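To make the problem statement concrete, here is a minimal sketch of the most naive way to give a classifier an "I cannot respond" option: reject whenever the top softmax confidence falls below a threshold. This is a common baseline and not the approach proposed in this series. The intent names, the `OUT_OF_DOMAIN` label, and the 0.7 threshold are all assumptions made up for illustration.

```python
import math

OUT_OF_DOMAIN = "OUT_OF_DOMAIN"  # hypothetical label for "I cannot respond"


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_rejection(intent_scores, threshold=0.7):
    """Return (label, confidence) for a dict of intent -> raw score.

    If the most likely intent is not confident enough, the query is
    treated as Out Of Domain. The threshold value is an assumption.
    """
    labels = list(intent_scores)
    probs = softmax([intent_scores[label] for label in labels])
    best = max(range(len(labels)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return OUT_OF_DOMAIN, probs[best]
    return labels[best], probs[best]


# Hypothetical scores from a movie-only model.
print(classify_with_rejection({"find_movie": 5.0, "play_trailer": 1.0}))
print(classify_with_rejection({"find_movie": 1.0, "play_trailer": 0.9}))
```

The catch is exactly the parrot behavior described above: a model that has only ever seen movie queries can be very confident that “What is your name?” is a request for a movie, so a confidence threshold alone may never fire on the queries that matter most.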
In the literature this problem is still very much unsolved. A baseline paper from 2019 on the “out of scope problem” (as they defined it) performed an initial survey of possible methods for solving it. Their findings? A recall on out-of-scope data of around 50%. Hardly solved. While I agree that baseline papers presenting a problem and a sample dataset are typically meant to encourage far superior results, it is rare for the eventual results to be at least 30% (absolute) better. Indeed, nobody has published a paper where they achieve passable performance on that dataset.
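For readers unfamiliar with the number quoted above, recall on out-of-scope data is the fraction of truly out-of-scope queries that the model actually flagged as such. A minimal sketch of that computation follows; the function name, the label string, and the toy data are hypothetical and are not taken from the paper.

```python
def out_of_scope_recall(gold, predicted, oos_label="OUT_OF_DOMAIN"):
    """Fraction of gold out-of-scope queries that were predicted out-of-scope."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    oos_total = sum(1 for g in gold if g == oos_label)
    if oos_total == 0:
        raise ValueError("no out-of-scope examples in gold labels")
    caught = sum(
        1 for g, p in zip(gold, predicted) if g == oos_label and p == oos_label
    )
    return caught / oos_total


# Toy example: four out-of-scope queries, only two of them caught.
gold = ["OUT_OF_DOMAIN", "OUT_OF_DOMAIN", "find_movie", "OUT_OF_DOMAIN", "OUT_OF_DOMAIN"]
pred = ["OUT_OF_DOMAIN", "find_movie", "find_movie", "find_movie", "OUT_OF_DOMAIN"]
print(out_of_scope_recall(gold, pred))  # 0.5
```

A recall of 0.5 means the assistant confidently answers half of the questions it should have declined, which is the “Your Name” experience all over again.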
And yet, it is possible to do much better. Not only that, it is possible to solve this problem in a much more useful way: when you don’t have a predefined dataset of “in domain.” When you want to be able to constantly expand the set of allowed domains. When you want to allow others (who don’t have access to the raw model) to define expansions to the supported domains. When you want such an expansion to not require massive dataset inclusions, or any at all! In this series of blog posts we will present a proposed approach to this problem that not only accomplishes the previously stated goals, but also lays the groundwork for much richer voice assistant collaborations. As a teaser: if you can teach a voice assistant to recognize that it doesn’t know something, it can then ask other voice assistants whether they can handle the task, all without explicitly predefining what each supports!
 The best metric to be used to measure this problem is debatable, but that will be a topic for a later blog post.
 I say approach because this is still very much open research that we have yet to deploy to customers, but it has demonstrated substantial improvements on the published datasets and on our own (vastly larger) test sets. This is an apples-to-apples comparison, which does not take into account the new functionality that it enables.