One of the fundamental building blocks of a voice assistant, much like a search engine, is the concept of ranking – given a set of equally true results, how do you determine the order? An example of such an equally true set would be seen if a user asked for "Tom Cruise Movies." Jerry Maguire and Mission Impossible are both movies, and both star Tom Cruise. And yet, for two different hypothetical users, which film is placed as #1 and which as #2 is the difference between a movie purchased and watched, and nothing at all. This doesn't even take into account the fact that you might have results 1-4 as Mission Impossible movies, and then off screen at #5 would be the romance movie that the user craves.
In the search world, this is sometimes solved by re-framing the task. Instead of trying to return the single best result, you instead seek to maximize the quality of the top-k results. This metric, often called Precision at K, measures what percentage of the top K (e.g. 10) results are "good." With this as the metric, many new tactics can be utilized – if you want to handle the scenario where K=5, then a ranking algorithm that picks a diverse set of results is likely to have at least one good result there. In building MeetKai, however, we set out to find a way that each person gets their best top result. To this end, we have created the Kai Score – a measurement between 0 and 100 that allows us to rank a particular result as first. On its own, this is hardly a unique concept – Netflix has their "Personalized Match Percent," after all. However, the two scores optimize for different goals. This is more easily understood in the context of the differences between the primary schools of thought behind personalized recommendations.
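To make the Precision at K metric mentioned above concrete, here is a minimal sketch. The function name `precision_at_k` and the sample data are illustrative assumptions, and "good" is simplified to membership in a set of relevant items.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Return the fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


# Hypothetical ranking for a user who is really after a romance movie.
ranked = [
    "Mission Impossible",
    "Mission Impossible 2",
    "Mission Impossible III",
    "Ghost Protocol",
    "Jerry Maguire",
]
relevant = {"Jerry Maguire"}

p5 = precision_at_k(ranked, relevant, 5)  # one hit in five results
p1 = precision_at_k(ranked, relevant, 1)  # the top result is a miss
```

With K=5 this ranking still earns some credit, while with K=1 it earns none, which is the gap between optimizing the top-k and optimizing the single top result.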
Explicit Feedback Model
In the Explicit Feedback Model, the recommendation system attempts to predict the rating a user would assign to an item based upon previous ratings they have assigned to other items. When Netflix held the Netflix Prize Competition in 2006, this was taken as a given. The design of that competition is still a good example of the intuition behind an explicit feedback model. The data that was presented to contestants came in the form of a list of grades (ratings from 1 to 5) assigned by users to different movies. The goal of the contest was to take users that had been seen in the training set and infer what grades they would assign to as yet unseen movies. On its face, this is the rational way to approach the problem. I think Scarface is an easy 90%, and also think that Goodfellas is up there as well – a model should be able to figure out that people who gave high ratings to Scarface and Goodfellas are also likely to give a high rating to The Godfather.
The devil, of course, is in the details. How do you take into account movies that are naturally rated higher? How do you take into account users that naturally rate movies higher? While there are approaches to address this (in the form of user and item biases), they are outside the scope of this post. Suffice it to say that these approaches have enough drawbacks to leave one questioning the end goal of the explicit feedback model. For a company like Netflix, is the goal to suggest a movie to a user that they will rate high after watching, or one that they will hit the watch button on? I wouldn't rate Tiger King highly on the "how good is it" scale, and yet I – along with much of the country – binged it. Furthermore, the average user does not rate items in general. Most purchasers of products don't review them. Most restaurant goers don't write reviews. And as can be seen by Netflix switching from a scale of 1-5 stars to the simpler thumbs up/down, the average user did not rate shows on Netflix. This train of thought is what brings us to the other (very commonly used) school of thought...
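For readers curious what user and item biases look like in practice, here is a rough sketch of a simple baseline predictor, where a predicted rating is the global mean plus an item offset plus a user offset. The function names and the toy ratings are assumptions made for illustration; real systems typically add regularization and learn latent factors on top of this.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit a global mean plus per-item and per-user offsets.

    ratings: list of (user_id, item_id, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item bias: how far an item's ratings sit from the global mean.
    item_dev = defaultdict(list)
    for _, item_id, r in ratings:
        item_dev[item_id].append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in item_dev.items()}

    # User bias: what is left after removing the mean and the item bias.
    user_dev = defaultdict(list)
    for user_id, item_id, r in ratings:
        user_dev[user_id].append(r - mu - item_bias[item_id])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}

    return mu, user_bias, item_bias


def predict(mu, user_bias, item_bias, user_id, item_id):
    """Predict a rating; unseen users or items fall back to a zero offset."""
    return mu + user_bias.get(user_id, 0.0) + item_bias.get(item_id, 0.0)


# Toy data: alice rates generously, bob rates harshly.
ratings = [
    ("alice", "Scarface", 5),
    ("alice", "Goodfellas", 5),
    ("bob", "Scarface", 3),
    ("bob", "Goodfellas", 3),
    ("bob", "The Godfather", 4),
]
mu, user_bias, item_bias = fit_baseline(ratings)
alice_godfather = predict(mu, user_bias, item_bias, "alice", "The Godfather")
```

Here alice has never rated The Godfather, but her generous offset pushes the prediction above bob's observed rating for the same film.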
Implicit Feedback Model
In the Implicit Feedback Model the recommendation system attempts to predict if a user will interact with an item given previous interactions they have had with other items. Note the key difference between the implicit and the explicit model – explicit models predict ratings, implicit models predict interactions. Two concepts that, while similar, are far from identical.
These interactions are by their very nature implicit. It is easy to see the allure of this approach from every side.
- You don't need to trust the accuracy of a user's ratings – whereas the explicit model has to take into account both user and item bias, the implicit model can throw that out the window! Furthermore, it is possible to still weight the results to a certain extent – if a user watched more episodes of a TV series, we can treat that as a stronger implicit interaction.
- It is easy to optimize from a business perspective. The goal of most recommendation and ranking systems is to drive higher user interaction. What is an easier metric for a streaming company to optimize for: the average rating a user is giving to content, or how many hours they watch a month? Given all of the pitfalls of depending on ratings, the appeal of the latter is massive.
- Many interesting metrics from "side" implicit interactions can be used. In the explicit model, we would only log the interaction when a user rated the item. With an implicit approach, we can treat interactions like reading the description, clicking the poster, or watching the trailer as different levels of implicit interactions.
- It eliminates the negative feedback problem. This is the question of whether a recommendation system should take bad ratings into account when determining what to recommend. Some papers argue that you can treat such negative interactions (a dislike) as a weakly positive implicit interaction (you watched it, after all!), while others make the case that these should just be thrown away altogether to simplify the task.
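As a rough illustration of the points above about weighting implicit interactions, the sketch below folds a log of events into a single strength per user and item. The event names and weights are hypothetical values chosen for the example, not tuned numbers from any real system.

```python
from collections import defaultdict

# Hypothetical weights: stronger signals of interest get larger values.
EVENT_WEIGHTS = {
    "read_description": 0.5,
    "clicked_poster": 1.0,
    "watched_trailer": 2.0,
    "watched_episode": 4.0,
}


def interaction_strengths(events):
    """Sum event weights per (user_id, item_id) pair.

    events: iterable of (user_id, item_id, event_type) tuples.
    Unknown event types contribute nothing.
    """
    strengths = defaultdict(float)
    for user_id, item_id, event_type in events:
        strengths[(user_id, item_id)] += EVENT_WEIGHTS.get(event_type, 0.0)
    return dict(strengths)


events = [
    ("alice", "Tiger King", "watched_trailer"),
    ("alice", "Tiger King", "watched_episode"),
    ("alice", "Tiger King", "watched_episode"),
    ("alice", "Jerry Maguire", "read_description"),
]
strengths = interaction_strengths(events)
```

Watching more episodes accumulates a larger strength, so a binged series outweighs a title whose description was merely read.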
Implicit|Explicit + Content = Hybrid Systems
In the above two models, all that matters are the interactions – the movies themselves could just as well be widgets. The only requirement is data of the form:
user_id, item_id, rating
This is because most collaborative filtering approaches are based on the premise that similar users like similar things. Hardly an unbelievable concept, and yet one that is terrible for new users. A user that has just started using the service will not have nearly enough ratings to get meaningful suggestions. Furthermore, technical limitations of many of these systems mean that a user's ratings will not be taken into account in real time, but only after enough likes or dislikes have been accumulated. In general this is called the cold start problem. The common solution to this is to use side-data – systems that purely rely on side data are called "Content Based." Systems that use both the interactions and the content data are aptly named hybrid systems.
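One simple way to picture a hybrid system is as a blend that leans on content data for new users and shifts toward collaborative filtering as interactions accumulate. The sketch below assumes both scores are already scaled to the range 0 to 1; the function name `hybrid_score` and the `ramp` parameter are hypothetical, and real hybrid systems are usually far more involved than a linear blend.

```python
def hybrid_score(collab_score, content_score, n_interactions, ramp=20):
    """Blend a collaborative score with a content-based score.

    With zero interactions the content score is used alone; after `ramp`
    interactions the collaborative score takes over entirely.
    """
    if n_interactions < 0:
        raise ValueError("n_interactions must be non-negative")
    alpha = min(n_interactions / ramp, 1.0)
    return alpha * collab_score + (1 - alpha) * content_score


new_user = hybrid_score(0.9, 0.4, n_interactions=0)       # content only
regular_user = hybrid_score(0.9, 0.4, n_interactions=20)  # collaborative only
halfway_user = hybrid_score(0.8, 0.4, n_interactions=10)  # even mix
```

A brand new user is scored purely from side data, which sidesteps the cold start problem at the cost of less personalized results early on.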
Hybrid++++ = Kai Score
It can be easy to guess what most companies use these days: an implicit model that maybe makes use of side-data on items for new users. I would argue that just saying, "That is good enough," is a major disservice to users. The Kai Score is our attempt to solve this problem through a few primary initiatives. The details of each of these could – and likely will – be their own posts, but for now they paint a picture of what you can expect when you see a high Kai Score. Namely, the score is our attempt to accomplish the following non-exhaustive list:
- Make use of explicit interactions: If a user liked a movie, that should be treated differently than if they disliked it.
- Make use of explicit negative interactions: A user not liking a movie is not just a sign to not suggest similar movies, it might be a signal that they actually like a different genre of movies.
- Make use of temporal implicit interactions: Not only do we need to take into account if a user had some implicit interaction with an item, but also when it took place. This is a detail often ignored in these systems.
- Use side data to not only help new users, but always: The model should (and does) take into account an ever-growing knowledge graph that backs and connects every item in our database. This means that not only should we take into account things like "Tom Cruise was in this movie" but, more specifically, "Tom Cruise was in this movie in the 1990s." This example of combining side data to form even newer and richer signals is instrumental in out-performing simpler approaches.
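Two of the ideas in the list above, temporal weighting and combined side data, can be sketched in a few lines. This is an illustration under assumed names and parameters (`recency_weight`, `cross_features`, a 90-day half-life, an `actor@decade` feature format), not a description of how the Kai Score is actually computed.

```python
def recency_weight(age_days, half_life_days=90.0):
    """Exponential decay: an interaction loses half its weight every half-life."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return 0.5 ** (age_days / half_life_days)


def cross_features(item):
    """Combine actor and decade side data into features like 'Tom Cruise@1990s'."""
    decade = f"{item['year'] // 10 * 10}s"
    return {f"{actor}@{decade}" for actor in item["actors"]}


fresh = recency_weight(0)    # an interaction from today keeps full weight
older = recency_weight(90)   # one half-life ago keeps half the weight

jerry_maguire = {"title": "Jerry Maguire", "year": 1996, "actors": ["Tom Cruise"]}
features = cross_features(jerry_maguire)
```

The crossed feature lets a model distinguish a fondness for 1990s Tom Cruise from a fondness for Tom Cruise in general, which neither the actor nor the decade captures alone.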
Most important, but missing from that list, is the key function that makes the Kai Score for movies different from what you may see on a streaming site: the goal of the Kai Score is to exist globally across all of our domains. All of the above interactions, both explicit and implicit, and side data of all sorts, exist outside the context of their originating domain. What you like in streaming gives a strong hint about what you may like in books, podcasts, or perhaps even recipes. While we are still at a very early stage of deployment of the Kai Score, we are extremely optimistic about the possibilities such a framework opens up.
Note: The Kai Score is available to select users while the app is in tech preview – as we expand our user base and get closer to a beta, we will roll it out to more testers.