I am very excited to announce that we are going to be releasing on-device automatic speech recognition (ASR) capabilities this summer. This is still being tested internally, but I wanted to provide some context for the goals of this initiative and what to expect.

A more thorough overview of modern ASR and challenges that come with it can be found here: https://dl.acm.org/doi/pdf/10.1145/3400713.3400715

ASR (simplified)

Right now there are very few companies that have successfully deployed on-device ASR to mobile devices (and by few, I basically mean just Google). Why? Because it is hard. To understand why, I will need to explain, from a 30,000-foot view, the very basics of speech recognition.[0] Automatic Speech Recognition, or ASR as it is usually called, is the task of taking audio input and turning it into text. In a less academic setting, it is often called Speech-to-Text. Most models and approaches are typically broken down into three steps.

  1. Feature Extraction: Take in the raw audio[1] and extract information the model can actually use. This is done because, obviously, a computer cannot hear. In image processing, this often means working with the pixel colors; speech has analogous techniques, some much more complicated and involved than others.
  2. Acoustic Modeling: From the features extracted in step 1, the model learns to map the input audio to text[2]. In some models this means converting directly to letters; in others, converting into sounds (phonemes) first. Some even go directly to full words![3]
  3. Language Modeling: The final stage can be thought of as the cleanup. The text transcribed from sounds can be made much more accurate by allowing the model to correct the raw output. For example, it could correct someone saying "I want to read the chat in the hat" to "I want to read the cat in the hat." There are endless ways to do this, but they are critical in making up for the shortcomings of a purely acoustic model. (A rough code sketch of these three steps follows this list.)
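
Here is a minimal sketch of those three steps in code, if that is more your speed. The feature extraction uses a standard log-mel spectrogram via librosa; the acoustic and language modeling functions are empty placeholders purely for illustration, not a description of our actual models.

    import numpy as np
    import librosa

    def extract_features(audio, sample_rate=16000):
        """Step 1: turn raw samples into log-mel features the model can actually use."""
        mel = librosa.feature.melspectrogram(
            y=audio, sr=sample_rate,
            n_fft=400, hop_length=160, n_mels=80,  # 25 ms windows every 10 ms
        )
        return librosa.power_to_db(mel)            # shape: (80, num_frames)

    def acoustic_model(features):
        """Step 2: map features to raw symbols (letters, phonemes, ...). Placeholder only."""
        raise NotImplementedError("stand-in for a trained acoustic model")

    def language_model_cleanup(raw_text):
        """Step 3: fix up the raw output, e.g. 'chat in the hat' -> 'cat in the hat'."""
        raise NotImplementedError("stand-in for a trained language model")

    # One second of a synthetic 440 Hz tone stands in for recorded speech.
    audio = 0.1 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    print(extract_features(audio).shape)           # (80, 101)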

Oftentimes, when I work with newer Deep Learning practitioners, one of the first tricks I teach them is to consider the intuition behind these deep models as they relate to a person. While there are many exceptions to this in general, the analogy holds up well for the three-step breakdown above. The human ear can be seen as step (1), converting the sound wave into electrical signals that the brain[4] can then interpret. One could say that the human brain then carries out a two-stage process much like steps (2) and (3). The electrical signals are first converted into sounds, and then those sounds get combined with our life knowledge to find the closest sentence that makes sense.

On Device

With the above 3 steps in mind, the deployment pattern of speech recognition on devices becomes much easier to understand. The fundamental question is where you "split," or partition, the following flow between the device and the cloud:

audio -> features (1) -> acoustic (2) -> language (3)

In 99% of cases, the split happens at the very start. The audio is recorded on the device and then uploaded to the server, which does the rest. On the plus side, you can support speech recognition on any device that has a mic and a network connection. Sadly, it also means that you have to spend $$$ on cloud servers in every region to support low-latency recognition. And that is not taking into account why this is less desirable for the user. Firstly, they have a worse experience – the speed of recognition is limited by the latency between their device and the cloud. Secondly, you are not making full use of your hardware. If you already have a phone with a powerful CPU, a GPU, and at times even a dedicated neural processing chip, then those are all sitting dormant. And perhaps most importantly, you are being forced to upload all of your audio to the cloud. Aside from the impact on privacy,[5] this also means that you wreck your data plan if you don't have WiFi.

So why doesn't everyone just do speech recognition on the phone when the hardware is available, and delegate to the cloud otherwise? Better yet – why not have the split in the pipeline between device and cloud be variable, so that it is not all or none? Sadly, it is not that trivial.
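
The idea of a variable split is easy to sketch – the hard part, as we will see below, is making the device-side models small and fast enough. Purely as an illustration (this is nobody's production code), you can picture the split as an index into the pipeline: stages before the split run on the device, and everything at or after it is handed off to the cloud. Every stage here is a trivial placeholder.

    PIPELINE = ["features", "acoustic", "language"]

    def run_stage(stage, data, where):
        # Stand-in for either an on-device model call or a cloud RPC.
        print(f"running {stage} on the {where}")
        return f"{stage}({data})"

    def transcribe(audio, split):
        """split=0 uploads the raw audio (the common case today); split=3 keeps everything on the device."""
        data = audio
        for i, stage in enumerate(PIPELINE):
            where = "device" if i < split else "cloud"
            data = run_stage(stage, data, where)
        return data

    transcribe("raw-audio", split=0)  # entirely in the cloud
    transcribe("raw-audio", split=2)  # features and acoustic model on device, language model in the cloud
    transcribe("raw-audio", split=3)  # entirely on the device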

The MeetKai ASR Initiative: Hybrid Deployment ASR

Our proposed solution to this problem required us to design from the ground up with a hybrid split in mind: depending on the deployment target, the pipeline can be segmented between device and cloud. This means that everything from recording the audio to the final transcription can take place on the device, or the entire process can be accomplished in the cloud – or anywhere in between! The reason it was important to design it this way from the start, rather than after the fact, becomes clear if you look at what is out there already.

  1. Kaldi-based solutions are considered the traditional choice. Kaldi was the de facto approach to ASR in the 2010s. However, its deployment is far from trivial. Its roots in academic research spawned a cottage consulting industry around how to actually deploy it.
  2. DeepSpeech-based solutions were spawned by Baidu's 2014 Deep Speech paper and its 2015 successor, Deep Speech 2. There are two primary open-source implementations – one from Baidu themselves and one from Mozilla.
  3. More complex deep models: Facebook, Google, NVIDIA, and MeetKai fall into this big bucket.

The reason we ended up in bucket 3 rather than in 1 or 2 is the same reason that you can't do this after the fact, and it becomes evident if you actually try to deploy either of the first two options. My opinions of modern Kaldi deployment are outside the scope of this article, and the project itself seems to be in a transition phase. So we will put that aside for now and instead take a look at DeepSpeech. If you try to replace your cloud speech recognition stack with DeepSpeech running on a device, you will likely not have a good time.

  1. The model(s) are not small. The (released) Baidu English acoustic model is 186MB and their English language model comes in at a whopping 8.3GB. The story is worse for Mandarin, with a 758MB acoustic model, a "small" language model of 2.8GB, and a "large" language model coming in at 70GB. The newer Mozilla DeepSpeech models are much smaller, but not by enough to be considered "app sized": their English acoustic model is 45MB, and the language model is 900MB. I don't know about you, but I would not be happy to have an app take up that much space.
  2. The models do not make full use of the native hardware. The DeepSpeech demo from Mozilla for Android does not make any use of the neural accelerator on a device. If you have a high-end flagship Android phone, you might have such a dedicated AI chip. While the approach can be adapted (and I have seen it done) so that a DeepSpeech model utilizes the accelerator, it is hard to say whether it is worth it. This is because of problem (3)...
  3. DeepSpeech models do not have state-of-the-art performance in practice. For fun, we did some testing with users utilizing the default Mozilla DeepSpeech model. We found that they were not happy with the accuracy. This is no fault of Mozilla’s; in fact they are doing really well – speech recognition is just that hard. Sadly, in the AI community, 6 years might as well be a lifetime, and it has been that long since the original Baidu paper was released. There has been a huge surge in new techniques in just the past 2 years that enable an entirely new suite of approaches.

When we release on-device English ASR this summer, you can expect the app to take up approximately 100MB[6] of additional space. This is an order of magnitude less than the only publicly[7] available deep model. Furthermore, if your device supports it, the model will make full use of the on-device neural accelerator. On the Kirin 990, which leads the mobile AI leaderboard, we are seeing substantially faster-than-realtime transcription.[8] While this work is still very much an ongoing project, we are extremely excited to share this preview of our initiative.

Notes:

[0] I will be covering technical details of our speech recognition in later posts as we get closer to the release of this feature.

[1] By raw audio, I mean the audio samples, the specifics of which are way beyond the scope of this post. The "explain it like I am 5" version is that 1 second of audio generates many numbers, and each of these numbers represents a single sample. We measure how many samples there are in kHz, or how many thousands of readings were taken per second. Typically in speech recognition we sample at 8kHz, 16kHz, or 48kHz. These numbers can be "played" back for you to hear what was recorded. A very extreme simplification, but sufficient for understanding this post.
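
Concretely (a quick sketch of the arithmetic in this note): one second of 16kHz audio is just 16,000 numbers.

    import numpy as np

    sample_rate = 16_000                          # 16 kHz, common for speech
    t = np.arange(sample_rate) / sample_rate      # one second of timestamps
    samples = 0.1 * np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone
    print(len(samples))                           # 16000 samples for 1 second of audio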

[2] Again, a major oversimplification, but this will be the topic for a later post.

[3] The approaches (letters or graphemes, words) can be thought of as different ends of a spectrum, with middle points like sub-words. This is when language is broken down into units longer than a single character but shorter than whole words.
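
In miniature, and purely for illustration: the same sentence split into characters, sub-words, and words. Real sub-word vocabularies are learned from data (for example with byte-pair encoding); the split below is hand-picked.

    sentence = "speech recognition"

    characters = list(sentence)
    subwords = ["spe", "ech", " re", "cog", "ni", "tion"]  # hand-picked, not learned
    words = sentence.split()

    print(characters)
    print(subwords)
    print(words)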

[4] This is just about the intuition: after all, deep learning is inspired by the human brain rather than an implementation of it.

[5] I don't mean to understate the importance of this – I just believe it deserves its own post.

[6] We actually have versions running now that come in under 50MB, but there is an obvious trade-off between model size and accuracy. Furthermore, different languages have different model sizes, but we are working to keep them all roughly the same size.

[7] If you include very non-public models, then the Google ASR model for the Pixel is around 100MB as well.

[8] Faster than realtime can be thought of as: how long does it take to transcribe 1 second of audio? Taking 1 second to do so would be considered realtime; anything under that is called "faster than realtime". With our 25MB English model on a P40 Pro, we can transcribe ~2.5 seconds of audio in under 250ms.
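
As quick arithmetic, using the numbers above: the real-time factor (RTF) is processing time divided by audio duration, and an RTF below 1.0 means faster than realtime.

    audio_seconds = 2.5        # audio transcribed in the example above
    processing_seconds = 0.25  # upper bound on the time it took on the device
    rtf = processing_seconds / audio_seconds
    print(rtf)                 # 0.1 -> roughly 10x faster than realtime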