Is there something going on with speech recognition, beyond the obvious? Microsoft’s deal for Nuance and its Dragon voice recognition could be seen as a speech-to-text enhancement to Office, but it could also be seen as a way of bringing speech recognition into a higher plane of productivity and personal activity support. Or both.
Speech to text is one of those things that’s been around forever but hasn’t really made much news lately. The idea is to have a person talk naturally and a computer tool transcribe the speech directly into an application like a word processor, browser, or other productivity tool. I used Microsoft’s embedded speech-to-text feature when an accident with a cleaning service resulted in a burn to my right hand that kept me from typing at my usual speed for about two weeks. It worked, but not perfectly, and once my typing speed recovered, I gladly abandoned the speech-to-text model.
Dragon from Nuance is arguably the best of the available tools, and it’s widely used not so much by writers as by people like doctors who have to dictate impromptu notes. It’s much more efficient to have verbal comments transcribed than to sit down at a computer and type them, particularly since the professionals who have transcription needs are often not facile typists in the first place. But Microsoft already has tools here, and would Dragon be worth buying for almost $20 billion, well above the rumored premium, if the only motivation were traditional speech recognition and speech to text?
Nuance’s stock was stagnant until the start of the pandemic, and from there it took a sharp upward path. The presumption here is that WFH has disconnected a lot of people from their normal administrative support, making it necessary for key people to produce documents on their own. In addition, Zoom or Teams meetings can benefit from transcribed notes, since it’s hard to find anything quickly in a raw video recording of a meeting. Most users say that the embedded transcription services in videoconference tools are pretty limited.
Whatever the case, the problem here is that the pandemic is ending. It would have made a lot of sense for Microsoft to buy Nuance in March, when the upside was just developing, but less so now, when the only real driver of a change in Nuance’s value in over a decade is likely to wind down. I think it’s unlikely that Microsoft would want to buy Nuance at a higher-than-expected premium when Nuance’s current revenue sources are either static or at risk of decline. There would have to be something else on the table.
A hint of what that might be has been dribbling out of conversations I’ve been having with vendor and cloud technology planners. Some of them have been telling me that Amazon, Apple, Google, and even Microsoft have been looking at how “personal assistant” technology could relate to something I’ve blogged about for years—contextual services. A contextual service is one that’s delivered to someone based on a highly developed notion of what that individual is trying to do. It’s a step beyond the normal notion of an “assistant”, and you can actually see some tentative movement in that direction already.
Have you noticed that your personal assistant seems to be getting more attentive? That there are situations where it jumps into your life even if you haven’t said the traditional wake-up phrase? According to some of my vendor contacts, that’s because the design of these tools is changing, taking the personal assistant into an almost-human role as a co-party in all your conversations. There are obviously major privacy constraints on this sort of evolution, so we shouldn’t expect personal assistant providers to shout their intentions from the rooftops, but there are major benefits as well.
Imagine that you’re having a conversation with a friend, family member, or co-worker, and sitting in on the conversation is a discreet, trusted third party. Most of the time, being discreet, that phantom member of your conversation stays quiet, but it’s following along, and from time to time, when there’s a particularly compelling contribution to be made, it quietly inserts it. That, according to one of my sources, is what personal-assistant developers are looking at. It would be valuable in itself if done correctly, and it could be an on-ramp to contextual services.
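To make that concrete, here’s a minimal sketch of the “quiet third party” pattern as I understand it. Everything in it is hypothetical: the scoring function is a stand-in for a real NLU model, and the threshold and names are mine, not any vendor’s actual API.

```python
from typing import Iterator, List, Optional, Tuple

INTERJECT_THRESHOLD = 0.9  # stay quiet unless the contribution is compelling


def score_contribution(context: List[str]) -> Tuple[Optional[str], float]:
    """Stand-in for a real intent/NLU model; it reacts only to an explicit
    scheduling cue, just to show the shape of the decision."""
    last = context[-1].lower()
    if "what day works" in last:
        return "You're both free Thursday afternoon.", 0.95
    return None, 0.0


def assistant_loop(utterances: List[str]) -> Iterator[str]:
    """Follow the conversation; interject only on a compelling contribution."""
    context: List[str] = []
    for utterance in utterances:
        context.append(utterance)  # keep following along, silently
        suggestion, confidence = score_contribution(context)
        if suggestion and confidence >= INTERJECT_THRESHOLD:
            yield suggestion  # the rare, quiet insertion
        # otherwise: say nothing, like a discreet human listener


# Only the second exchange triggers an interjection.
for line in assistant_loop(["Did you see the game?", "What day works for lunch?"]):
    print(line)
```

The design point is the threshold: the assistant’s value comes as much from what it suppresses as from what it says.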
For those who didn’t read my blogs on that topic (or didn’t find them that memorable!), contextual services are services delivered to an individual based on an inference of what the individual is trying to do, and designed to facilitate that action or action set. The foundation of this sort of service isn’t what you deliver, but how you know what to deliver, meaning how you understand the context of the service’s user and can interpret requests or inject information based on it.
My presumption in the contextual services space has always been that the first step would be to apply contextual knowledge in creating a response to an explicit request. My friends in the personal assistant space have suggested that’s shooting behind the duck, and that contextual services would necessarily be an evolution of the personal assistant, reflecting the assistant’s shift in role from answering questions to volunteering information.
This shift isn’t a universal factor in contextual services; it’s a factor in applications that involve voice communications between the involved parties. If a worker is ranging through a railroad yard looking for an axle, the search isn’t likely to involve the worker asking the axle where it is, so there’s no voice to analyze. On the other hand, a discussion between two workers on how to find it could create a voice trail that would let an augmented personal assistant cut through the crap and tell the pair where the axle is likely to be found, based on database records.
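Here’s a sketch of how that voice trail might map to a lookup. The inventory schema, the trigger phrasing, and the inference step are all my invention, standing in for whatever conversational analysis the real tools would use.

```python
import re
import sqlite3
from typing import Optional


def setup_demo_db() -> sqlite3.Connection:
    """Toy yard-inventory database; schema and contents are illustrative."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (part TEXT, location TEXT)")
    conn.execute("INSERT INTO inventory VALUES ('axle', 'track 7, bay C')")
    return conn


def infer_search_target(utterance: str) -> Optional[str]:
    """Crude context inference: detect that the speakers are hunting a part."""
    match = re.search(r"(?:find|looking for) the (\w+)", utterance.lower())
    return match.group(1) if match else None


def volunteer_location(conn: sqlite3.Connection, utterance: str) -> Optional[str]:
    """Volunteer an answer only when a search context is actually detected."""
    part = infer_search_target(utterance)
    if not part:
        return None  # no search context; stay quiet
    row = conn.execute(
        "SELECT location FROM inventory WHERE part = ?", (part,)
    ).fetchone()
    return f"The {part} was last recorded at {row[0]}." if row else None


conn = setup_demo_db()
print(volunteer_location(conn, "Any idea where to find the axle?"))
# -> The axle was last recorded at track 7, bay C.
```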
The reason this is important is that the pandemic and WFH have created an environment where collaboration via electronic means is more accepted; people are habituated to it. That could mean that they are more prepared for an aggressive personal assistant, and that their behavior is more likely to include significant conversational components that can be interpreted to determine context.
There are a lot of implications if it’s true that Microsoft sees this conversational gold in Nuance. If conversation analysis can establish behavioral context reliably, then you could apply contextual computing to enhance productivity or quality of life without having to deploy a vast number of sensors to develop a model of the real world and the user’s place in it. In fact, a conversational basis for context could be applied quickly to a small number of people, where something based on IoT or “information fields” created by retail operations could require considerable up-front investment before a single user could be served. It might also promote edge computing, because one implementation model would involve an edge-hosted “user agent” that did the conversational analysis and gathered useful information for presentation.
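If I’m right about the edge angle, the attraction is the split itself: raw speech stays on the edge node, and only a distilled context summary leaves it, which also eases the privacy problem I noted earlier. A sketch, with all class and field names hypothetical:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextSummary:
    """The only thing that leaves the edge: inferred intent, not raw speech."""
    intent: str
    entities: List[str]


@dataclass
class EdgeUserAgent:
    transcript: List[str] = field(default_factory=list)

    def ingest(self, utterance: str) -> None:
        self.transcript.append(utterance)  # raw speech never leaves the edge

    def summarize(self) -> ContextSummary:
        # Stand-in for real conversational analysis running at the edge.
        text = " ".join(self.transcript).lower()
        if "axle" in text:
            return ContextSummary(intent="locate-part", entities=["axle"])
        return ContextSummary(intent="unknown", entities=[])


agent = EdgeUserAgent()
agent.ingest("Where did they leave the axle?")
print(agent.summarize())  # the cloud sees only this distilled context
```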
There’s no assurance that this is what Microsoft sees in Nuance, of course. Corporations often have blinders on with respect to opportunity; they see the future as a linear evolution of their own past. Maybe Microsoft does believe that Dragon speech to text is worth twenty billion. I don’t think it is, and I’m hoping they see something more exciting instead.