Contemporary voice recognition systems over-emphasize learning based on explicit “a=b” training; that is, there is a vital absence of false (negative) training, where the system learns what a sound does not mean.
I imagine a parent and child: the parent says “It’s time for …” as a peal of thunder ripples through the room. This might be used as a comedic device precisely because we would not expect the child to respond “Yes, daddy, I’ve turned on the lights in the kitchen”. I’ve yet to hear a voice system ask me what “ACHOO” means or just say “what was that?”
After a hiatus, I return to Windows speech recognition and am confused by just how far ahead it is of the technology we rely on in Siri, Alexa, Google Home, even Microsoft’s own Cortana.
For training it still relies on the old “speak these words” explicit recognition training. This is basically the same tech that shipped with Windows 7, and this comes back to my point: this approach was not even state-of-the-art when Windows 7 shipped.
I believe a far better approach would be a decoupled training procedure: don’t tell the training system what the user is being asked to say. Instead, use a combination of pre-scripted phrases, common keywords, and insight into the state of the network to decide what to ask me.
Then ask the user to exclude options until what remains is close enough that only individual words need correcting.
There are two major gains here. First, the user gets clear feedback on where the system is struggling to understand. Second, instead of teaching the system that “*cough*pi” means “pizza” and that “zzaplease” means “please”, I can acknowledge the system’s ability to match sounds to speech.
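To make the exclusion step concrete, here is a rough Python sketch of how I picture a single round. Everything in it is my own invention for illustration: the TrainingRound class, its method names, and the idea that the recognizer hands back a plain list of guesses. It is not how any shipping system works, just one way the bookkeeping could look.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrainingRound:
    """One hypothetical decoupled round: the recognizer never saw the prompt."""
    candidates: List[str]                       # the recognizer's guesses, best first
    rejected: List[str] = field(default_factory=list)

    def exclude(self, guess: str) -> None:
        """The user rules out a guess; keep it around as a negative example."""
        if guess in self.candidates:
            self.candidates.remove(guess)
            self.rejected.append(guess)

    def labels(self) -> Dict[str, bool]:
        """Map each guess to True (survived) or False (excluded by the user)."""
        result = {guess: False for guess in self.rejected}
        result.update({guess: True for guess in self.candidates})
        return result


# The user was asked to say "order a pizza please"; the recognizer was not told.
training_round = TrainingRound([
    "order a pizza please",
    "order a visa please",
    "water a pizza please",
])
training_round.exclude("order a visa please")
training_round.exclude("water a pizza please")
print(training_round.labels())
```

The point of keeping the rejected list is that the exclusions are training data too: the system gets both the guess that survived and the guesses that were wrong.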
The problem of purely positive training is compounded by the assumption engineers make that the systems will only hear deliberate communication.
Think about this: You cough and your voice system activates. You say “I wasn’t talking to you”, and you get a witty reply.
Except you actually just trained the system that it heard its activation word. It may have changed in recent months, but it was certainly true at the start of the year that all the big systems had this flaw.
Nor does being quiet help.
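As a toy illustration of what I would rather see happen after a false activation, here is a small Python sketch. The function name, the learning rate, and the notion of a single per-sound confidence score are all assumptions of mine, not a description of how Siri, Alexa, or anyone else actually scores wake words.

```python
from typing import Optional


def update_wake_confidence(confidence: float,
                           intended: Optional[bool],
                           rate: float = 0.1) -> float:
    """Nudge a hypothetical wake-word confidence for one sound.

    intended=True  -> the user really was addressing the device (positive).
    intended=False -> the user said "I wasn't talking to you" (negative).
    intended=None  -> the user stayed quiet, so we learn nothing either way.
    """
    if intended is None:
        return confidence
    target = 1.0 if intended else 0.0
    # Move a fraction of the way toward the target, staying within [0, 1].
    return confidence + rate * (target - confidence)


cough_confidence = 0.5
cough_confidence = update_wake_confidence(cough_confidence, intended=False)
print(cough_confidence < 0.5)  # the cough is now less likely to wake the device
```

The flaw I am describing amounts to treating every activation as intended=True, whatever the user says or does not say afterwards.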
I think this is part of why all of the current systems have the ability to suddenly become dumber on you. Perhaps the microphone is suddenly muffled, or perhaps the subtle changes in your voice from having a cold for a day reaffirmed some weak association in the engine, and it’ll take you months to untrain it so that it recognizes your regular voice again.
It’s my hunch this is why there is so often a clear honeymoon period with devices like Alexa, Google Home etc: you become less forgiving, the system becomes over-confident, the first thing you say gets misunderstood, and your speech pattern changes as you become annoyed, angry or bothered by the device. So instead of your normal voice being the voice it expects, your angry or shouty voice is the one it trains itself on the majority of the time.
Alexa does let you give corrective feedback via the Alexa app, but that quickly becomes burdensome and, after the first few months, largely seems to be ineffective.
Positive AND negative training are the way forward.