Introduction
This article is an overview of speech recognition and voice control concepts, and ways they can be implemented. There are various ways to use speech recognition these days, and many people do it via some cloud based system like the Amazon Echo, but this article is mostly about all-local solutions. Both approaches have their pros and cons.
This article is coming at the subject more from the perspective of controlling devices through voice commands, usually by way of some sort of automation system.
Background
Any code here is from my CIDLib system, which is a large (about 1100 classes and 450,000 lines of code) open source project of mine on GitHub, which you can find here:
My CQC automation system is built on top of CIDLib. It makes use of the speech recognition capabilities provided by CIDLib to implement "CQC Voice", a voice controlled 'front end' to CQC, allowing the user to carry out common operations easily via spoken commands.
Some of this will make more sense when you see (and hear) it in action, so I have also done a related video on YouTube, which is here:
Speech Recognition Concepts
There are two primary sorts of speech recognition. There is dictation and there are command and control grammars. Dictation is used for what it says, dictating spoken content so that it can be 'written down' in a hands free way. It's the high tech version of the old Dictaphone, which people used to use to make audio notes to themselves or to be transcribed by a secretary. Now we can just turn the audio directly into text (not always the right text, but it is text.)
Because of what dictation is used for, it has to be able to recognize fairly arbitrary spoken content. So it often involves training for each user, i.e., they have to sit down and say a bunch of words so that the software can understand how they make various sounds, and the software needs to know who is speaking. It may involve specialized forms of speech for good recognition. And the person using the software usually has to speak fairly carefully with good word separation. It's still not unusual that errors will occur and need to be corrected, because it's just a very tough thing to get right.
Dictation really isn't of any use in the voice control oriented world. Expecting an automation system to make sense of arbitrary spoken text is asking too much, and the restrictions on how the user has to speak are too onerous. So voice control oriented systems use the other style, command and control, which involves a 'grammar'. A grammar consists of a set of 'utterances' or 'rules'. These define a limited set of sentences that will be recognized. Flexibility is provided by allowing parts of the utterance/rule to either be one of a set of choices, or possibly a small number of arbitrary words. Each utterance/rule usually has some sort of identifier. Some examples might be:
[SetHouseMode] Put the house into {mode} mode
[SetLightLevel] Set the {light} light to {percent} percent
This is just an arbitrary example syntax. Each utterance/rule has a name, then some associated text. Most or all of the text will be fixed, but it can contain 'replacement tokens' that represent a spot where the user can say some arbitrary words, or some words from a list of pre-determined options. So, in some systems, {mode} in the above example may have a pre-defined set of values; in others it can be whatever words the user says there, within reason of course. When the speech recognition software gets a match, it might pass to the automation system something like:
SetLightLevel, light=Kitchen, percent=50
This is all the automation system cares about. It doesn't want to get into the business of parsing out spoken text and trying to make sense of it. All it needs to know is the name of the rule/utterance and the variable bits, and possibly some associated (non-spoken) information (as we'll see below.) It will use that to decide what to do.
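To make that concrete, here is a minimal, hypothetical sketch of how an automation system might dispatch on the rule name and the variable bits it receives. This is plain generic C++, not CIDLib or CQC code, and all of the names in it are made up for illustration:
#include <functional>
#include <iostream>
#include <map>
#include <string>
// The recognizer hands over just a rule name plus the variable bits,
// modeled here as simple name/value pairs. All names are hypothetical.
using TParams = std::map<std::string, std::string>;
static void SetLightLevel(const TParams& colParams)
{
    std::cout << "Setting " << colParams.at("light") << " light to "
              << colParams.at("percent") << " percent\n";
}
static void SetHouseMode(const TParams& colParams)
{
    std::cout << "Putting the house into " << colParams.at("mode") << " mode\n";
}
int main()
{
    // Map each rule/utterance name to a handler
    const std::map<std::string, std::function<void(const TParams&)>> colHandlers =
    {
        { "SetLightLevel", SetLightLevel }
      , { "SetHouseMode", SetHouseMode }
    };
    // Pretend the recognizer just reported: SetLightLevel, light=Kitchen, percent=50
    const TParams colParams{ { "light", "Kitchen" }, { "percent", "50" } };
    colHandlers.at("SetLightLevel")(colParams);
    return 0;
}
In a real system the handlers would of course talk to lights, thermostats, and so on, rather than just printing.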
Though still a very difficult thing to get right, grammar based schemes vastly reduce the number of possibilities that the speech recognition software has to try to deal with. And, even then, a reasonably sized grammar for an automation system may still involve hundreds of thousands of possible combinations of words.
The speech recognition engine goes through a number of stages to get from an incoming audio wave form to a string of words. Much of it is doctoral thesis stuff and beyond me sadly, but it has to do various forms of analysis to try to recognize sounds, which it can match to parts of words, which it can assemble into chunks that form words, which it can match to utterances/rules.
Speech Recognition Grammars
The above example is quite basic, though something not unlike that might be used in simpler systems like the Echo. For a more full featured grammar, it can get quite complex. On Windows, one uses the Speech Platform SDK to implement speech recognition. I have, of course, encapsulated that functionality in CIDLib, to make it far easier to use.
Grammars on Windows can be in a couple forms, but one is a .grxml file, which is an XML representation of the grammar. Here is a simple example, the same one used in the companion video:
<?xml version="1.0" encoding="ISO-8859-1"?>
<grammar version="1.0"
xml:lang="en-US"
root="TopRule"
tag-format="semantics/1.0"
xmlns="http://www.w3.org/2001/06/grammar"
xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"
sapi:alphabet="x-microsoft-ups">
<rule id="TopRule" scope="public">
<item>
<one-of>
<item><ruleref uri="#GoAway"/></item>
<item><ruleref uri="#Shutup"/></item>
</one-of>
</item>
</rule>
<rule id="GoAway">
<item repeat="0-1">
Please
</item>
<item>
go away
</item>
<tag>
out.Action = "GoAway";
out.Type = "Cmd";
out.Target = "Boss";
</tag>
</rule>
<rule id="Shutup">
<item repeat="0-1">
<one-of>
<item>Would you please</item>
<item>Please</item>
</one-of>
</item>
<item>
shut up
</item>
<tag>
out.Action = "Shutup";
out.Type = "Cmd";
out.Target = "MotherInLaw";
</tag>
</rule>
</grammar>
Ignoring the main element which is just a lot of housekeeping stuff, the rest of the content is defining grammar rules, which are the grxml version of utterances, i.e., rules define recognizable sentences in the grammar. In the case of grxml, because it is intended for use with some fairly complex grammars, it is very much oriented towards breaking rules into reusable pieces that can be combined in various ways, to avoid redundancy.
So you can define rules and bits of rules and reference those within other rules. That in and of itself of course can create its own form of complexity, but it's far better than massive replication of content, which would typically happen otherwise.
In our case, we start the process in the main element by pointing to the top level rule, where we want the recognition engine to start:
root="TopRule"
Everything else beyond that is a hierarchy of rules and references to other sub-rules. Each <rule> element has an id that identifies it for reference in other rules. Our top level rule just refers to the other (actual) rules that are to be available for consideration.
The top level rule is marked 'public' so that it can be seen outside of the .grxml file. It's the only one that needs to be. If we had other grxml files that referenced this one and used rules within it, they would have to be public as well.
Grammar Structure
The grammar structure in grxml is composed of various elements, each of which defines part of the rule hierarchy. An <item> element by itself is an unconditional part of a rule, a piece of text that must be spoken in order to match that rule. But you can also provide repetition counts for items, such as saying it can be present from x to y number of times. The most common scenario is 0 to 1, which means it is optional, e.g. <item repeat="0-1">.
There is also <one-of>, which means one of the contained elements must be spoken. This allows you to provide variations that the user can speak. Such variation is key to allowing for a more natural grammar. However, it is also key to creating enormous amounts of ambiguity, making it far harder for the speech recognition engine to reliably match spoken sentences to rules/utterances. Each variation might significantly increase the number of possible paths through the hierarchy. One-of is also the grxml version of the replacement token mechanism mentioned above, which we'll get into below.
In our example above, our main rule indicates that any valid spoken command must be one of a set of other rules that it lists, in this case just two:
<one-of>
<item><ruleref uri="#GoAway"/></item>
<item><ruleref uri="#Shutup"/></item>
</one-of>
So our two actual rules are GoAway and Shutup. #GoAway refers to another rule in this file with an id of GoAway. If we look at the actual GoAway rule, it looks like this:
<rule id="GoAway">
<item repeat="0-1">
Please
</item>
<item>
go away
</item>
<tag>
out.Action = "GoAway";
out.Type = "Cmd";
out.Target = "Boss";
</tag>
</rule>
So it has an optional leading word of 'please', and a required phrase of 'go away'. So you can either say 'go away' or 'please go away'.
Semantic Information
The rest of the rule is a <tag> element that defines 'semantic information', which is a key aspect of the process. This is information that you define that will be made available to your program when the user's spoken input matches this rule. This is how you know what happened, and it relates back to the replacement tokens in the simpler examples above. In this case, it's not replacement tokens that are sent to you, but whatever arbitrary information you want to associate with the command.
Since the Windows engine doesn't allow for arbitrary spoken text in utterances/rules, as something like the Amazon Echo does, at most you can only provide optional parts of the rule, so the user can say this or that option. But even then you can't get those spoken bits of text. So we have to use semantic values to let the application know which option was matched. Here is an example from CQC Voice of a sub-rule that lets the user say a media player transport command:
<rule id="TransportCmds" scope="private">
<one-of>
<item> pause <tag> out.Cmd = "Pause"; out.Desc = "paused"; </tag> </item>
<item> play <tag> out.Cmd = "Play"; out.Desc = "played"; </tag> </item>
<item> resume <tag> out.Cmd = "Play"; out.Desc = "resumed"; </tag> </item>
<item> restart <tag> out.Cmd = "Play"; out.Desc = "started"; </tag> </item>
<item> start <tag> out.Cmd = "Play"; out.Desc = "started"; </tag> </item>
<item> stop <tag> out.Cmd = "Stop"; out.Desc = "stopped"; </tag> </item>
</one-of>
</rule>
Each of the items in the <one-of> element represents a possible transport command, and each of them sets a couple semantic values. Any higher level rules that include a reference to this rule can access these out.Cmd and out.Desc values. Sometimes, they will use them to set their own, higher level semantic values.
So you can sort of get the same effect as the replacement token text scheme, but only in a static way; you can't really have arbitrary text in the rules and be told what words were spoken there. Of course, for some types of applications, that might be a deal breaker.
Dynamic Grammars
If you could only have completely fixed grammars, that would be even more limiting, but that is not the case. Grammars can be dynamic, in that you can put placeholders in the grammar into which you will, at runtime, insert actual rule content that you generate on the fly or load from configuration.
For instance, in CQC Voice, we need to know the names of your lights and your security zones and your media players and so forth in order to know what you are referring to when you speak commands. We do that by having you tell us what rooms you have and what gear is in each room. We use that information to plug those lists into the overall static grammar file on the fly, customizing the grammar for each room.
Of course, this introduces vastly larger ambiguity issues into the grammar. When you are building a static grammar, you can make a conscious effort to arrange the syntax to get around the worst issues. But when you are inserting arbitrary user configured text into the grammar, that's not generally possible. For instance, if you had a rule or utterance like "Turn the {light} light to fifty percent", what if one of the user's lights was named 'the light'?
But, ultimately you have to use this sort of dynamic grammar capability to provide real world control grammars for anything that's not a bespoke grammar for a single application/user.
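As a rough illustration of the idea, the runtime insertion might amount to generating the variable items from the user's configured names and splicing them into a grammar template before it is compiled. This is not the CIDLib or CQC API, just a generic C++ sketch, and the %(LightNames) placeholder token is made up for the example:
#include <iostream>
#include <string>
#include <vector>
// Build a <one-of> block for the user's configured light names, giving
// each one a semantic value so the application knows which was matched
static std::string strBuildLightItems(const std::vector<std::string>& colNames)
{
    std::string strRet = "<one-of>\n";
    for (const std::string& strName : colNames)
        strRet += "  <item>" + strName + "<tag> out.Light = \"" + strName + "\"; </tag></item>\n";
    strRet += "</one-of>";
    return strRet;
}
int main()
{
    // The names the user configured for this room
    const std::vector<std::string> colLights{ "Kitchen", "Dining Room", "Porch" };
    // A grammar template loaded from disk, with a placeholder where the
    // generated items go (loading and compiling are left out here)
    std::string strGrammar = "...<rule id=\"LightNames\">%(LightNames)</rule>...";
    const std::string strToken = "%(LightNames)";
    strGrammar.replace(strGrammar.find(strToken), strToken.size(), strBuildLightItems(colLights));
    std::cout << strGrammar << "\n";
    return 0;
}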
If your grammar is in fact static, you can just pre-compile it for faster loading. Otherwise, you will compile it on the fly in your application. See the SpGrammarComp sample program in CIDLib for how that is done.
Application Interaction
When the speech recognition engine gets a match, it calculates some confidence levels for the various rules and such, and spits out an event to the application. CIDLib converts this to a TCIDSpeechRecoEv object which gets filled in. You can look at the contents of this event object to decide what to do. In some cases, you won't do anything because the confidence levels are too low. Rules often get triggered spuriously by conversation in the room, TVs going, and so forth.
If you just dump one out to a text output stream, you would see something like this (for our GoAway rule above):
{
RULE: GoAway
BASIC CONF: 0.93
TIME: Wed, Apr 03 18:55:49 2019 -0400
SEMANTIC
{
[0.91] Action=GoAway
[0.91] Type=Cmd
[0.91] Target=Boss
}
RULES
{
[0.93] /GoAway
}
}
So we are told the top level rule that was triggered, a basic confidence, and a time stamp. Then all of the semantic information is provided, with a confidence level for each value. And finally, all of the individual rules that made up the final top level match are listed, with confidence levels for them as well.
So you can see that we got a good match in this case with confidences in the 90 percent range, making it clear we should react to this event. We would just use the semantic information to decide what the user wanted and to call a method to carry out that command, passing along semantic info where such commands need to be parameterized based on some specific option matched.
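In code, that decision might look something like the minimal sketch below. The event structure and the 0.8 threshold here are hypothetical stand-ins rather than the actual TCIDSpeechRecoEv interface; the point is just the confidence check followed by semantic-driven dispatch:
#include <map>
#include <string>
// A hypothetical, simplified view of a recognition event
struct TRecEvent
{
    std::string                        strRule;       // e.g. "GoAway"
    double                             f8Confidence;  // e.g. 0.93
    std::map<std::string, std::string> colSemantics;  // e.g. Action=GoAway, Type=Cmd
};
// Decide whether and how to react to an event
static void ProcessEvent(const TRecEvent& recevCur)
{
    // Ignore low confidence matches, which are often just room noise or TV audio
    if (recevCur.f8Confidence < 0.8)
        return;
    // Use the semantic values, not the spoken text, to decide what to do
    const std::string& strAction = recevCur.colSemantics.at("Action");
    if (strAction == "GoAway")
    {
        // ... invoke whatever carries out the GoAway command
    }
}
int main()
{
    ProcessEvent({ "GoAway", 0.93, { { "Action", "GoAway" }, { "Type", "Cmd" } } });
    return 0;
}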
In a realistic application, typically you would have a thread dedicated to watching for such events. That thread would either carry out the commands itself, or it would queue them up for other threads to do the actual work. Since all commands would be coming from a single speaker (as in human speaker, not a speaker in a box), typically you would serialize them, though it's possible that some would be treated specially and prioritized which might argue for separate spooling and processing threads via a priority based queue.
At its simplest, the processing loop might look like this:
TCIDSpeechRecoEv sprecevCur;
tCIDAudStream::EStrmStates eState;
while (!thrThis.bCheckShutdownReq())
{
    // Wait up to 250ms for a recognition event to show up
    if (sprecoTest.bGetNextRecEvent(sprecevCur, 250, eState))
    {
        // Queue it up for the processing thread (assuming the collection's add method here)
        m_colProcQ.objAdd(sprecevCur);
    }
}
It's just waiting a while for an event. If one shows up, it queues it up for someone else to process, and continues until the thread is asked to shut down.
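For the consuming side, here is a minimal generic sketch of a worker thread draining such a queue, using standard C++ threading rather than the CIDLib thread and queue classes; the queue and event types are simplified stand-ins:
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
// A trivial thread safe queue of recognition events. A plain string stands
// in for the real event class just to keep the sketch short.
struct TEventQueue
{
    void Put(std::string strEvent)
    {
        {
            std::lock_guard<std::mutex> lockQ(m_mtxSync);
            m_colEvents.push(std::move(strEvent));
        }
        m_condWait.notify_one();
    }
    std::string strGetNext()
    {
        std::unique_lock<std::mutex> lockQ(m_mtxSync);
        m_condWait.wait(lockQ, [this]{ return !m_colEvents.empty(); });
        std::string strRet = std::move(m_colEvents.front());
        m_colEvents.pop();
        return strRet;
    }
    std::mutex              m_mtxSync;
    std::condition_variable m_condWait;
    std::queue<std::string> m_colEvents;
};
int main()
{
    TEventQueue colProcQ;
    // Worker thread that serializes the processing of queued commands
    std::thread thrWorker([&colProcQ]
    {
        for (;;)
        {
            const std::string strEvent = colProcQ.strGetNext();
            if (strEvent == "Shutdown")
                break;
            // ... carry out the command described by the event
        }
    });
    // The watching thread (like the loop above) would call Put() for each event
    colProcQ.Put("GoAway");
    colProcQ.Put("Shutdown");
    thrWorker.join();
    return 0;
}
A priority based queue, as mentioned above, would just replace the plain std::queue with something that orders events by priority.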
Pros and Cons
There are various pros and cons to the available means for voice control. Primarily, there is the local vs. cloud conundrum. Cloud based systems are not under your control, are subject to availability, may be cut off at any time, or may have their terms of use changed on you. And of course, there are always various tin-foil hat issues, which may or may not actually require all that much tin-foil to be concerned about. All-local solutions are under your control, not dependent on any outside parties or systems, and everything you say isn't going to be potentially available to some large corporation.
On the other hand, cloud based systems have available to them the latest and greatest deep neural network (DNN) based speech recognition technology, which makes a significant difference. They are very good at rejecting background noise and picking out the voice command over the screaming kids and loud TV. So they are more likely to 'get it right' than the local system, other things being mostly equal.
Also, systems like the Echo come with their own hardware. Voice control systems typically cannot use a single microphone, unless it is right in front of you. Picking up a voice in a room, with all its reflective surfaces and background noise, is tricky, so voice control systems use what are called 'microphone arrays', which are a linear or circular collection of small microphones. The mic array hardware takes advantage of the slightly different arrival times of sounds to reject or reduce those not of interest, enhance those of interest, and ignore reflections. And because it is done via sound wave phase trickery, there's no physical repositioning of a microphone that has to occur. It happens almost instantly.
These types of microphone arrays are available for third party use, but most of them are targeting the corporate conference room market, and hence are not commodity products. In days of old, many people would use a Microsoft Kinect box. It did many things, but among them, it had a fairly good mic array for this sort of work. But the Kinect has been discontinued. There are some DIY solutions, but those will only be useful to a very small niche of potential voice control users, willing to pick up a soldering iron, and their performance may or may not be on par with commercial solutions.
You can also greatly improve performance in some cases by using a compressor and noise gate, devices that are more typically associated with music production. They can greatly increase success by reducing the incoming signal's dynamic range and by removing background noise between spoken commands. Of course, they will have to be well tuned, and that can be difficult since you may be a foot from the array or thirty feet away. Some mic array products may have one or both of these capabilities built in, to varying degrees of sophistication.
It's unlikely that high quality DNN based voice control systems are going to be available for all-local use soon, at least none that are open systems that can be interfaced to by automation systems, and which are reasonably priced for end users. Such systems require enormous amounts of data to train the neural network, and that is a huge barrier to entry into this world, so most companies will keep that advantage to themselves and tie you to their cloud platforms in the bargain. And of course, they can use all those incoming customer commands as ongoing training fodder, further refining their neural networks, an advantage that an all local product would not have.
In the end, you have to pick your own poison, as the saying goes.