Speech Recognition APIs in Windows Phone 8

Windows Phone 8 has a new fully integrated engine and APIs for using speech recognition. I’ll dive into it in this post.

Introduction

In order to use speech recognition in an application, we first have to enable the required capabilities in the WMAppManifest.xml file. Two capabilities that aren’t checked by default need to be added:

  • ID_CAP_MICROPHONE
  • ID_CAP_SPEECH_RECOGNITION

[Screenshot: enabling the speech capabilities in WMAppManifest.xml]

A basic speech recognition example will look something like this:

private async void Button_Click_1(object sender, RoutedEventArgs e)
{
    // SpeechRecognizerUI class: engine + UI.
    var recognizerWithUI = new SpeechRecognizerUI();

    // Activate recognition (with the dictation grammar by default) and get the result.
    SpeechRecognitionUIResult recognizerResult = await recognizerWithUI.RecognizeWithUIAsync();

    // Display the result in a MessageBox.
    MessageBox.Show(string.Format("I heard {0}.", recognizerResult.RecognitionResult.Text));
}


Let’s go over what these lines mean:

  • First, I marked the entire button click method as async, so it can contain awaited asynchronous operations.
  • I initialized a new variable of type SpeechRecognizerUI, which is the class responsible for opening a new window with the default UI and for handling the speech recognition itself (capturing sound, matching it against the grammar, etc.).
  • I called the method RecognizeWithUIAsync() to perform the actual action of displaying the default speech recognition window, and stored the result of the recognition in a SpeechRecognitionUIResult.

[Screenshots: the default speech recognition UI listening and showing the result]

By default, the recognition itself is performed by a remote web service that works out the correct words, so network access needs to be enabled and currently connected for this to work.

  • Finally, I did whatever I wanted with the result, in this case simply displaying it in a MessageBox. (A slightly more defensive version of this example is sketched below.)
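Since the default dictation grammar relies on the remote service, and the user can always cancel the recognition screen, a slightly more careful version might check for network access and for a successful result before using the text. This is only a minimal sketch, not required boilerplate; it uses DeviceNetworkInformation.IsNetworkAvailable from Microsoft.Phone.Net.NetworkInformation and the ResultStatus property covered later in this post:

private async void Button_Click_1(object sender, RoutedEventArgs e)
{
    // The default dictation grammar needs a network connection, so bail out early if there is none.
    if (!Microsoft.Phone.Net.NetworkInformation.DeviceNetworkInformation.IsNetworkAvailable)
    {
        MessageBox.Show("Speech recognition with the dictation grammar needs a network connection.");
        return;
    }

    var recognizerWithUI = new SpeechRecognizerUI();
    SpeechRecognitionUIResult recognizerResult = await recognizerWithUI.RecognizeWithUIAsync();

    // Only use the text if the recognition actually succeeded (the user may have cancelled).
    if (recognizerResult.ResultStatus == SpeechRecognitionUIStatus.Succeeded)
    {
        MessageBox.Show(string.Format("I heard {0}.", recognizerResult.RecognitionResult.Text));
    }
}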

Grammars

The speech recognition runtime is an engine that compares the given speech to a known set of syntax and words. There are several options for where and how these words and syntax are defined, and they divide into two categories: online grammars, which are provided for us by Microsoft, and local ones, which we write ourselves and load at runtime.

When using an online grammar, we can specify only one (otherwise a runtime exception will be thrown when trying to add the second one), but when pre-loading local grammars we can add several different ones to the same recognizer object.

The narrower and more localized the grammar is, the better the engine will recognize what the user actually said, and the faster the process will be, so the size of the dictionary should also be taken into consideration.

Online Grammars

Online grammars use a remote service to recognize the words said by the user. That means a network connection is a must for them to work, and they can potentially be slower than the local ones. We also have no control over the words they contain; these are pre-built grammars given to us by the nice people of Microsoft.

There are two options with online grammars:

  • Dictation grammar, which is the default grammar if no other grammar is specified (as in the first example shown above). This free-text grammar can potentially contain everything. It doesn’t need to be set manually, but it can be set explicitly by calling the AddGrammarFromPredefinedType method on the Grammars collection of a SpeechRecognizer object, with a name of our choosing and the Dictation value from the SpeechPredefinedGrammar enum:

SpeechRecognizer reco = new SpeechRecognizer();

// Add a dictation grammar to the grammars collection in the recognizer.
reco.Grammars.AddGrammarFromPredefinedType("typeName",
                    SpeechPredefinedGrammar.Dictation);

  • Web search grammar, which is optimized for web searches. It is very similar to the full dictation grammar, with slight differences (it looks for shorter phrases and for matches that make sense when searching the web). Again, we set it up manually using the same method, this time with the WebSearch value from the SpeechPredefinedGrammar enum:

SpeechRecognizer reco = new SpeechRecognizer();

// Add a WebSearch grammar.
reco.Grammars.AddGrammarFromPredefinedType("typeName",
                    SpeechPredefinedGrammar.WebSearch);

Local Grammars

Local grammars are defined and loaded in code, and can therefore only recognize exactly what we defined for them, which improves both performance and accuracy. One SpeechRecognizer object can have several different local grammars, and they can be of different local types.

The local types of grammar are:

  • A list of strings defined in code (any collection that implements IEnumerable<string>), which is added through the recognizer’s Grammars property, this time by using the AddGrammarFromList method with the list of strings we created:

SpeechRecognizerUI recoWithUI = new SpeechRecognizerUI();

// Create an array of strings
string[] babylon5Characters = { "John Sheridan", "Jeffry Sinclair", "Susan Ivanova", "Delenn", "Londo Molari", "G'Kar" };

// Add array of strings to grammars
recoWithUI.Recognizer.Grammars.AddGrammarFromList("myB5CharactersList", babylon5Characters);

[Screenshot: recognition result from the list grammar]

(Try sending the word “G’Kar” to the online grammars…)

And we can add more list grammars, and they will all work together:

string[] babylon5Characters = { "John Sheridan", "Jeffry Sinclair", "Susan Ivanova", "Delenn", "Londo Molari", "G'Kar" };

// Add array of strings to grammars
recoWithUI.Recognizer.Grammars.AddGrammarFromList("myB5CharactersList", babylon5Characters);

List<string> farscapeCharacters = new List<string> { "John Crichton", "Aeryn Sun", "Ka Dargo", "Chianna", "Rygel" };

// Add another grammar from a list of strings
recoWithUI.Recognizer.Grammars.AddGrammarFromList("myFarscapeCharactersList", farscapeCharacters);

[Screenshots: recognition results from the combined list grammars]

  • SRGS grammar- which is an XML-based file, written in the standard Speech Recognition Grammar Specification format. These XML files are meant for more complex sentences and syntax, and offer many ways to shape and customize a grammar. How to write an SRGS XML is out of the scope of this blog post; I have provided links to resources about SRGS and how to write them at the end of this post. In order to add an SRGS file to our application, we use the method AddGrammarFromUri, which accepts only absolute URIs:

SpeechRecognizerUI recoWithUI = new SpeechRecognizerUI();

// Initialize URI with my path to the SRGS XML file.
Uri babylon5Grammer = new Uri("ms-appx:///Babylon5Characters.grxml", UriKind.Absolute);

// Add the SRGS grammar to the grammars collection.
recoWithUI.Recognizer.Grammars.AddGrammarFromUri("b5", babylon5Grammer);

With the following SRGS file:

<?xml version="1.0" encoding="utf-8" ?>
<grammar version="1.0" xml:lang="en-US" root="characters" tag-format="semantics/1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">

  <rule id="characters" scope="public">
    <one-of>
      <item> John Sheriden </item>
      <item> Jeffry Sinclair </item>
      <item> Londo Molari </item>
      <item> Delenn </item>
      <item> Susan Ivanova </item>
    </one-of>
  </rule>

</grammar>


And as explained before, we can also combine different local grammars:

Uri babylon5Grammer = new Uri("ms-appx:///Babylon5Characters.grxml", UriKind.Absolute);

// Add SRGS grammar from file to Grammars
recoWithUI.Recognizer.Grammars.AddGrammarFromUri("b5", babylon5Grammer);

List<string> farscapeCharacters = new List<string> { "John Crichton", "Aeryn Sun", "Ka Dargo", "Chiana", "Rygel" };

// Add a list of strings to Grammars as well
recoWithUI.Recognizer.Grammars.AddGrammarFromList("myFarscapeCharactersList", farscapeCharacters);

API and objects

The main object at work here is the recognizer, the engine itself. The class is SpeechRecognizer, and we can simply instantiate it and work with it as it is. If we want the default UI, we use the SpeechRecognizerUI class instead, which has a SpeechRecognizer wrapped in it as a property called Recognizer.
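If we don’t need the built-in UI at all, we can work with the bare SpeechRecognizer directly and call its RecognizeAsync method (covered below). A minimal sketch, inside an async method and reusing the list grammar from earlier, with a simple confidence check on the result:

// A recognition pass without the built-in UI: work with SpeechRecognizer directly.
SpeechRecognizer recognizer = new SpeechRecognizer();

string[] babylon5Characters = { "John Sheridan", "Jeffry Sinclair", "Susan Ivanova", "Delenn", "Londo Molari", "G'Kar" };
recognizer.Grammars.AddGrammarFromList("myB5CharactersList", babylon5Characters);

// RecognizeAsync starts listening immediately; no prompt screen is shown.
SpeechRecognitionResult result = await recognizer.RecognizeAsync();

// TextConfidence tells us how sure the engine is about the match.
if (result.TextConfidence != SpeechRecognitionConfidence.Rejected)
{
    MessageBox.Show(string.Format("I heard {0}.", result.Text));
}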

In the SpeechRecognizer we can change or set things including:

  • Language- there are several speech recognition engines installed on the phone, and we can select one from the InstalledSpeechRecognizers object, which contains a list of all the available recognizers and a property for the default one (the default engine is chosen according to the OS language). To get or set the recognizer’s language we use the SpeechRecognizer methods SetRecognizer and GetRecognizer.

If we want to set the recognizer’s language in code, we can choose a language from the installed ones, and set it to the recognizer:

    var recoWithUI = new SpeechRecognizerUI();
    recoWithUI.Recognizer.SetRecognizer(InstalledSpeechRecognizers.All.First());

The default recognition language set by the OS is accessible in the InstalledSpeechRecognizers.Default property.
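If we want a specific language rather than just the first installed recognizer, we can filter the installed recognizers. A minimal sketch, assuming the SpeechRecognizerInformation items expose a Language string such as "de-DE" (and a using System.Linq directive):

var recoWithUI = new SpeechRecognizerUI();

// Look for an installed German recognizer; keep the OS default if it isn't there.
SpeechRecognizerInformation german =
    InstalledSpeechRecognizers.All.FirstOrDefault(r => r.Language == "de-DE");

if (german != null)
{
    recoWithUI.Recognizer.SetRecognizer(german);
}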

Or we can let the OS pick the default automatically (here I changed the OS language to German):

[Screenshots: the OS language set to German and the recognizer defaulting to German]

  • Speech settings- Settings is a property of SpeechRecognizer, of type SpeechRecognizerSettings, with three settings that are editable, although the property as a whole is NOT settable:

// How long background noise (non-speech audio) is tolerated before recognition gives up.
recognizer.Settings.BabbleTimeout = TimeSpan.FromSeconds(42);

// How long a pause after the user stops speaking before the result is finalized.
recognizer.Settings.EndSilenceTimeout = TimeSpan.FromSeconds(42);

// How long to wait for the user to start speaking before timing out.
recognizer.Settings.InitialSilenceTimeout = TimeSpan.FromSeconds(42);

  • Grammar settings- which were already discussed earlier. We can either use the Grammars property to get the existing grammars as a SpeechGrammarSet object (a wrapper around a list with some extras), or add grammars to the set using the provided Add methods (AddGrammarFromList, AddGrammarFromPredefinedType and AddGrammarFromUri).
  • Async options- we can use the SpeechRecognizer’s PreloadGrammarsAsync method to load grammars ahead of time, and therefore avoid a wait when the recognizer is first used (a short sketch appears after the event-handling code below). The RecognizeAsync method, which starts recognition on a SpeechRecognizer, and the RecognizeWithUIAsync method on SpeechRecognizerUI are both async-only calls.
  • Audio events – these are two events that are already handled automatically by the SpeechRecognizerUI object, but if we want to implement our own logic we can subscribe to them. The AudioCaptureStateChanged event occurs when the audio capture state changes, and AudioProblemOccurred occurs when there is a problem with the audio. The code for handling these events looks like this:

private SpeechRecognizer _recognizer;

public MainPage()
{
    InitializeComponent();

    _recognizer = new SpeechRecognizer();

    // Subscribe to the AudioCaptureStateChanged event.
    _recognizer.AudioCaptureStateChanged += Recognizer_AudioCaptureStateChanged;

    // Subscribe to the AudioProblemOccurred event.
    _recognizer.AudioProblemOccurred += Recognizer_AudioProblemOccurred;
}

void Recognizer_AudioProblemOccurred(SpeechRecognizer sender, SpeechAudioProblemOccurredEventArgs args)
{
    // Get the SpeechRecognitionAudioProblem enum value from the event args, to determine the nature of the problem.
    SpeechRecognitionAudioProblem speechRecognitionAudioProblem = args.Problem;

    // Handle the problem accordingly.
    if (speechRecognitionAudioProblem == SpeechRecognitionAudioProblem.TooSlow)
    {
        // Do something…
    }
    else if (speechRecognitionAudioProblem == SpeechRecognitionAudioProblem.NoSignal)
    {
        // Do something else
    }
    // …Continue with more problem options
}

void Recognizer_AudioCaptureStateChanged(SpeechRecognizer sender, SpeechRecognizerAudioCaptureStateChangedEventArgs args)
{
    // Get the changed audio state from the event args to figure out what happened.
    SpeechRecognizerAudioCaptureState speechRecognizerAudioCaptureState = args.State;

    // Handle the changed audio state.
    if (speechRecognizerAudioCaptureState == SpeechRecognizerAudioCaptureState.Capturing)
    {
        // Do something…
    }
    else if (speechRecognizerAudioCaptureState == SpeechRecognizerAudioCaptureState.Inactive)
    {
        // Do something else
    }
}
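Returning to the async options bullet above, here is a minimal sketch of preloading the grammar set ahead of time (assuming the PreloadGrammarsAsync method on the wrapped Recognizer, as mentioned earlier), so the first recognition doesn’t pay the grammar-loading cost:

var recoWithUI = new SpeechRecognizerUI();

string[] babylon5Characters = { "John Sheridan", "Jeffry Sinclair", "Susan Ivanova", "Delenn", "Londo Molari", "G'Kar" };
recoWithUI.Recognizer.Grammars.AddGrammarFromList("myB5CharactersList", babylon5Characters);

// Load and compile the grammars now, e.g. while the page is loading...
await recoWithUI.Recognizer.PreloadGrammarsAsync();

// ...so this call starts listening without a grammar-loading delay.
SpeechRecognitionUIResult recoResult = await recoWithUI.RecognizeWithUIAsync();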

SpeechRecognizerUI

SpeechRecognizerUI is a wrapper around the SpeechRecognizer (exposed through the Recognizer property). It also has a Settings property of type SpeechRecognizerUISettings, which we can edit:

var recoWithUI = new SpeechRecognizerUI();

string[] babylon5Characters = { "John Sheriden", "Jeffry Sinclair", "Susan Ivanova", "Delenn", "Londo Molari", "G'Kar" };
recoWithUI.Recognizer.Grammars.AddGrammarFromList("myB5CharactersList", babylon5Characters);

// Display to the user what we are expecting as an answer
recoWithUI.Settings.ExampleText = "Londo Molari";

// Ask the user a question
recoWithUI.Settings.ListenText = "Who is your favorite Babylon 5 character?";

// Read the result out loud or not
recoWithUI.Settings.ReadoutEnabled = false;

// Show the confirmation page after a successful recognition, or not
recoWithUI.Settings.ShowConfirmation = false;

SpeechRecognitionUIResult recoResult = await recoWithUI.RecognizeWithUIAsync();


[Screenshot: the recognition UI with the customized settings]

SpeechRecognitionResult / SpeechRecognitionUIResult

These are the objects that represent the recognition result and information about the recognition (obtained by calling the RecognizeAsync method on a SpeechRecognizer object, or the RecognizeWithUIAsync method on a SpeechRecognizerUI, according to the class we used).

SpeechRecognitionResult contains:

  • Text property– a string representation of the recognized spoken words.
  • TextConfidence property– an enum of type SpeechRecognitionConfidence that represents how confident the engine is that the result is accurate.
  • GetAlternates() method– which returns a read-only list of SpeechRecognitionResult objects with alternate possible results. In this example I gave the grammar a list of similar-sounding words, and used the alternates to display them all:

var recoWithUI = new SpeechRecognizerUI();

string[] confusingBabylon5Characters = { "Delenn", "Gelenn", "Mellen", "Bellen" };
recoWithUI.Recognizer.Grammars.AddGrammarFromList("myB5CharactersList", confusingBabylon5Characters);

SpeechRecognitionUIResult recoResult = await recoWithUI.RecognizeWithUIAsync();

StringBuilder builder = new StringBuilder();

// Get the alternate results from the recognition (I passed 10 as the maximum number of alternates to retrieve).
IReadOnlyList<SpeechRecognitionResult> speechRecognitionResults = recoResult.RecognitionResult.GetAlternates(10);

foreach (SpeechRecognitionResult speechRecognitionResult in speechRecognitionResults)
{
    builder.AppendLine(speechRecognitionResult.Text);
}

MessageBox.Show(builder.ToString());

[Screenshot: alternate recognition results displayed in a MessageBox]

  •  Semantics property- semantic interpretation tags taken from the SRGS file, which let us work with a defined meaning of a rule (its tag) rather than with what was literally said. They can be added to SRGS files in the standard Semantic Interpretation for Speech Recognition format. The Semantics property is a collection of semantics, available as a dictionary whose keys are the output variables from the SRGS file and whose values are the tags of the words spoken by the user.

For example, in this SRGS grammar I have a sentence with two selection outputs (the name of a Babylon 5 character and the name of a Farscape character). The semantics in this case are just the characters’ initials (I’ve added resources about semantics at the end of this blog post if you want to understand the syntax better):

<?xml version="1.0" encoding="utf-8" ?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar
                             http://www.w3.org/TR/speech-grammar/grammar.xsd"
         version="1.0" mode="voice" tag-format="semantics/1.0" root="main">

  <rule id="main" scope="public">
    <item repeat="0-1">character</item>
    <ruleref uri="#babylon5Characters"/>
    <tag>out.MyB5Character=rules.babylon5Characters;</tag>
    <item repeat="0-1">and</item>
    <ruleref uri="#farscapeCharacters"/>
    <tag>out.MyFarscapeCharacter=rules.farscapeCharacters;</tag>
  </rule>

  <rule id="babylon5Characters">
    <one-of>
      <item> John Sheriden <tag>out="JS";</tag></item>
      <item> Jeffry Sinclair <tag>out="SJ";</tag></item>
      <item> Londo Molari <tag>out="LM";</tag></item>
      <item> Delenn <tag>out="D";</tag></item>
      <item> Susan Ivanova <tag>out="SI";</tag></item>
    </one-of>
  </rule>

  <rule id="farscapeCharacters">
    <one-of>
      <item> John Crichton <tag>out="JC";</tag></item>
      <item> Aeryn Sun <tag>out="AS";</tag></item>
      <item> Ka Dargo <tag>out="KD";</tag></item>
      <item> Chianna <tag>out="C";</tag></item>
      <item> Rygel <tag>out="R";</tag></item>
    </one-of>
  </rule>

</grammar>


And to access the semantics in code, we can simply write, for example:

var recoWithUI = new SpeechRecognizerUI();

Uri babylon5Grammer = new Uri("ms-appx:///Babylon5Characters.grxml", UriKind.Absolute);
recoWithUI.Recognizer.Grammars.AddGrammarFromUri("b5", babylon5Grammer);

SpeechRecognitionUIResult result = await recoWithUI.RecognizeWithUIAsync();

// I use a StringBuilder and add the entire semantics dictionary to it as strings.
StringBuilder builder = new StringBuilder();

foreach (KeyValuePair<string, SemanticProperty> keyValuePair in result.RecognitionResult.Semantics)
{
    // For each semantic in the collection, add the key (the name of the out variable from the XML), how many values it has, and the first available value.
    builder.AppendLine(keyValuePair.Key);
    builder.AppendLine(keyValuePair.Value.GetAllValues().Count().ToString());
    builder.AppendLine(keyValuePair.Value.GetAllValues().First().ToString());
}

MessageBox.Show(builder.ToString());


And the result for my input would look like:

[Screenshots: the "heard you say" confirmation and the semantics values displayed]

SpeechRecognitionUIResult

It contains:

  • RecognitionResult property, which is the SpeechRecognitionResult object
  • ResultStatus property, which is of type SpeechRecognitionUIStatus, an enum indicating the outcome status of the recognition:

SpeechRecognitionUIResult result = await recoWithUI.RecognizeWithUIAsync();

SpeechRecognitionUIStatus speechRecognitionUiStatus = result.ResultStatus;

if (speechRecognitionUiStatus == SpeechRecognitionUIStatus.Succeeded)
{
    //Do one thing
}
else if (speechRecognitionUiStatus == SpeechRecognitionUIStatus.Cancelled)
{
    //Do something else
}

//Add more enum values to handle…


Resources:
