Getting Started with Speech to Text

** Update (8th June 2018) **
It appears the API endpoint has recently changed (this may have occurred around MS Build 2018). I have since updated the code sample. Note: To remain up to date, refer to Microsoft's official documentation.

Contents

  1. Bing Speech API
  2. Use Cases
  3. Key Concepts
  4. URL Query Parameters
  5. Pricing
  6. Demo: Speech to Text (Python)

1. Bing Speech API

Part of Azure Cognitive Services, the Bing Speech API is built on the same underlying speech recognition technology as other Microsoft products such as Cortana.

At a high level, the API is capable of:

  • Converting Speech to Text
  • Converting Text to Speech

speech_to_text.png

2. Use Cases

  • Transcribe and analyse customer call centre data.
  • Build intelligent applications that can be triggered by voice.
  • Increase accessibility for users with impaired vision.

3. Key Concepts

Utterance
A sequence of continuous speech followed by a clear pause.

Audio Stream
To optimise performance, audio data (e.g. speech captured from a microphone) is typically collected, sent and transcribed in chunks, which together form a stream.

API Key
An API key is required to work with the API programmatically. It can be obtained from the Azure Portal once a Bing Speech resource has been created.

4. URL Query Parameters

Recognition Mode
The service optimises speech recognition based on the mode specified, so it is important to choose the mode most appropriate to your application. A concise summary follows; for more details, check out Microsoft's documentation.

  • Interactive: Formal + Short & Sharp (utterances typically last 2 - 3 seconds).
  • Conversation: Informal.
  • Dictation: Formal + Longer Utterances (full sentences that typically last 5 - 8 seconds).

Language Tag
Define the target language for conversion (e.g. en-AU, en-GB, en-US). See supported languages for a complete list.

Output Format
The response is returned as JSON with the output format set to simple by default.

  • Simple: Result as display text.
  • Detailed: List of results (i.e. all possible interpretations) paired with a confidence score.
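
The three parameters above can be combined into a request URL. The snippet below is a minimal sketch, not an official client: the URL pattern is copied from the demo code in section 6, and the helper name `build_recognition_url` and its validation sets are hypothetical, introduced here only for illustration.

```python
from urllib.parse import urlencode

# Values taken from the lists in this section.
VALID_MODES = {"interactive", "conversation", "dictation"}
VALID_FORMATS = {"simple", "detailed"}


def build_recognition_url(region, mode="interactive", language="en-US",
                          output_format="simple"):
    """Build the speech recognition URL (hypothetical helper).

    Assumes the same URL pattern as the demo code in section 6.
    """
    if mode not in VALID_MODES:
        raise ValueError("Unknown recognition mode: {0}".format(mode))
    if output_format not in VALID_FORMATS:
        raise ValueError("Unknown output format: {0}".format(output_format))
    # urlencode escapes the query string values safely.
    query = urlencode({"language": language, "format": output_format})
    return ("https://{0}.stt.speech.microsoft.com/speech/recognition/"
            "{1}/cognitiveservices/v1?{2}").format(region, mode, query)


if __name__ == "__main__":
    print(build_recognition_url("westus", mode="dictation", language="en-AU"))
```

Validating the mode and format up front means a typo raises a local error rather than producing a failed request against the service.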

5. Pricing

Fortunately for developers, there is a free tier that should be more than sufficient to get you started.

  • Free Tier (F0): Maximum of 5 calls per second; Maximum of 5,000 transactions per month.
  • Standard Tier (S0): Maximum of 20 calls per second; £3 GBP / $4 USD / $5 AUD per 1,000 transactions.

Note: Pricing is correct as of this post. Check Microsoft's website for up-to-date pricing.
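
To get a feel for what these tiers mean in practice, here is a rough sketch of the arithmetic. It uses only the figures quoted above (5,000 free transactions per month, $4 USD per 1,000 transactions on S0) and assumes the standard tier is billed in proportion to usage, which may not match how Microsoft actually rounds or bills. The function names are hypothetical.

```python
FREE_TIER_MONTHLY_LIMIT = 5000   # transactions per month (F0), as quoted above
STANDARD_USD_PER_1000 = 4.0      # USD per 1,000 transactions (S0), as quoted above


def fits_free_tier(transactions_per_month):
    """Return True if the monthly volume stays within the free tier limit."""
    return transactions_per_month <= FREE_TIER_MONTHLY_LIMIT


def estimate_standard_cost_usd(transactions_per_month):
    """Estimate the monthly S0 cost in USD.

    Assumption: cost scales linearly with the number of transactions.
    """
    return transactions_per_month / 1000 * STANDARD_USD_PER_1000


if __name__ == "__main__":
    for volume in (3000, 25000):
        print(volume, fits_free_tier(volume), estimate_standard_cost_usd(volume))
```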

6. Demo: Speech to Text (Python)

In this demo, we will invoke the speech recognition service by using the REST API in Python.

Prerequisites

Steps

1. Create a Bing Speech API resource within the Azure Portal.

Screen Shot 2018-02-09 at 2.40.21 pm.png

2. Create a virtual environment (Python 3) with the requests library.

virtualenv.gif

3. Copy and paste the code sample below into a file within your virtual environment (e.g. handler.py). Be sure to update the API key, which you can obtain from the Azure Portal under "Bing Speech API > Resource Management > Keys".

Note: The audio file path will also need to be updated. The path of least resistance is to place the audio file in the same directory as the script.

python_code.gif

High Level Flow

python_speech_to_text_flow.png

Code

import json
import requests

YOUR_API_KEY = 'ENTER_YOUR_KEY_HERE'
YOUR_AUDIO_FILE = 'ENTER_PATH_TO_YOUR_AUDIO_FILE_HERE'
REGION = 'ENTER_YOUR_REGION' # westus, eastasia, northeurope 
MODE = 'interactive'
LANG = 'en-US'
FORMAT = 'simple'


def handler():
    # 1. Get an Authorization Token
    token = get_token()
    # 2. Perform Speech Recognition
    results = get_text(token, YOUR_AUDIO_FILE)
    # 3. Print Results
    print(results)

def get_token():
    # Return an Authorization Token by making an HTTP POST request to Cognitive Services with a valid API key.
    url = 'https://api.cognitive.microsoft.com/sts/v1.0/issueToken'
    headers = {
        'Ocp-Apim-Subscription-Key': YOUR_API_KEY
    }
    r = requests.post(url, headers=headers)
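    r.raise_for_status()  # fail early on an invalid key or endpoint
    # Use r.text (str): r.content is bytes and would be formatted as "b'...'" in the Bearer header on Python 3
    token = r.text
    return token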

def get_text(token, audio):
    # Request that the Bing Speech API convert the audio to text
    url = 'https://{0}.stt.speech.microsoft.com/speech/recognition/{1}/cognitiveservices/v1?language={2}&format={3}'.format(REGION, MODE, LANG, FORMAT)
    headers = {
        'Accept': 'application/json',
        'Ocp-Apim-Subscription-Key': YOUR_API_KEY,
        'Transfer-Encoding': 'chunked',
        'Content-type': 'audio/wav; codec=audio/pcm; samplerate=16000',
        'Authorization': 'Bearer {0}'.format(token)
    }
    r = requests.post(url, headers=headers, data=stream_audio_file(audio))
    results = json.loads(r.content)
    return results

def stream_audio_file(speech_file, chunk_size=1024):
    # Chunk audio file
    with open(speech_file, 'rb') as f:
        while True:
            data = f.read(chunk_size)  # honour the chunk_size argument
            if not data:
                break
            yield data

if __name__ == '__main__':
    handler()
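
The Content-type header in the code above declares the audio as 16 kHz PCM WAV. If your file was recorded at a different sample rate, it may be worth checking before sending it. The following is a small sketch using Python's standard `wave` module; the helper name `describe_wav` is hypothetical, and the check assumes an uncompressed WAV file that the `wave` module can read.

```python
import wave


def describe_wav(path, expected_rate=16000):
    """Read basic properties of a WAV file (hypothetical helper).

    expected_rate defaults to 16000 to match the samplerate declared
    in the Content-type header of the demo code.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    return {
        "sample_rate": rate,
        "channels": channels,
        "bits_per_sample": bits,
        "matches_header": rate == expected_rate,
    }


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(describe_wav(sys.argv[1]))
```

Run it as `python check_wav.py your_audio.wav` (the script name is arbitrary). If `matches_header` is False, either resample the audio or adjust the header to match the file.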

Output

{  
   "RecognitionStatus":"Success",
   "DisplayText":"Play Billie Jean by Michael Jackson.",
   "Offset":9700000,
   "Duration":19800000
}
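
The Offset and Duration values are easier to read once converted to seconds. The sketch below assumes they are expressed in 100-nanosecond units, as described in Microsoft's documentation for this API at the time of writing; verify this against the current documentation. The helper name `summarise_result` is hypothetical, and it only treats the "Success" status shown above as a usable result.

```python
TICKS_PER_SECOND = 10000000  # assumption: Offset/Duration are in 100-nanosecond units


def summarise_result(result):
    """Convert a simple-format response into text plus timings in seconds.

    Returns None when RecognitionStatus is anything other than "Success".
    """
    if result.get("RecognitionStatus") != "Success":
        return None
    return {
        "text": result["DisplayText"],
        "start_seconds": result["Offset"] / TICKS_PER_SECOND,
        "duration_seconds": result["Duration"] / TICKS_PER_SECOND,
    }


if __name__ == "__main__":
    sample = {
        "RecognitionStatus": "Success",
        "DisplayText": "Play Billie Jean by Michael Jackson.",
        "Offset": 9700000,
        "Duration": 19800000,
    }
    print(summarise_result(sample))
```

Under that assumption, the utterance in the sample output starts roughly 0.97 seconds into the audio and lasts roughly 1.98 seconds.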

Result Visualised

Screen Shot 2018-02-09 at 3.17.28 pm.png