Getting Started with Speech to Text
Update (8th June 2018)
It appears the API endpoint has recently changed (this may have occurred around MS Build 2018). I have since updated the code sample. Note: To remain up to date, refer to Microsoft's official documentation.
Contents
- Bing Speech API
- Use Cases
- Key Concepts
- URL Query Parameters
- Pricing
- Demo: Speech to Text (Python)
1. Bing Speech API
Part of Azure Cognitive Services, the Bing Speech API is built on the same underlying speech recognition technology as other Microsoft products such as Cortana.
At a high level, the API is capable of:
- Converting Speech to Text
- Converting Text to Speech
2. Use Cases
- Transcribe and analyse customer call centre data.
- Build intelligent applications that can be triggered by voice.
- Increase accessibility for users with impaired vision.
3. Key Concepts
Utterance
A sequence of continuous speech followed by a clear pause.
Audio Stream
To optimise performance, audio data (e.g. speaking into a mic) is typically collected, sent and transcribed in chunks to form a stream.
API Key
An API key is required to work with the API programmatically. It can be obtained from the Azure Portal once a Bing Speech resource has been created.
4. URL Query Parameters
Recognition Mode
The service optimises speech recognition based on which mode is specified, so it is important to choose the mode most appropriate to your application. A concise summary follows; for more details, check out Microsoft's documentation.
- Interactive: Formal + Short & Sharp (utterances typically last 2 - 3 seconds).
- Conversation: Informal.
- Dictation: Formal + Longer Utterances (full sentences that typically last 5 - 8 seconds).
Language Tag
Define the target language for conversion (e.g. en-AU, en-GB, en-US). See supported languages for a complete list.
Output Format
The response is returned as JSON with the output format set to simple by default.
- Simple: Result as display text.
- Detailed: List of results (i.e. all possible interpretations) paired with a confidence score.
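The three parameters above end up in the request URL used by the demo later in this post. As a minimal sketch, assuming the endpoint pattern from that demo, the URL could be assembled and validated as follows. The helper name build_recognition_url and the two validation sets are illustrative only and are not part of any Microsoft SDK.

```python
from urllib.parse import urlencode

# Values taken from the descriptions in this section. Treat them as
# assumptions and check the official documentation for the current list.
RECOGNITION_MODES = {"interactive", "conversation", "dictation"}
OUTPUT_FORMATS = {"simple", "detailed"}


def build_recognition_url(region, mode="interactive", language="en-US",
                          output_format="simple"):
    """Assemble the speech recognition URL (hypothetical helper)."""
    if mode not in RECOGNITION_MODES:
        raise ValueError("Unknown recognition mode: {0}".format(mode))
    if output_format not in OUTPUT_FORMATS:
        raise ValueError("Unknown output format: {0}".format(output_format))
    # The recognition mode is part of the path; language and format are
    # passed as query parameters.
    query = urlencode({"language": language, "format": output_format})
    return (
        "https://{0}.stt.speech.microsoft.com/speech/recognition/{1}"
        "/cognitiveservices/v1?{2}".format(region, mode, query)
    )


print(build_recognition_url("westus", mode="dictation", language="en-AU"))
```

Rejecting an unknown mode or format before the request is sent gives a clearer error than whatever the service would return for a malformed URL.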
5. Pricing
Fortunately for developers, there is a free tier that should be more than sufficient to get you started.
- Free Tier (F0): Maximum of 5 calls per second; Maximum of 5,000 transactions per month.
- Standard Tier (S0): Maximum of 20 calls per second; £3GBP/$4USD/$5AUD per 1,000 transactions.
Note: Pricing is correct as of this post. Check Microsoft's website for up-to-date pricing.
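To get a feel for which tier you need, here is a rough back-of-the-envelope sketch using the figures quoted above. It assumes the Standard Tier is billed linearly per transaction, which may not match how Microsoft actually meters usage, and the function names are made up for illustration.

```python
# Figures copied from the tiers listed above; they are likely out of date.
FREE_TIER_MONTHLY_LIMIT = 5000   # F0: transactions per month
STANDARD_USD_PER_1000 = 4.0      # S0: USD per 1,000 transactions


def fits_free_tier(transactions_per_month):
    """Return True if the monthly volume stays within the F0 allowance."""
    return transactions_per_month <= FREE_TIER_MONTHLY_LIMIT


def estimate_standard_cost_usd(transactions_per_month):
    """Estimate the S0 monthly cost, assuming simple linear billing."""
    return transactions_per_month / 1000 * STANDARD_USD_PER_1000


print(fits_free_tier(3000))               # True
print(estimate_standard_cost_usd(25000))  # 100.0
```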
6. Demo: Speech to Text (Python)
In this demo, we will invoke the speech recognition service by using the REST API in Python.
Prerequisites
- An Azure subscription
- A sample audio file (the code sample below declares 16 kHz PCM WAV in its Content-type header)
Steps
1. Create a Bing Speech API resource within the Azure Portal.
2. Create a virtual environment (Python 3) with the requests library.
3. Copy and paste the code sample below into a file within your virtual environment (e.g. handler.py). Be sure to update the API key, which you can obtain from the Azure Portal under "Bing Speech API > Resource Management > Keys".
Note: The audio file path will also need to be updated. The path of least resistance is to place the audio file in the same directory as the script.
High Level Flow
Code
import json
import requests

YOUR_API_KEY = 'ENTER_YOUR_KEY_HERE'
YOUR_AUDIO_FILE = 'ENTER_PATH_TO_YOUR_AUDIO_FILE_HERE'
REGION = 'ENTER_YOUR_REGION'  # westus, eastasia, northeurope
MODE = 'interactive'
LANG = 'en-US'
FORMAT = 'simple'


def handler():
    # 1. Get an Authorization Token
    token = get_token()
    # 2. Perform Speech Recognition
    results = get_text(token, YOUR_AUDIO_FILE)
    # 3. Print Results
    print(results)


def get_token():
    # Return an Authorization Token by making an HTTP POST request to Cognitive Services with a valid API key.
    url = 'https://api.cognitive.microsoft.com/sts/v1.0/issueToken'
    headers = {
        'Ocp-Apim-Subscription-Key': YOUR_API_KEY
    }
    r = requests.post(url, headers=headers)
    # Fail early on an invalid key rather than sending an error body as the token
    r.raise_for_status()
    # Use r.text (str), not r.content (bytes), so the Authorization header
    # is not rendered as "Bearer b'...'" in Python 3
    return r.text


def get_text(token, audio):
    # Request that the Bing Speech API convert the audio to text
    url = 'https://{0}.stt.speech.microsoft.com/speech/recognition/{1}/cognitiveservices/v1?language={2}&format={3}'.format(REGION, MODE, LANG, FORMAT)
    headers = {
        'Accept': 'application/json',
        'Ocp-Apim-Subscription-Key': YOUR_API_KEY,
        'Transfer-Encoding': 'chunked',
        'Content-type': 'audio/wav; codec=audio/pcm; samplerate=16000',
        'Authorization': 'Bearer {0}'.format(token)
    }
    # Passing a generator as data makes requests stream the body in chunks
    r = requests.post(url, headers=headers, data=stream_audio_file(audio))
    results = json.loads(r.content)
    return results


def stream_audio_file(speech_file, chunk_size=1024):
    # Yield the audio file in chunks of chunk_size bytes
    with open(speech_file, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data


if __name__ == '__main__':
    handler()
Output
{
  "RecognitionStatus": "Success",
  "DisplayText": "Play Billie Jean by Michael Jackson.",
  "Offset": 9700000,
  "Duration": 19800000
}
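The Offset and Duration values in the output above are not in seconds. As far as I understand the API, they are expressed in 100-nanosecond units, so it is worth confirming this against the official documentation. On that assumption, a small sketch for turning a simple-format response into something more readable might look like this. The helper name summarise_result is hypothetical.

```python
# Assumption: Offset and Duration are reported in 100-nanosecond ticks.
TICKS_PER_SECOND = 10_000_000


def summarise_result(result):
    """Return the text and timing of a successful result, or None otherwise."""
    if result.get("RecognitionStatus") != "Success":
        return None
    offset = result["Offset"]
    duration = result["Duration"]
    return {
        "text": result["DisplayText"],
        "start_seconds": offset / TICKS_PER_SECOND,
        "end_seconds": (offset + duration) / TICKS_PER_SECOND,
    }


sample = {
    "RecognitionStatus": "Success",
    "DisplayText": "Play Billie Jean by Michael Jackson.",
    "Offset": 9700000,
    "Duration": 19800000,
}
print(summarise_result(sample))
```

For the sample response, this reports speech starting at 0.97 seconds and ending at 2.95 seconds into the audio.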
Result Visualised