Text Analytics with Microsoft Cognitive Services

In a previous post (Text Mining POTUS with Python), I showed how NLTK can be used to analyse raw text input and derive linguistic features using pure Python. Today, we are going to look at how text analytics can be made even easier using the Text Analytics API, which is part of Microsoft's Cognitive Services suite.

As it currently stands, the API offers three main capabilities:

Sentiment Analysis
Key Phrase Extraction
Language Detection

Prerequisites

An Azure subscription
A Text Analytics API resource (available via the Marketplace)

Tip: Microsoft offers a free tier which allows up to 5,000 transactions per month. Keep in mind that each document processed counts as a transaction.

Demonstration

The example code below demonstrates all three APIs. To get the example working, you will need:

Python 3
The "requests" library (pip install requests)
Your own values for two variables: ACCESS_KEY and URL

The approach for all three APIs is identical:

Prepare the HTTP request header to include the ACCESS_KEY.
Construct the URL for the appropriate HTTP endpoint (e.g. languages, keyPhrases or sentiment).
Create a POST request which includes the JSON documents to be processed in the body.
Load the response.

The only subtle difference is the structure of the JSON document input when detecting languages (id & text) vs. extracting key phrases or attaining sentiment (id, language and text).
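
To make that difference concrete, here is a minimal sketch of the two payload shapes. The helper names language_payload and analysis_payload are hypothetical and exist only for this illustration; the full example below builds the same dictionaries by hand.

```python
import json

def language_payload(texts):
    """Body for the languages endpoint: each document has id and text only."""
    return {'documents': [
        {'id': str(i), 'text': text}
        for i, text in enumerate(texts, start=1)
    ]}

def analysis_payload(texts, language='en'):
    """Body for keyPhrases or sentiment: each document has id, language and text."""
    return {'documents': [
        {'id': str(i), 'language': language, 'text': text}
        for i, text in enumerate(texts, start=1)
    ]}

if __name__ == '__main__':
    print(json.dumps(language_payload(['Bonjour le monde']), ensure_ascii=False))
    print(json.dumps(analysis_payload(['Hello world'])))
```

Either dictionary can be serialised with json.dumps and sent as the body of the POST request.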

"""
File Name:      text_analytics.py
Author:         Taygan Rifat
Python Version: 3.6.1
Date Created:   2018-01-18
"""
import json
import requests

# Azure Portal > Text Analytics API Resource > Keys
ACCESS_KEY = 'INSERT_YOUR_ACCESS_KEY_HERE'
# Text Analytics API Base URL
URL = 'https://YOUR_REGION.api.cognitive.microsoft.com/text/analytics/v2.0/'

def get_insights(api, documents):
    """
    Get insights using Microsoft Cognitive Service - Text Analytics
    """
    # 1. Set a Request Header to include the Access Key
    headers = {'Ocp-Apim-Subscription-Key': ACCESS_KEY}
    # 2. Set the HTTP endpoint
    url = URL + api
    # 3. Create a POST request with the JSON documents
    request = requests.post(url, headers=headers, data=json.dumps(documents))
    # 4. Load Response
    response = json.loads(request.content)

    print('------------------------------------')
    print('API: ' + api)
    for document in response['documents']:
        print(document)

def language_detection():
    """
    The API returns the detected language and a numeric score between 0 and 1 indicating certainty.
    """
    documents = {
        'documents': [
            {"id":"1", "text":"Le renard brun rapide saute par-dessus le chien paresseux" },
            {"id":"2", "text":"敏捷的棕色狐狸跳过了懒狗" },
            {"id":"3", "text":"The quick brown fox jumps over the lazy dog" }
        ]
    }
    get_insights('languages', documents)

def key_phrases():
    """
    The API returns a list of strings denoting the key talking points in the input text.
    """
    documents = {
        'documents': [
            { "id":"1", "language":"en", "text":"Apple's plan to bring home hundreds of billions of dollars in overseas cash has triggered a guessing game on Wall Street about what it might do with all that money. The tech giant could find itself with about $200 billion to spend, after taxes, if it repatriates all its overseas holdings into the U.S." },
            { "id":"2", "language":"en", "text":"Tableau Software is revamping a core part of its technology to analyse data faster, a move intended to keep up with its customers' increasing big-data needs. The Seattle company, which makes software to visualise analytics, is introducing its so-called Hyper engine in a software update Jan 17. The technology is designed to make the data-visualisation process five times faster, meaning businesses can input millions of data points and see results in seconds." },
            { "id":"3", "language":"en", "text":"Reviews of the Tesla Model 3 praise the car as a futuristic, mold-breaking car that may be the best electric vehicle at its price point. But that doesn't mean it's perfect. Overall, Tesla's first attempt at a less expensive car than their higher-end S and X models has received strong acclaim for its smooth, quiet ride, uniquely minimalist interior and dashboard, and body design." }
        ]
    }
    get_insights('keyPhrases', documents)

def sentiment():
    """
    The API returns a numeric score between 0 and 1. Scores close to 1 indicate positive sentiment, and scores close to 0 indicate negative sentiment.
    """
    documents = {
        'documents': [
            { "id":"1", "language":"en", "text":"What a great way to run the public transport in a city ! Loved the regular frequency, clear mapping and the accessible stops. Well done Melbourne !" },
            { "id":"2", "language":"en", "text":"Boarding at Spring st, near Parliament station - initially very crowded as the previous tram broke down, the journey went half way around the city - when at the corner of Flinders and Spencer St we were ll advised to disembark - as it was the end of the drivers shift - and there was no replacement driver - over 100 people were left stranded - truly a poor example of Melbourn hospitality. - Many tourists not knowing how to get back to there original destination." },
            { "id":"3", "language":"en", "text":"What a terrific way to get around the Melbourne CBD. You can hope on any tram within the CBD area and it is free. The Number 35 tram does a complete circuit of the CBD with commentary about Melbourne landmarks but it can get very crowded. Make sure you use it." },
        ]
    }
    get_insights('sentiment', documents)

if __name__ == '__main__':
    language_detection()
    key_phrases()
    sentiment()

Output

------------------------------------
API: languages
{'id': '1', 'detectedLanguages': [{'name': 'French', 'iso6391Name': 'fr', 'score': 1.0}]}
{'id': '2', 'detectedLanguages': [{'name': 'Chinese_Simplified', 'iso6391Name': 'zh_chs', 'score': 1.0}]}
{'id': '3', 'detectedLanguages': [{'name': 'English', 'iso6391Name': 'en', 'score': 1.0}]}
------------------------------------
API: keyPhrases
{'keyPhrases': ['overseas cash', 'overseas holdings', 'home hundreds of billions', 'guessing game', 'dollars', "Apple's plan", 'Wall Street', 'tech giant', 'money', 'taxes'], 'id': '1'}
{'keyPhrases': ['millions of data points', 'data-visualisation process', 'software update', 'big-data needs', 'Tableau Software', 'technology', 'businesses', 'so-called Hyper engine', 'times', 'results', 'analytics', "customers'", 'seconds', 'core', 'Seattle company'], 'id': '2'}
{'keyPhrases': ['expensive car', 'mold-breaking car', 'minimalist interior', 'quiet ride', 'strong acclaim', 'X models', 'higher-end S', 'dashboard', 'best electric vehicle', 'price point', 'body design', 'Tesla Model', 'attempt', 'Reviews', "Tesla's"], 'id': '3'}
------------------------------------
API: sentiment
{'score': 0.9607083797454834, 'id': '1'}
{'score': 0.15545335412025452, 'id': '2'}
{'score': 0.9676148891448975, 'id': '3'}
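
Raw scores can be easier to scan once they are mapped to coarse labels. The sketch below is one way to do that; the function name label and the 0.4 and 0.6 cut-offs are my own assumptions for illustration, not values defined by the API.

```python
def label(score, low=0.4, high=0.6):
    """Map a 0-1 sentiment score to a coarse label.

    The low and high cut-offs are assumptions for illustration only.
    """
    if score < low:
        return 'negative'
    if score > high:
        return 'positive'
    return 'neutral'

if __name__ == '__main__':
    # Scores copied from the sample sentiment output above
    sample = [
        {'score': 0.9607083797454834, 'id': '1'},
        {'score': 0.15545335412025452, 'id': '2'},
        {'score': 0.9676148891448975, 'id': '3'},
    ]
    for document in sample:
        print(document['id'], label(document['score']))
```

With these cut-offs, the two complimentary tram reviews come out as positive and the stranded-passenger review as negative.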

Sentiment Analysis Example - IMDB User Reviews

Sentiment analysis can have a number of real-world business use cases, from analysing support calls in an effort to better understand Voice of the Customer, to supporting strategies when trading on financial markets. That said, in this example we are going to see if we can determine the quality of a movie by analysing the sentiment of user reviews from IMDB.

The Movies

Star Wars: Episode IV – A New Hope (1977)
Star Wars: Episode V – The Empire Strikes Back (1980)
Star Wars: Episode VI – Return of the Jedi (1983)
Star Wars: Episode I – The Phantom Menace (1999)
Star Wars: Episode II – Attack of the Clones (2002)
Star Wars: Episode III – Revenge of the Sith (2005)

High-Level Flow

The Python script scrapes user reviews from IMDB. The response is received as raw HTML.
The HTML is parsed and converted into JSON as an array of documents (ID, Language and Text).
An HTTP POST request is made to the Text Analytics API with the JSON passed as data.
The HTTP response contains the results in JSON as an array of documents (ID, Score).
The final output is saved to CSV (Movie Name, Document ID, Score).
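
The shape of this flow can be sketched offline, without scraping or calling the API. Everything here is a hypothetical stand-in: fake_sentiment replaces the HTTP call and returns a fixed score, and the function names are not taken from the script that follows.

```python
import csv
import io

def build_documents(reviews):
    """Wrap review texts as Text Analytics documents (ID, Language, Text)."""
    return {'documents': [
        {'id': str(i), 'language': 'en', 'text': text}
        for i, text in enumerate(reviews, start=1)
    ]}

def fake_sentiment(documents):
    """Stand-in for the HTTP POST; returns a fixed score so the sketch runs offline."""
    return {'documents': [
        {'id': document['id'], 'score': 0.5}
        for document in documents['documents']
    ]}

def rows_from_response(title, response):
    """Turn the response (ID, Score) into CSV rows (Movie Name, Document ID, Score)."""
    return [(title, document['id'], document['score'])
            for document in response['documents']]

def run_flow(title, reviews, score_fn=fake_sentiment):
    """Build documents, score them, and return the CSV output as a string."""
    response = score_fn(build_documents(reviews))
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows_from_response(title, response))
    return buffer.getvalue()

if __name__ == '__main__':
    print(run_flow('Example Movie', ['Loved it.', 'Not for me.']))
```

Passing the scoring step in as score_fn keeps the data shaping testable on its own; the real script makes the request directly instead.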

Code

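Note: ACCESS_KEY and SENTIMENT_URL will need to be updated. The variable LIMIT caps the number of reviews collected per movie (currently set to 5). The script writes to ./input/ and ./output/, so both directories must exist before it is run.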

"""
File Name:      imdb.py
Description:    Calculate sentiment score for movie user reviews from IMDB.
Author:         Taygan Rifat
Python Version: 3.6.1
Date Created:   2018-01-18
"""
import csv
import json
import requests
from lxml import html

# IMDB
LIMIT = 5
MOVIES = [
    { "imdb_id":"tt0076759", "title":"Star Wars: Episode IV - A New Hope (1977)" },
    { "imdb_id":"tt0080684", "title":"Star Wars: Episode V - The Empire Strikes Back (1980)" },
    { "imdb_id":"tt0086190", "title":"Star Wars: Episode VI - Return of the Jedi (1983)" },
    { "imdb_id":"tt0120915", "title":"Star Wars: Episode I - The Phantom Menace (1999)" },
    { "imdb_id":"tt0121765", "title":"Star Wars: Episode II - Attack of the Clones (2002)" },
    { "imdb_id":"tt0121766", "title":"Star Wars: Episode III - Revenge of the Sith (2005)" },
]
IMDB_URL = 'http://www.imdb.com/title/'
HEADERS = {
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.'
}

# Azure - Text Analytics API
ACCESS_KEY = 'INSERT_YOUR_ACCESS_KEY_HERE'
SENTIMENT_URL = 'https://YOUR_REGION.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment'

def handler():
    """
    Handler
    """
    # Loop through each movie 
    for movie in MOVIES:
        print('Getting User Reviews for: ' + movie['title'])

        # 1. Initialise variables
        counter = 0
        comments = []

        # 2. Get Comments
        comments = get_comments(counter, None, comments, movie)

        # 3. Construct JSON for Text Analytics API HTTP request
        documents = {}
        documents['documents'] = []
        for comment in comments:
            documents['documents'].append({ 'id': comment[0], 'language': 'en', 'text': comment[1] })

        # 4. Get Sentiment
        results = get_sentiment(documents, movie)

        # 5. Dump comments to JSON
        with open('./input/' + movie['imdb_id'] + '.json', 'w') as outfile:
            json.dump(documents, outfile)

        # 6. Write results to CSV
        with open('./output/' + movie['imdb_id'] + ".csv", "w", newline='') as csv_file:
            writer = csv.writer(csv_file, delimiter=',')
            for line in results:
                writer.writerow(line)

    print('Finished!')

def get_comments(counter, data_key, comments, movie):
    """
    Get User Reviews from Load More
    """
    # 1. Get listings
    url = None
    if counter == 0:
        url = IMDB_URL + movie['imdb_id'] + '/reviews?ref_=tt_urv'
    else:
        url = IMDB_URL + movie['imdb_id'] + '/reviews/_ajax?ref_=undefined&paginationKey=' + data_key

    # Pass HEADERS by keyword; the second positional argument of requests.get is params
    request = requests.get(url, headers=HEADERS)
    response = html.fromstring(request.content)
    listings = response.xpath('//div[@class="lister-list"]/div')

    # 2. Loop through each User Review
    for listing in listings:
        # a) Get Text Element
        review = listing.xpath('.//div[@class="content"]/div[@class="text"]')[0]

        # b) Get Text Value
        review_text = None
        comment = None
        if review.text:
            review_text = review.text
            counter += 1
            comment = [counter, review_text]
        else:
            pass

        # c) Append comment
        if comment and counter <= LIMIT:
            comments.append(comment)
        else:
            pass

    # 3. Get Load More Data Key
    data_key = response.xpath('//div[@class="load-more-data"]')

    # 4. If Data Key exists
    if data_key and counter <= LIMIT:
        # Return the recursive result so the caller receives the list, not None
        return get_comments(counter, data_key[0].attrib['data-key'], comments, movie)
    return comments

def get_sentiment(documents, movie):
    "Get sentiment score for each comment"
    # 1. Get Sentiment Scores from Text Analytics API
    headers = {'Ocp-Apim-Subscription-Key': ACCESS_KEY}
    request = requests.post(SENTIMENT_URL, headers=headers, data=json.dumps(documents))
    response = json.loads(request.content)

    # 2. Parse results
    results = []
    for document in response['documents']:
        document_id = document['id']
        score = document['score']
        line = (movie['title'], document_id, score)
        results.append(line)

    # 3. Return results
    return results

if __name__ == '__main__':
    handler()

Evaluation

Movies ranked in order of IMDB Rating:

Star Wars: Episode V – The Empire Strikes Back (1980) [IMDB Rating: 8.8]
Star Wars: Episode IV – A New Hope (1977) [IMDB Rating: 8.7]
Star Wars: Episode VI – Return of the Jedi (1983) [IMDB Rating: 8.4]
Star Wars: Episode III – Revenge of the Sith (2005) [IMDB Rating: 7.6]
Star Wars: Episode II – Attack of the Clones (2002) [IMDB Rating: 6.6]
Star Wars: Episode I – The Phantom Menace (1999) [IMDB Rating: 6.5]

Movies ranked in order of Sentiment Score:

Star Wars: Episode V – The Empire Strikes Back (1980) [Sentiment Score: 8.7]
Star Wars: Episode IV – A New Hope (1977) [Sentiment Score: 8.4]
Star Wars: Episode I – The Phantom Menace (1999) [Sentiment Score: 8.0]
Star Wars: Episode VI – Return of the Jedi (1983) [Sentiment Score: 7.8]
Star Wars: Episode II – Attack of the Clones (2002) [Sentiment Score: 5.0]
Star Wars: Episode III – Revenge of the Sith (2005) [Sentiment Score: 2.2]

Notes:

In order to stay within the free tier's transaction limit, results are based on ~500 reviews per movie (i.e. ~3,000 reviews in total).
In an ideal world we would have calculated sentiment based on as much data as possible, but for the purposes of this exercise (conveying proof of value), the existing data set should be sufficient.
If you do repeat this exercise, be aware that your results may differ depending on which sample of user comments is analysed, and on whether Microsoft's API has since been updated.
While the sentiment score from Microsoft is provided as a value between 0 and 1, I have multiplied the results by 10 (e.g. 0.87 = 8.7) to make them easier to compare with IMDB ratings.
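
The arithmetic behind these notes can be sketched as follows. Two things here are assumptions on my part: that the per-movie figure is the mean of the individual review scores (the aggregate is not stated above), and the helper names themselves.

```python
from statistics import mean

FREE_TIER_LIMIT = 5000  # transactions per month, as quoted in the tip earlier

def transactions_needed(movie_count, reviews_per_movie):
    """Each document sent to the API counts as one transaction."""
    return movie_count * reviews_per_movie

def movie_score(scores):
    """Average the 0-1 review scores (an assumed aggregate) and rescale to 0-10."""
    return round(mean(scores) * 10, 1)

if __name__ == '__main__':
    needed = transactions_needed(6, 500)
    print(needed, needed <= FREE_TIER_LIMIT)
    print(movie_score([0.87]))
```

Six movies at roughly 500 reviews each comes to about 3,000 documents, which fits inside the monthly allowance.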

Insights:

Results in line with IMDB:

Episode V – The Empire Strikes Back (1980) is the best episode in the series.
Episode IV – A New Hope (1977) is the second best episode in the series.
Episodes II and III are poorer-quality movies in comparison.

Results out of sync:

Episode I – The Phantom Menace. According to IMDB, this was the lowest-rated movie, but the sentiment analysis gave it a favourable score of 8.0.

Lastly, the range of values for Episode V (as depicted by the boxplot visualisation) is a lot narrower. This may point to a tighter consensus amongst reviewers compared to other episodes.
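
One way to put a number on "narrower" is the interquartile range, which is the height of the box in a boxplot. This is a rough sketch only: the score lists are made-up illustrative values, not the review data from this exercise, and statistics.quantiles needs Python 3.8 or later.

```python
from statistics import quantiles

def interquartile_range(scores):
    """Width of the middle 50% of the scores (Q3 minus Q1)."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 - q1

if __name__ == '__main__':
    # Made-up values for illustration only, not real review scores
    tight_consensus = [0.80, 0.82, 0.85, 0.88, 0.90]
    mixed_opinions = [0.10, 0.40, 0.60, 0.90, 1.00]
    print(interquartile_range(tight_consensus))
    print(interquartile_range(mixed_opinions))
```

A smaller value means the reviews cluster more closely around the median score.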

Results Visualised

Finished

Hopefully this gives you a taste of how sentiment analysis can be used and just how accessible the technology is with ready to consume, publicly available services such as Microsoft's Text Analytics API.

Taygan
