Text Mining POTUS with Python

What is Text Mining?

"Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output." - Wikipedia

In other words: the process of creating data out of data (text), with the objective of gaining new insights (classification, sentiment, relationships, etc.).

Natural Language Toolkit (NLTK)

In this post, I will be using NLTK. NLTK is a suite of Python libraries that can be used for statistical natural language processing. For complete install instructions see: http://www.nltk.org/install.html. Alternatively, the standard pip command should be sufficient to get you going.

pip install nltk

In addition to the NLTK suite, several data packages are required: specifically punkt, stopwords and wordnet. To download these packages, run the Python interpreter and type the following commands:

import nltk
nltk.download()

Note: A new window should pop up allowing you to select the appropriate packages and click Download.

  • punkt can be found under "Models".
  • stopwords and wordnet can be found under "Corpora".
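Alternatively, if you would rather skip the GUI, the same packages can be downloaded directly by name from the interpreter:

import nltk

# Download only the data packages used in this post
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')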

A Single Tweet

Using the text contained within one of Donald Trump's latest tweets, we are going to engineer two features (i.e. creating data from data).

  • Word Count: The number of times a word appears in our parsed text.
  • Word Class: Also known as "part of speech", the category to which a word belongs (Noun, Verb, Adjective, Adverb).
[Image: screenshot of the tweet]

Step 1. Header Code
Once you have installed NLTK, create a new Python file and add the following code. Avoid naming the file nltk.py, as that would shadow the nltk package and break the import; something like trump_tweet.py works fine.

import string
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

FRIENDLY_WORD_CLASS = {
    "n": "Noun",
    "v": "Verb",
    "a": "Adjective",
    "s": "Adjective Satellite",
    "r": "Adverb"
}

TRUMP_TWEET = (
    'Reason I canceled my trip to London is that I am not a big fan of the Obama Administration ha'
    'ving sold perhaps the best located and finest embassy in London for “peanuts,” only to build '
    'a new one in an off location for 1.2 billion dollars. Bad deal. Wanted me to cut ribbon-NO!'
)

Step 2. Tokenize
Our first step in structuring the input text is to tokenize each element, separating words from punctuation. Using NLTK, this can be achieved in a single line by calling word_tokenize and passing our input text as a parameter.

# 1. Tokenize Words
words = nltk.word_tokenize(TRUMP_TWEET)

Output

['Reason', 'I', 'canceled', 'my', 'trip', 'to', 'London', 'is', 'that', 'I', 'am', 'not', 'a', 'big', 'fan', 'of', 'the', 'Obama', 'Administration', 'having', 'sold', 'perhaps', 'the', 'best', 'located', 'and', 'finest', 'embassy', 'in', 'London', 'for', '“', 'peanuts', ',', '”', 'only', 'to', 'build', 'a', 'new', 'one', 'in', 'an', 'off', 'location', 'for', '1.2', 'billion', 'dollars', '.', 'Bad', 'deal', '.', 'Wanted', 'me', 'to', 'cut', 'ribbon-NO', '!']

Step 3. Cleanse
As the Wikipedia definition suggests, the process of text mining typically involves both the addition and removal of data. In this case, we are going to remove what are commonly referred to as "stop words" (e.g. the, is, at, which).

While there is no universal list, NLTK has a data package to get us started, which we can enrich further with our own list. These words are generally deemed to be of little value during natural language processing, so removing them improves the signal-to-noise ratio and makes it easier to glean potential insights.

In the example below, we extend NLTK's list of stop words with punctuation from Python's string module, as well as our own custom list. Once the list of stop words has been generated, we filter our original tokenized words and output a cleansed list.

# 2. Generate a list of stop words (e.g. and, or, at, for)
stop_words = stopwords.words('english') + list(string.punctuation) + ['“','”']
# 3. Cleanse tokenized list of words
filtered_words = [word for word in words if word not in stop_words]
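Note: The stop words in NLTK's list are all lowercase, so the comparison above is case-sensitive; this is why the capitalised "I" survives the filter. If you would rather filter regardless of case, a minimal variant (not used in the rest of this post) is to lowercase each token before checking it:

# Optional: case-insensitive stop word filtering
filtered_words = [word for word in words if word.lower() not in stop_words]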

Output Visualised

[Image: the tokenized tweet with stop words removed]

Step 4. Enrich
We are going to enrich our data set by engineering two features, Word Count and Word Class.

a) Word Count
The Counter class from Python's collections module can be used to create a distinct list of values along with the number of times each value appears.

# 4. Count occurrences of each word
counted_words = Counter(filtered_words)

Output

Counter({'I': 2, 'London': 2, 'Reason': 1, 'canceled': 1, 'trip': 1, 'big': 1, 'fan': 1, 'Obama': 1, 'Administration': 1, 'sold': 1, 'perhaps': 1, 'best': 1, 'located': 1, 'finest': 1, 'embassy': 1, 'peanuts': 1, 'build': 1, 'new': 1, 'one': 1, 'location': 1, '1.2': 1, 'billion': 1, 'dollars': 1, 'Bad': 1, 'deal': 1, 'Wanted': 1, 'cut': 1, 'ribbon-NO': 1})

b) Word Class
NLTK's wordnet package can be used to tag each word with the appropriate class. We will achieve this in two parts: first, by creating a function that looks up and returns the word class for a given word (using the part of speech of the word's first WordNet synset); and second, by updating our main function to loop through each word and print the results.

Note: The most_common() method used below returns the counter's (word, count) pairs sorted by count in descending order.
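For example, using the counter built in the previous step:

print(counted_words.most_common(3))
# [('I', 2), ('London', 2), ('Reason', 1)] - ties keep insertion order in CPython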


    # 5. Results
    for word, count in counted_words.most_common():
        word_class = get_word_class(word)
        row = [word, count, word_class]
        print(row)


def get_word_class(word):
    """
    Get Word Class
    """
    # Initialise the Word Class variable
    word_class = None

    # Look the word up in the WordNet data package
    synsets = wn.synsets(word)

    # Proceed if the word exists in the data package
    if synsets:
        # Get the word class of the first (most common) synset
        word_class = synsets[0].pos()
        # Convert the word class key to friendly text
        word_class = FRIENDLY_WORD_CLASS[word_class]

    # Return Word Class
    return word_class

Output

['I', 2, 'Noun']
['London', 2, 'Noun']
['Reason', 1, 'Noun']
['canceled', 1, 'Verb']
['trip', 1, 'Noun']
['big', 1, 'Adjective']
['fan', 1, 'Noun']
['Obama', 1, None]
['Administration', 1, 'Noun']
['sold', 1, 'Verb']
['perhaps', 1, 'Adverb']
['best', 1, 'Noun']
['located', 1, 'Verb']
['finest', 1, 'Adjective Satellite']
['embassy', 1, 'Noun']
['peanuts', 1, 'Noun']
['build', 1, 'Noun']
['new', 1, 'Adjective']
['one', 1, 'Noun']
['location', 1, 'Noun']
['1.2', 1, None]
['billion', 1, 'Noun']
['dollars', 1, 'Noun']
['Bad', 1, 'Noun']
['deal', 1, 'Noun']
['Wanted', 1, 'Verb']
['cut', 1, 'Noun']
['ribbon-NO', 1, None]
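As an aside, the WordNet lookup above is context-free: it simply takes the part of speech of a word's first synset, which is why words such as cut and best come back as nouns despite how they are used in the tweet. An alternative (not used in this post) is NLTK's context-aware pos_tag tagger, which requires the averaged_perceptron_tagger data package. A minimal sketch, reusing the imports and the TRUMP_TWEET constant from our script:

# Context-aware tagging with NLTK's perceptron tagger
tagged = nltk.pos_tag(nltk.word_tokenize(TRUMP_TWEET))
for token, tag in tagged:
    # Tags are Penn Treebank codes, e.g. NNP (proper noun), VBD (past-tense verb)
    print(token, tag)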

Code Sample

Putting all our steps together, your code should look something like this:

import string
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

FRIENDLY_WORD_CLASS = {
    "n": "Noun",
    "v": "Verb",
    "a": "Adjective",
    "s": "Adjective Satellite",
    "r": "Adverb"
}

TRUMP_TWEET = (
    'Reason I canceled my trip to London is that I am not a big fan of the Obama Administration ha'
    'ving sold perhaps the best located and finest embassy in London for “peanuts,” only to build '
    'a new one in an off location for 1.2 billion dollars. Bad deal. Wanted me to cut ribbon-NO!'
)

def handler():
    # 1. Tokenize Words
    words = nltk.word_tokenize(TRUMP_TWEET)

    # 2. Generate a list of stop words (e.g. and, or, at, for)
    stop_words = stopwords.words('english') + list(string.punctuation) + ['“', '”']

    # 3. Cleanse tokenized list of words
    filtered_words = [word for word in words if word not in stop_words]

    # 4. Count occurrences of each word
    counted_words = Counter(filtered_words)

    # 5. Results
    for word, count in counted_words.most_common():
        word_class = get_word_class(word)
        row = [word, count, word_class]
        print(row)

def get_word_class(word):
    # Initialise the Word Class variable
    word_class = None

    # Look the word up in the WordNet data package
    synsets = wn.synsets(word)

    # Proceed if the word exists in the data package
    if synsets:
        # Get the word class of the first (most common) synset
        word_class = synsets[0].pos()
        # Convert the word class key to friendly text
        word_class = FRIENDLY_WORD_CLASS[word_class]

    # Return Word Class
    return word_class

if __name__ == '__main__':
    handler()

Evaluation and Interpretation

As you can probably appreciate, the volume of text in a single tweet is not sufficient for us to glean insights into Trump's use of the English language. So what's next?

More data!
The visualisation below is based on all of Donald Trump's tweets since 1 January 2017 (not including retweets): 2,341 tweets in total.

Note: I registered with Twitter's developer platform and used another Python library called "Tweepy" to extract all of this data. I will detail how to work with the Twitter API using Python in another post.
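For the curious, here is a rough sketch of the kind of Tweepy call involved. The credentials are placeholders, and the exact arguments may differ depending on your Tweepy version:

import tweepy

# Placeholder credentials from Twitter's developer platform
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

# Page through the user's timeline, excluding retweets
for status in tweepy.Cursor(api.user_timeline,
                            screen_name='realDonaldTrump',
                            tweet_mode='extended',
                            include_rts=False).items():
    print(status.full_text)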

Top 20 Words Tweeted by @realDonaldTrump
1 January 2017 to 14 January 2018

[Image: bar chart of the top 20 words]

He sure seems to like the word great, which got me thinking: what would the top words look like over time?

Top 3 Words Tweeted by @realDonaldTrump by Month
1 January 2017 to 14 January 2018

[Image: top 3 words tweeted each month]

Insight: The word great was Donald Trump's most-used word in every single month since 1 January 2017.

With a bit of regex magic, (?i)great (\w+).*?\s*, we can extract all the words that followed great. Below is a list of the top 10 (a short extraction sketch follows the list).

  1. great again - 71
  2. great honor - 40
  3. great job - 23
  4. great state - 21
  5. great day - 14
  6. great meeting - 14
  7. great people - 13
  8. great healthcare - 12
  9. great country - 11
  10. great american - 10
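For reference, a minimal sketch of that extraction, assuming the tweet texts have already been collected into a list called tweets:

import re
from collections import Counter

great_words = Counter()
for tweet in tweets:  # tweets is assumed to be a list of tweet strings
    # Capture the word immediately following "great" (case-insensitive)
    great_words.update(match.lower() for match in re.findall(r'(?i)great (\w+)', tweet))

print(great_words.most_common(10))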

Hopefully, this article gives you a basic understanding of text mining and how Python can be used to engineer features that yield insights from previously unstructured data such as text.

Lastly, I just wanted to finish off with a quick visualisation I pulled together based on an analysis of all the text contained in Fire and Fury. Below is a word cloud of the characters in the book, weighted by their mentions.

[Image: word cloud of Fire and Fury characters]
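For what it's worth, a cloud like this can be produced with the third-party wordcloud package. A rough sketch, assuming the mention counts have already been gathered into a dictionary (the names and numbers below are placeholders, not the book's actual figures):

from wordcloud import WordCloud

# mentions maps character name -> mention count (placeholder values only)
mentions = {'CharacterA': 120, 'CharacterB': 75, 'CharacterC': 40}
cloud = WordCloud(width=800, height=600, background_color='white')
cloud.generate_from_frequencies(mentions)
cloud.to_file('fury_cloud.png')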