Image Processing with Cognitive Services

Scenario

The week before writing this post, I needed to get an extract of all my financial transactions for the last year. My initial thinking: log on to my bank's customer portal and request an export to CSV/XLS. Sounds simple enough? Unfortunately for me (and I suspect for many others), computer says no! The web app would only export the last 30 days, so anything prior to that could only be accessed via statements downloaded in PDF format.

Challenge

Ultimately, I needed to extract meaningful data (Date, Transaction, Amount, etc.) into a structured format (i.e. CSV or XLS) that could be further analysed to derive actionable insights. In order to do this, the content of the PDF would need to be converted into text. Microsoft Cognitive Services to the rescue!

High-Level Flow

  1. Convert each page of each PDF into an image.
  2. Invoke the Computer Vision API to convert each image into text.
  3. Parse the JSON response from the API and output the results into a text file.
[Image: pdf_to_text.png]

Note: A quick Google search will show there are a ton of ways a PDF can be split and converted into images. In my specific example I used the "Render PDF Pages as Images" action within a simple Automator workflow on macOS. That said, this post will focus primarily on the Python script used to tap into the OCR capabilities of the Computer Vision API.
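If you would rather do the PDF-to-image step in Python as well, one option is the third-party pdf2image package, which wraps the poppler utilities. This is a sketch of an alternative to the Automator workflow, not what I used in the post:

```python
import os

# pdf2image is a third-party package (pip install pdf2image) that wraps
# the poppler utilities; guard the import so the helpers below still
# load in environments where it is not installed.
try:
    from pdf2image import convert_from_path
except ImportError:
    convert_from_path = None

def page_filename(pdf_path, page_number):
    # Build an output name like 'statement-01.jpeg' for a given page.
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return '{0}-{1:02d}.jpeg'.format(stem, page_number)

def pdf_to_images(pdf_path, out_dir):
    # Render each page of the PDF and save it as a JPEG in out_dir.
    for i, page in enumerate(convert_from_path(pdf_path), start=1):
        page.save(os.path.join(out_dir, page_filename(pdf_path, i)), 'JPEG')
```

The JPEG naming keeps the pages in page order, which matters later because the main script walks the directory in sorted order.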

Computer Vision

Computer Vision provides developers with a number of image processing capabilities via a simple HTTP endpoint, from tagging images based on their content to celebrity recognition. To find out more, check out Microsoft's official documentation.

To get started we need an API key. To obtain one, create a Computer Vision API resource within your Azure subscription (Azure Portal > Search the marketplace for "Computer Vision API" > Create).

Note: At the time of writing, the free tier entitles you to 20 calls per minute, capped at 5,000 calls per month.

[Image: ComputerVisionAPI.gif]

API Key and HTTP Endpoint

Once the Computer Vision API resource has been created, there are two key pieces of information we need to make note of for later use within our code. Navigate to the sections as outlined below and copy/paste the values into a text editor.

  • Computer Vision API > Overview > Endpoint
  • Computer Vision API > Keys (under Resource Management) > Key 1

Python Environment

Since we interact with the API via an HTTP endpoint, you are free to use whatever language you like. In this post, I'll focus on how it can be done using Python. The only dependency is the requests library, so you will need to install it (pip install requests) before updating and executing any code.

[Image: PythonEnvRequests.gif]

Code

Finally, the code sample. To use it, simply update the three variables: API_KEY, ENDPOINT and DIR. That's it, you are good to convert images into text! The code sample will loop through all images in the directory path and dump all the extracted text into a single file. Remember, under the free tier you are limited to 20 calls per minute.
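To stay under that limit when processing a large batch of pages, a simple pause between requests is enough. Here is a small sketch I'm adding for illustration; it is not part of the original script:

```python
import time

RATE_LIMIT = 20            # free-tier calls per minute at the time of writing
DELAY = 60.0 / RATE_LIMIT  # seconds to wait between consecutive calls

def throttled(items, delay=DELAY):
    # Yield each item, pausing between items to respect the rate limit.
    for i, item in enumerate(items):
        if i:
            time.sleep(delay)
        yield item
```

The main loop could then iterate over throttled(sorted(os.listdir(DIR))) instead of the raw directory listing.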

Notes:

  • The loop that cycles through each file in the nominated directory only proceeds if the filename ends in ".jpeg". You may need to change this for your specific requirements (e.g. .jpg, .png, etc.).
  • If you completely overwrite the ENDPOINT variable in the code by copying and pasting from the Azure portal, you will need to append "/ocr" to the end of the string.
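To illustrate that second note, the base endpoint copied from the portal just needs "/ocr" appended (the region name below is illustrative, not necessarily yours):

```python
# The portal gives you a base endpoint such as the one below; the OCR
# operation lives at <base>/ocr.
BASE_ENDPOINT = 'https://westus.api.cognitive.microsoft.com/vision/v1.0'

def ocr_url(base):
    # Append the OCR operation, tolerating a trailing slash on the base.
    return base.rstrip('/') + '/ocr'
```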
import os
import json
import requests

API_KEY = 'YOUR_API_KEY'
ENDPOINT = 'https://YOUR_ENDPOINT_REGION.api.cognitive.microsoft.com/vision/v1.0/ocr'
DIR = 'YOUR_PATH_TO_IMAGE_FILES'

def handler():
    text = ''
    for filename in sorted(os.listdir(DIR)):
        if filename.endswith(".jpeg"): 
            pathToImage = '{0}/{1}'.format(DIR, filename)
            results = get_text(pathToImage)
            text += parse_text(results)

    with open('output.txt', 'w') as f:
        f.write(text)

def parse_text(results):
    text = ''
    for region in results['regions']:
        for line in region['lines']:
            for word in line['words']:
                text += word['text'] + ' '
            text += '\n'
    return text  

def get_text(pathToImage):
    print('Processing: ' + pathToImage)
    headers  = {
        'Ocp-Apim-Subscription-Key': API_KEY,
        'Content-Type': 'application/octet-stream'
    }
    params   = {
        'language': 'en',
        'detectOrientation': 'true'
    }
    with open(pathToImage, 'rb') as f:
        payload = f.read()
    response = requests.post(ENDPOINT, headers=headers, params=params, data=payload)
    response.raise_for_status()  # surface HTTP errors (e.g. a bad key) early
    return response.json()

if __name__ == '__main__':
    handler()
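To make the parsing step concrete, here is a hand-written, trimmed-down illustration of the JSON shape the OCR endpoint returns (a real response carries extra fields such as boundingBox coordinates) and how the nested regions/lines/words structure flattens into text:

```python
import json

# Hand-written, trimmed-down illustration of an OCR response.
sample = json.loads('''{
  "regions": [
    {"lines": [
      {"words": [{"text": "12"}, {"text": "Jan"}, {"text": "2018"}]},
      {"words": [{"text": "Coffee"}, {"text": "-4.50"}]}
    ]}
  ]
}''')

# Same walk as parse_text: each word gets a trailing space, each line a newline.
text = ''
for region in sample['regions']:
    for line in region['lines']:
        for word in line['words']:
            text += word['text'] + ' '
        text += '\n'

print(text)
```

This prints "12 Jan 2018" and "Coffee -4.50" on separate lines (each with a trailing space, which is harmless for the kind of downstream analysis described above).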
