Azure Cosmos DB Graph API with Python

What is a graph database?

"In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store." - Wikipedia

The example below illustrates the basic concept. Two nodes (vertices) connected by a single edge (relationship).

basic_graph.png
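
As a rough illustration, the same picture can be written down in plain Python. This is only a sketch of the concept, not how Cosmos DB stores data; the ids, labels and property names (v1, v2, knows, name) and the neighbours helper are made up for this example.

```python
# Two vertices, each with an id, a label and some properties.
vertices = {
    "v1": {"label": "person", "properties": {"name": "Alice"}},
    "v2": {"label": "person", "properties": {"name": "Bob"}},
}

# One edge: a labelled, directed relationship from v1 to v2.
edges = [
    {"label": "knows", "from": "v1", "to": "v2", "properties": {}},
]


def neighbours(vertex_id):
    """Return the ids of the vertices reachable over outgoing edges."""
    return [edge["to"] for edge in edges if edge["from"] == vertex_id]


print(neighbours("v1"))  # ['v2']
```

The edge carries its own label and direction, which is what lets a graph database follow relationships without a join.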

When reading up on graph computing you may notice that the words Node and Vertex are used interchangeably. It is important to note that there is no difference between them.

With that in mind, from here on in I will stick with Vertex as this is the terminology seen throughout the Gremlin query language used by Azure Cosmos DB.

Why should I use a graph database?

Highly connected data models, where relationships are as important as (if not more important than) the vertices themselves, typically benefit from being stored in and queried from a graph database. Traditional relational databases tend to slow down when handling relationship-intensive data (i.e. queries with many joins), whereas graph databases remain highly performant.

1. Setting Up

1.1 Create an Azure Cosmos DB account
To get started you will need to create an Azure Cosmos DB account with the API set to Gremlin (graph). Gremlin is the graph traversal language of Apache TinkerPop (an open-source graph computing framework).

azure_cosmos_create.png

1.2 Create a Graph
Once the resource has been successfully deployed, launch Data Explorer and create a new graph.

  1. Azure Cosmos DB > Data Explorer > New Graph
  2. Enter a Database ID (e.g. cosmosDb)
  3. Enter a Graph ID (e.g. cosmosCollection)
  4. Change the Throughput (e.g. 400)
  5. Click OK
graph_create.png

1.3 Python Virtual Environment
We will be using the gremlinpython library to load our graph database programmatically. In this example, I am using Python 3.6.1 and gremlinpython 3.2.6. (Note: at the time of this post there seems to be an incompatibility between Cosmos DB and the latest release of gremlinpython, 3.3.1.)

virtualenv -p python3 .
source bin/activate
pip install gremlinpython==3.2.6
graph_venv.gif

1.4 Header Code
Create a new file (e.g. graph.py) and add the following code. Your primary key can be found in the Azure portal under Cosmos DB > Keys.

from gremlin_python.driver import client

ENDPOINT = 'ENTER_YOUR_COSMOS_DB.gremlin.cosmosdb.azure.com'
DATABASE = 'ENTER_YOUR_GRAPH_DB'
COLLECTION = 'ENTER_YOUR_GRAPH_COLLECTION'
PRIMARY_KEY = 'ENTER_YOUR_PRIMARY_KEY'
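
For orientation, these constants are combined later in the script into the WebSocket URL and the username passed to the client. A small sketch of that composition follows; connection_settings is a hypothetical helper (it is not part of gremlinpython) and the example values are placeholders.

```python
def connection_settings(endpoint, database, collection):
    """Build the URL and username the same way handler() does later on."""
    url = 'wss://' + endpoint + ':443/'
    username = '/dbs/' + database + '/colls/' + collection
    return url, username


# Placeholder values, for illustration only.
url, username = connection_settings(
    'myaccount.gremlin.cosmosdb.azure.com', 'cosmosDb', 'cosmosCollection'
)
print(url)       # wss://myaccount.gremlin.cosmosdb.azure.com:443/
print(username)  # /dbs/cosmosDb/colls/cosmosCollection
```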

2. Example Graph

In the example below there are 6 vertices and 8 edges that describe the following:

  • Two people (Tim and Jonathan) work for a company called Apple.
  • Tim manages Jonathan.
  • Tim and Jonathan have Leadership as a common skill.
  • Jonathan is also competent in Design and Innovation.
example_graph.png

In our Python file, we will be executing Gremlin queries. While the language syntax is quite descriptive, you may want to head over to Apache TinkerPop's Getting Started to get a basic understanding of the commands.

2.1 Queries
The code below defines the Gremlin queries that create the vertices and edges from our scenario, as two Python lists of strings.

VERTICES = [
    "g.addV('PERSON').property('id', 'P1').property('name', 'Tim Cook').property('title', 'CEO')",
    "g.addV('PERSON').property('id', 'P2').property('name', 'Jonathan Ive').property('title', 'Chief Design Officer')",
    "g.addV('COMPANY').property('id', 'C1').property('name', 'Apple').property('location', 'California, USA')",
    "g.addV('SKILL').property('id', 'S1').property('name', 'Leadership')",
    "g.addV('SKILL').property('id', 'S2').property('name', 'Design')",
    "g.addV('SKILL').property('id', 'S3').property('name', 'Innovation')"
]

EDGES = [
    "g.V('P1').addE('manages').to(g.V('P2'))",
    "g.V('P2').addE('managed by').to(g.V('P1'))",
    "g.V('P1').addE('works for').to(g.V('C1'))",
    "g.V('P2').addE('works for').to(g.V('C1'))",
    "g.V('P1').addE('competent in').to(g.V('S1'))",
    "g.V('P2').addE('competent in').to(g.V('S1'))",
    "g.V('P2').addE('competent in').to(g.V('S2'))",
    "g.V('P2').addE('competent in').to(g.V('S3'))"
]
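
If the lists grow, the same strings could be generated rather than typed by hand. A minimal sketch, assuming ids, labels and values contain no single quotes (nothing is escaped here); add_vertex_query and add_edge_query are hypothetical helpers, not part of gremlinpython.

```python
def add_vertex_query(label, vertex_id, **properties):
    """Return a Gremlin addV query string for one vertex."""
    query = "g.addV('{0}').property('id', '{1}')".format(label, vertex_id)
    # Keyword arguments keep their order on Python 3.6 and later.
    for key, value in properties.items():
        query += ".property('{0}', '{1}')".format(key, value)
    return query


def add_edge_query(source_id, label, target_id):
    """Return a Gremlin addE query string from source to target."""
    return "g.V('{0}').addE('{1}').to(g.V('{2}'))".format(
        source_id, label, target_id)


print(add_vertex_query('PERSON', 'P1', name='Tim Cook', title='CEO'))
print(add_edge_query('P1', 'manages', 'P2'))
```

The two printed strings match the first entries of VERTICES and EDGES above.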

2.2 Functions

The script carries out the following steps:

  1. Initialise the client.
  2. Purge the graph database of any existing content.
  3. Insert the vertices.
  4. Insert the edges.
def cleanup_graph(gremlin_client):
    callback = gremlin_client.submitAsync("g.V().drop()")
    if callback.result() is not None:
        print("Cleaned up the graph!")

def insert_vertices(gremlin_client):
    for vertex in VERTICES:
        callback = gremlin_client.submitAsync(vertex)
        if callback.result() is not None:
            print("Inserted this vertex:\n{0}".format(callback.result().one()))
        else:
            print("Something went wrong with this query: {0}".format(vertex))

def insert_edges(gremlin_client):
    for edge in EDGES:
        callback = gremlin_client.submitAsync(edge)
        if callback.result() is not None:
            print("Inserted this edge:\n{0}".format(callback.result().one()))
        else:
            print("Something went wrong with this query:\n{0}".format(edge))

def handler():
    # Initialise client
    print('Initialising client...')
    gremlin_client = client.Client(
        'wss://' + ENDPOINT + ':443/', 'g',
        username="/dbs/" + DATABASE + "/colls/" + COLLECTION,
        password=PRIMARY_KEY
    )
    print('Client initialised!')

    # Purge graph
    cleanup_graph(gremlin_client)

    # Insert vertices (i.e. nodes)
    insert_vertices(gremlin_client)

    # Insert edges (i.e. relationships)
    insert_edges(gremlin_client)

    print('Finished!')

if __name__ == '__main__':
    handler()
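
One caveat on the "is not None" checks above: submitAsync returns a future, and a future from Python's concurrent.futures raises from result() when the call failed rather than returning None, so the else branches may never run. Below is a hedged sketch of an alternative, assuming the driver reports failures as exceptions. run_query, FakeResultSet and FakeClient are hypothetical names; FakeClient only stands in for the real client so the snippet runs without a Cosmos DB account.

```python
from concurrent.futures import Future


def run_query(gremlin_client, query):
    """Submit one query; return (True, first_result) or (False, error)."""
    try:
        result_set = gremlin_client.submitAsync(query).result()
        return True, result_set.one()
    except Exception as error:  # assumption: failures surface as exceptions
        return False, error


class FakeResultSet:
    """Minimal stand-in for the driver's result set."""
    def __init__(self, rows):
        self.rows = rows

    def one(self):
        return self.rows


class FakeClient:
    """Stand-in for client.Client; rejects any query containing 'bad'."""
    def submitAsync(self, query):
        future = Future()
        if 'bad' in query:
            future.set_exception(RuntimeError('query rejected'))
        else:
            future.set_result(FakeResultSet([query]))
        return future


print(run_query(FakeClient(), "g.V().count()"))
print(run_query(FakeClient(), "bad query"))
```

With the real client, run_query(gremlin_client, vertex) could replace the submit-and-check pair inside the loops.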

2.3 Output Visualised
If the script ran successfully, navigate back to Cosmos DB > Data Explorer > Database > Collection and execute the query g.V() to return the graph and see the results visualised.

mini_graph.png

If we focus on a particular vertex, we can get a summary of the Properties, Sources and Targets.

vertex.png

Now that we have some data we can query and traverse our graph.

Return all people
g.V().hasLabel('PERSON')

Return all people with the title CEO
g.V().hasLabel('PERSON').has('title', 'CEO')

Extract the name value from all skills
g.V().hasLabel('SKILL').values('name')

Get all people that Tim manages
g.V('P1').out('manages').hasLabel('PERSON').values('name')

Get all skills that Tim's team is competent in
g.V('P1').out('manages').hasLabel('PERSON').out('competent in').hasLabel('SKILL').values('name')
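
To see what the last traversal does step by step, here is a plain-Python emulation over the same edges. It is a sketch for intuition only and does not talk to Cosmos DB; NAMES, EDGE_LIST and out are hypothetical names, the hasLabel filters are left out for brevity, and the order of results returned by the real service may differ.

```python
NAMES = {
    'P1': 'Tim Cook', 'P2': 'Jonathan Ive', 'C1': 'Apple',
    'S1': 'Leadership', 'S2': 'Design', 'S3': 'Innovation',
}

# (source id, edge label, target id), mirroring the EDGES list.
EDGE_LIST = [
    ('P1', 'manages', 'P2'),
    ('P2', 'managed by', 'P1'),
    ('P1', 'works for', 'C1'),
    ('P2', 'works for', 'C1'),
    ('P1', 'competent in', 'S1'),
    ('P2', 'competent in', 'S1'),
    ('P2', 'competent in', 'S2'),
    ('P2', 'competent in', 'S3'),
]


def out(vertex_ids, label):
    """Follow outgoing edges with the given label from each vertex."""
    return [target
            for vertex_id in vertex_ids
            for source, edge_label, target in EDGE_LIST
            if source == vertex_id and edge_label == label]


team = out(['P1'], 'manages')        # like g.V('P1').out('manages')
skills = out(team, 'competent in')   # like .out('competent in')
print([NAMES[s] for s in skills])    # like .values('name')
# ['Leadership', 'Design', 'Innovation']
```

Each out() call takes the vertices produced by the previous step and moves one hop along matching edges, which is the basic idea behind chaining steps in a Gremlin traversal.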

2.4 Complete Code

from gremlin_python.driver import client

ENDPOINT = 'ENTER_YOUR_COSMOS_DB.gremlin.cosmosdb.azure.com'
DATABASE = 'ENTER_YOUR_GRAPH_DB'
COLLECTION = 'ENTER_YOUR_GRAPH_COLLECTION'
PRIMARY_KEY = 'ENTER_YOUR_PRIMARY_KEY'

VERTICES = [
    "g.addV('PERSON').property('id', 'P1').property('name', 'Tim Cook').property('title', 'CEO')",
    "g.addV('PERSON').property('id', 'P2').property('name', 'Jonathan Ive').property('title', 'Chief Design Officer')",
    "g.addV('COMPANY').property('id', 'C1').property('name', 'Apple').property('location', 'California, USA')",
    "g.addV('SKILL').property('id', 'S1').property('name', 'Leadership')",
    "g.addV('SKILL').property('id', 'S2').property('name', 'Design')",
    "g.addV('SKILL').property('id', 'S3').property('name', 'Innovation')"
]

EDGES = [
    "g.V('P1').addE('manages').to(g.V('P2'))",
    "g.V('P2').addE('managed by').to(g.V('P1'))",
    "g.V('P1').addE('works for').to(g.V('C1'))",
    "g.V('P2').addE('works for').to(g.V('C1'))",
    "g.V('P1').addE('competent in').to(g.V('S1'))",
    "g.V('P2').addE('competent in').to(g.V('S1'))",
    "g.V('P2').addE('competent in').to(g.V('S2'))",
    "g.V('P2').addE('competent in').to(g.V('S3'))"
]

def cleanup_graph(gremlin_client):
    callback = gremlin_client.submitAsync("g.V().drop()")
    if callback.result() is not None:
        print("Cleaned up the graph!")

def insert_vertices(gremlin_client):
    for vertex in VERTICES:
        callback = gremlin_client.submitAsync(vertex)
        if callback.result() is not None:
            print("Inserted this vertex:\n{0}".format(callback.result().one()))
        else:
            print("Something went wrong with this query: {0}".format(vertex))

def insert_edges(gremlin_client):
    for edge in EDGES:
        callback = gremlin_client.submitAsync(edge)
        if callback.result() is not None:
            print("Inserted this edge:\n{0}".format(callback.result().one()))
        else:
            print("Something went wrong with this query:\n{0}".format(edge))

def handler():
    # Initialise client
    print('Initialising client...')
    gremlin_client = client.Client(
        'wss://' + ENDPOINT + ':443/', 'g',
        username="/dbs/" + DATABASE + "/colls/" + COLLECTION,
        password=PRIMARY_KEY
    )
    print('Client initialised!')

    # Purge graph
    cleanup_graph(gremlin_client)

    # Insert vertices (i.e. nodes)
    insert_vertices(gremlin_client)

    # Insert edges (i.e. relationships)
    insert_edges(gremlin_client)

    print('Finished!')

if __name__ == '__main__':
    handler()

Real World Use Cases

So all this is well and good but how can we apply this in the real world?

  • Social Networks - Facebook, LinkedIn, Snapchat, Twitter.
  • Recommendation Engines - Airbnb, Netflix, Expedia.
  • Human Capital Management - e.g. a vacancy can be filled because Person A is competent in Skill X.
  • Impact Analysis - Networks, Data Lineage.
  • Fraud Detection - Find patterns in data that do not align with expected behaviors.

Hopefully this gives you some insight into graph databases, what they are, and how they can be leveraged to derive value out of highly connected data.

- fin -