Azure Data Lake Series: Working with JSON - Part 2
Important: If you are just beginning and trying to figure out how to parse JSON documents with U-SQL and Azure Data Lake Analytics, I highly recommend kicking off with Part 1 in this series.
Prerequisites
- An Azure subscription
- An Azure Data Lake Store account
- An Azure Data Lake Analytics account
- Uploaded and registered custom .NET JSON assemblies (see Part 1)
- Uploaded exercise02.json and exercise03.json files to your Azure Data Lake Store
Exercise #2 - Array of Objects
In Exercise #1 we started off with a very basic example: a single object (e.g. a movie) with some attributes (e.g. title, genre, rating). But how would our code need to change in order to parse an array of movies?
Input: exercise02.json
{
"movies": [
{
"title": "The Godfather",
"rating": "R",
"genre": "Crime, Drama",
"year": 1972
},
{
"title": "The Dark Knight",
"rating": "M",
"genre": "Action, Crime, Drama",
"year": 2008
},
{
"title": "Pulp Fiction",
"rating": "R",
"genre": "Crime, Drama",
"year": 1994
},
{
"title": "Fight Club",
"rating": "R",
"genre": "Drama",
"year": 1999
}
]
}
Transformation: exercise02.usql
Note that the change is very subtle. The only difference between the U-SQL required in Exercise #1, where we only needed to parse a single object, and the U-SQL below, where we have an array of objects, is in step 2: we pass a string parameter with the value "$.movies[*]" through to JsonExtractor(). This parameter value is a valid JSONPath expression instructing the JsonExtractor to loop through each movie within the array.
Tip: There are a number of online tools such as jsonpath.com that can assist with constructing valid JSONPath expressions.
// A. CREATE ASSEMBLY: Register assemblies (if they do not already exist).
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM @"/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM @"/Microsoft.Analytics.Samples.Formats.dll";
// B. REFERENCE ASSEMBLY: Load assemblies for compile time and execution.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
// C. USING: Specify namespace to shorten function names (e.g. Microsoft.Analytics.Samples.Formats.Json.JsonExtractor)
USING Microsoft.Analytics.Samples.Formats.Json;
// 1. Initialise variables for Input (e.g. JSON) and Output (e.g. CSV).
DECLARE @InputFile string = @"/exercise02.json";
DECLARE @OutputFile string = @"/exercise02.csv";
// 2. Extract string content from JSON document (i.e. schema on read).
@json =
EXTRACT
title string,
rating string,
genre string,
year int
FROM
@InputFile
USING new JsonExtractor("$.movies[*]");
// 3. Write values to CSV
OUTPUT @json
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);
Output: exercise02.csv
"title","rating","genre","year"
"The Godfather","R","Crime, Drama",1972
"The Dark Knight","M","Action, Crime, Drama",2008
"Pulp Fiction","R","Crime, Drama",1994
"Fight Club","R","Drama",1999
Exercise #3 - Nested Arrays
Based on the examples thus far we know that a single movie can have one or more genres. What if this one-to-many relationship were represented as a nested array of key/value pairs (e.g. genre ID and genre text)?
Input: exercise03.json
{
"movies": [
{
"title": "The Godfather",
"rating": "R",
"genre": [
{
"genre_id": "001",
"genre_text": "Crime"
},
{
"genre_id": "002",
"genre_text": "Drama"
}
],
"year": 1972
},
{
"title": "The Dark Knight",
"rating": "M",
"genre": [
{
"genre_id": "001",
"genre_text": "Crime"
},
{
"genre_id": "002",
"genre_text": "Drama"
},
{
"genre_id": "003",
"genre_text": "Action"
}
],
"year": 2008
},
{
"title": "Pulp Fiction",
"rating": "R",
"genre": [
{
"genre_id": "001",
"genre_text": "Crime"
},
{
"genre_id": "002",
"genre_text": "Drama"
}
],
"year": 1994
},
{
"title": "Fight Club",
"rating": "R",
"genre": [
{
"genre_id": "002",
"genre_text": "Drama"
}
],
"year": 1999
}
]
}
Transformation: exercise03.usql
See the embedded comments in the code below for detailed explanation as to what is occurring at each step.
Highlights:
- In step 3 we convert the genre string into a tuple using JsonTuple() and extract the nested genre array as its value (illustrated in the sketch after this list).
- In step 4 we unpack the array and generate a row per genre using CROSS APPLY EXPLODE.
- In step 5 we aggregate using ARRAY_AGG and concatenate to flatten our data set into a single row per movie.
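Before walking through the full script, here is a quick illustration of what JsonFunctions.JsonTuple() does: it parses a JSON string into a map of property name/value pairs that can be indexed by key. This is a minimal sketch only; the inline JSON literal, the @demo rowset and the output path are illustrative, and it assumes the assemblies from Part 1 are registered.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
// JsonTuple() turns a JSON string into a SqlMap of name/value pairs;
// indexing the map with "genre_text" returns the value "Crime".
@demo =
SELECT
JsonFunctions.JsonTuple(genre_json)["genre_text"] AS genre_text
FROM (VALUES (@"{""genre_id"":""001"",""genre_text"":""Crime""}")) AS t(genre_json);
OUTPUT @demo
TO @"/jsontuple_demo.csv"
USING Outputters.Csv(outputHeader:true);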
// A. CREATE ASSEMBLY: Register assemblies (if they do not already exist).
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM @"/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM @"/Microsoft.Analytics.Samples.Formats.dll";
// B. REFERENCE ASSEMBLY: Load assemblies for compile time and execution.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
// C. USING: Specify namespace to shorten function names (e.g. Microsoft.Analytics.Samples.Formats.Json.JsonExtractor)
USING Microsoft.Analytics.Samples.Formats.Json;
// 1. Initialise variables for Input (e.g. JSON) and Output (e.g. CSV).
DECLARE @InputFile string = @"/exercise03.json";
DECLARE @OutputFile string = @"/exercise03.csv";
// 2. Extract string content from JSON document (i.e. schema on read).
@json =
EXTRACT
title string,
rating string,
genre string,
year int
FROM @InputFile
USING new JsonExtractor("$.movies[*]");
// 3a. Convert the genre string into a key/value pair using JsonTuple()
// 3b. Extract the genre value (i.e. an array of genres)
@movies =
SELECT
title,
rating,
JsonFunctions.JsonTuple(genre).Values AS genre_array,
year
FROM @json;
// 4a. CROSS APPLY EXPLODE unpacks the rows (i.e. genres)
// 4b. The genre object is converted into a key/value pair using JsonTuple()
// 4c. By passing in the key "genre_text" we request only the genre's friendly name.
@movies_explode =
SELECT
title,
rating,
JsonFunctions.JsonTuple(genre)["genre_text"] AS genre_text,
year
FROM @movies
CROSS APPLY
EXPLODE(genre_array) AS g(genre);
// NOTE: If we were to stop at this point, our CSV would appear to
// have duplicates as there would be a row generated per genre.
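// For example, based on exercise03.json, @movies_explode at this point would
// contain one row per movie/genre combination, such as (illustrative):
//   "The Godfather","R","Crime",1972
//   "The Godfather","R","Drama",1972
//   "The Dark Knight","M","Crime",2008
//   "The Dark Knight","M","Drama",2008
//   "The Dark Knight","M","Action",2008
//   ...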
// 5. Aggregate and concatenate the genres into a single row.
@movies_agg =
SELECT
title,
rating,
string.Join(", ", ARRAY_AGG(genre_text)) AS genre_text,
year
FROM @movies_explode
GROUP BY title, rating, year;
// 6. Write values to CSV
OUTPUT @movies_agg
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);
Output: exercise03.csv
"title","rating","genre_text","year"
"Fight Club","R","Drama",1999
"Pulp Fiction","R","Crime, Drama",1994
"The Dark Knight","M","Crime, Drama, Action",2008
"The Godfather","R","Crime, Drama",1972
Next Steps
Check out Azure Data Lake Series: Working with JSON - Part 3 to see a real-world example where we parse large, heavily nested JSON documents containing data on over 20,000 restaurants across the City of London, split across 80+ files. I will show how to handle multi-level value extraction and potential inconsistencies in the data (e.g. nulls), and finally how to massage/cleanse the data before writing the output.