Azure Data Lake Series: Working with JSON - Part 2

Important to Note: If you are just beginning and trying to figure out how to parse JSON documents with U-SQL and Azure Data Lake Analytics, I highly recommend kicking off with Part 1 in this series.

Prerequisites

Exercise #2 - Array of Objects

In Exercise #1, we started off with a very basic example. A single object (e.g. a movie) with some attributes (e.g. title, genre, rating). But how would our code need to change in order to parse an array of movies?

Input: exercise02.json

{
    "movies": [
        {
            "title": "The Godfather",
            "rating": "R",
            "genre": "Crime, Drama",
            "year": 1972
        },
        {
            "title": "The Dark Knight",
            "rating": "M",
            "genre": "Action, Crime, Drama",
            "year": 2008
        },
        {
            "title": "Pulp Fiction",
            "rating": "R",
            "genre": "Crime, Drama",
            "year": 1994
        },
        {
            "title": "Fight Club",
            "rating": "R",
            "genre": "Drama",
            "year": 1999
        }
    ]

}

Transformation: exercise02.usql
Note that the change is very subtle, the only difference between the U-SQL required in Exercise #1 where we only needed to parse a single object compared to the U-SQL below where we have an array of objects is in line 25. We have passed a string parameter with the value "$.movies[*]" through to JsonExtractor(). This parameter value is a valid JSONPath expression instructing the JsonExtractor to loop through each movie within the array.

Tip: There are a number of online tools such as jsonpath.com that can assist with constructing valid JSONPath expressions.

// A. CREATE ASSEMBLY: Register assemblies (if they do not already exist).
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM @"/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM @"/Microsoft.Analytics.Samples.Formats.dll";
 
// B. REFERENCE ASSEMBLY: Load assemblies for compile time and execution.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

// C. USING: Specify namespace to shorten function names (e.g. Microsoft.Analytics.Samples.Formats.Json.JsonExtractor)
USING Microsoft.Analytics.Samples.Formats.Json;

// 1. Initialise variables for Input (e.g. JSON) and Output (e.g. CSV).
DECLARE @InputFile string = @"/exercise02.json";
DECLARE @OutputFile string = @"/exercise02.csv";

// 2. Extract string content from JSON document (i.e. schema on read).
@json =
EXTRACT
    title string,
    rating string,
    genre string,
    year int
FROM
    @InputFile
USING new JsonExtractor("$.movies[*]");

// 3. Write values to CSV
OUTPUT @json
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);

Output: exercise02.csv

"title","rating","genre","year"
"The Godfather","R","Crime, Drama",1972
"The Dark Knight","M","Action, Crime, Drama",2008
"Pulp Fiction","R","Crime, Drama",1994
"Fight Club","R","Drama",1999

Exercise #3 - Nested Arrays

Based on the examples thus far we know that a single movie can have one or more genres. What if this one-to-many relationship was depicted as a nested array of key value pairs (e.g. Genre ID and Genre Text)?

Input: exercise03.json

{
    "movies": [
        {
            "title": "The Godfather",
            "rating": "R",
            "genre": [
                {
                    "genre_id": "001",
                    "genre_text": "Crime"
                },
                {
                    "genre_id": "002",
                    "genre_text": "Drama"
                }
            ],
            "year": 1972
        },
        {
            "title": "The Dark Knight",
            "rating": "M",
            "genre": [
                {
                    "genre_id": "001",
                    "genre_text": "Crime"
                },
                {
                    "genre_id": "002",
                    "genre_text": "Drama"
                },
                {
                    "genre_id": "003",
                    "genre_text": "Action"
                }
            ],
            "year": 2008
        },
        {
            "title": "Pulp Fiction",
            "rating": "R",
            "genre": [
                {
                    "genre_id": "001",
                    "genre_text": "Crime"
                },
                {
                    "genre_id": "002",
                    "genre_text": "Drama"
                }
            ],
            "year": 1994
        },
        {
            "title": "Fight Club",
            "rating": "R",
            "genre": [
                {
                    "genre_id": "002",
                    "genre_text": "Drama"
                }
            ],
            "year": 1999
        }
    ]

}

Transformation: exercise03.usql
See the embedded comments in the code below for detailed explanation as to what is occurring at each step.

Highlights:

  • In step 3 we extract a nested array as a value of a tuple.
  • In step 4 we unpack the array and generate a row per genre using CROSS APPLY EXPLODE.
  • In step 5 we aggregate using ARRAY_AGG and concatenate to flatten our data set into a single row per movie.
// A. CREATE ASSEMBLY: Register assemblies (if they do not already exist).
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM @"/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM @"/Microsoft.Analytics.Samples.Formats.dll";
 
// B. REFERENCE ASSEMBLY: Load assemblies for compile time and execution.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

// C. USING: Specify namespace to shorten function names (e.g. Microsoft.Analytics.Samples.Formats.Json.JsonExtractor)
USING Microsoft.Analytics.Samples.Formats.Json;

// 1. Initialise variables for Input (e.g. JSON) and Output (e.g. CSV).
DECLARE @InputFile string = @"/exercise03.json";
DECLARE @OutputFile string = @"/exercise03.csv";

// 2. Extract string content from JSON document (i.e. schema on read).
@json =
EXTRACT
    title string,
    rating string,
    genre string,
    year int
FROM @InputFile
USING new JsonExtractor("$.movies[*]");

// 3a. Convert the genre string into a key/value pair using JsonTuple()
// 3b. Extract the genre value (i.e. an array of genres)
@movies =
SELECT
    title,
    rating,
    JsonFunctions.JsonTuple(genre).Values AS genre_array,
    year
FROM @json;

// 4a. CROSS APPLY EXPLODE unpacks the rows (i.e. genres)
// 4b. The genre object is converted into a key/value pair using JsonTuple()
// 4c. By passing in the key "genre_text" we have soley requested the genre's friendly name.
@movies_explode =
SELECT
    title,
    rating,
    JsonFunctions.JsonTuple(genre)["genre_text"] AS genre_text,
    year
FROM @movies
    CROSS APPLY
        EXPLODE(genre_array) AS g(genre);

// NOTE: If we were to stop at this point, our CSV would appear to
// have duplicates as there would be a row generated per genre.

// 5. Aggregate and concatenate the genres into a single row.
@movies_agg =
SELECT
    title,
    rating,
    string.Join(", ", ARRAY_AGG(genre_text)) AS genre_text,
    year
FROM @movies_explode
GROUP BY title, rating, year;

// 6. Write values to CSV
OUTPUT @movies_agg
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);

Output: exercise03.csv

"title","rating","genre_text","year"
"Fight Club","R","Drama",1999
"Pulp Fiction","R","Crime, Drama",1994
"The Dark Knight","M","Crime, Drama, Action",2008
"The Godfather","R","Crime, Drama",1972

Next Steps

Check out Azure Data Lake Series: Working with JSON - Part 3 to see a real-world example where we will parse large, heavily nested JSON documents that contain data on over 20,000 restaurants across the City of London split across 80+ files. I will show how we handle multi-level value extraction, potential inconsistencies in our data (i.e. nulls) and finally how we can massage/cleanse our data before returning our output.

Video: Mastering JSON in Azure Data Lake with U-SQL