Azure Data Lake Series: Working with JSON - Part 2
Important: If you are just beginning and trying to figure out how to parse JSON documents with U-SQL and Azure Data Lake Analytics, I highly recommend kicking off with Part 1 in this series.
Prerequisites
- An Azure subscription
- An Azure Data Lake Store account
- An Azure Data Lake Analytics account
- Uploaded and registered custom .NET JSON assemblies (see Part 1)
- Uploaded exercise02.json and exercise03.json files to your Azure Data Lake Store
Exercise #2 - Array of Objects
In Exercise #1 we started off with a very basic example: a single object (e.g. a movie) with some attributes (e.g. title, genre, rating). But how would our code need to change in order to parse an array of movies?
Input: exercise02.json
{
"movies": [
{
"title": "The Godfather",
"rating": "R",
"genre": "Crime, Drama",
"year": 1972
},
{
"title": "The Dark Knight",
"rating": "M",
"genre": "Action, Crime, Drama",
"year": 2008
},
{
"title": "Pulp Fiction",
"rating": "R",
"genre": "Crime, Drama",
"year": 1994
},
{
"title": "Fight Club",
"rating": "R",
"genre": "Drama",
"year": 1999
}
]
}
Transformation: exercise02.usql
Note that the change is very subtle. The only difference between the U-SQL required in Exercise #1, where we only needed to parse a single object, and the U-SQL below, where we have an array of objects, is in step 2: we pass a string parameter with the value "$.movies[*]" through to JsonExtractor(). This parameter value is a valid JSONPath expression instructing the JsonExtractor to loop through each movie within the array.
Tip: There are a number of online tools such as jsonpath.com that can assist with constructing valid JSONPath expressions.
// A. CREATE ASSEMBLY: Register assemblies (if they do not already exist).
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM @"/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM @"/Microsoft.Analytics.Samples.Formats.dll";
// B. REFERENCE ASSEMBLY: Load assemblies for compile time and execution.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
// C. USING: Specify namespace to shorten function names (e.g. Microsoft.Analytics.Samples.Formats.Json.JsonExtractor)
USING Microsoft.Analytics.Samples.Formats.Json;
// 1. Initialise variables for Input (e.g. JSON) and Output (e.g. CSV).
DECLARE @InputFile string = @"/exercise02.json";
DECLARE @OutputFile string = @"/exercise02.csv";
// 2. Extract string content from JSON document (i.e. schema on read).
@json =
EXTRACT
title string,
rating string,
genre string,
year int
FROM
@InputFile
USING new JsonExtractor("$.movies[*]");
// 3. Write values to CSV
OUTPUT @json
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);
Output: exercise02.csv
"title","rating","genre","year"
"The Godfather","R","Crime, Drama",1972
"The Dark Knight","M","Action, Crime, Drama",2008
"Pulp Fiction","R","Crime, Drama",1994
"Fight Club","R","Drama",1999
Exercise #3 - Nested Arrays
Based on the examples thus far we know that a single movie can have one or more genres. What if this one-to-many relationship were represented as a nested array of key/value pairs (e.g. genre ID and genre text)?
Input: exercise03.json
{
"movies": [
{
"title": "The Godfather",
"rating": "R",
"genre": [
{
"genre_id": "001",
"genre_text": "Crime"
},
{
"genre_id": "002",
"genre_text": "Drama"
}
],
"year": 1972
},
{
"title": "The Dark Knight",
"rating": "M",
"genre": [
{
"genre_id": "001",
"genre_text": "Crime"
},
{
"genre_id": "002",
"genre_text": "Drama"
},
{
"genre_id": "003",
"genre_text": "Action"
}
],
"year": 2008
},
{
"title": "Pulp Fiction",
"rating": "R",
"genre": [
{
"genre_id": "001",
"genre_text": "Crime"
},
{
"genre_id": "002",
"genre_text": "Drama"
}
],
"year": 1994
},
{
"title": "Fight Club",
"rating": "R",
"genre": [
{
"genre_id": "002",
"genre_text": "Drama"
}
],
"year": 1999
}
]
}
Transformation: exercise03.usql
See the embedded comments in the code below for detailed explanation as to what is occurring at each step.
Highlights:
- In step 3 we convert the genre string into a tuple using JsonTuple() and extract the nested genre array as its value (illustrated in the sketch after this list).
- In step 4 we unpack the array and generate a row per genre using CROSS APPLY EXPLODE.
- In step 5 we aggregate using ARRAY_AGG and concatenate to flatten our data set into a single row per movie.
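Before walking through the full script, here is a quick illustration of what JsonFunctions.JsonTuple() does: it parses a JSON string into a map of property name/value pairs that can be indexed by key. This is a minimal sketch only; the inline JSON literal, the @demo rowset and the output path are illustrative, and it assumes the assemblies from Part 1 are registered.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
// JsonTuple() turns a JSON string into a SqlMap of name/value pairs;
// indexing the map with "genre_text" returns the value "Crime".
@demo =
SELECT
JsonFunctions.JsonTuple(genre_json)["genre_text"] AS genre_text
FROM (VALUES (@"{""genre_id"":""001"",""genre_text"":""Crime""}")) AS t(genre_json);
OUTPUT @demo
TO @"/jsontuple_demo.csv"
USING Outputters.Csv(outputHeader:true);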
// A. CREATE ASSEMBLY: Register assemblies (if they do not already exist).
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM @"/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM @"/Microsoft.Analytics.Samples.Formats.dll";
// B. REFERENCE ASSEMBLY: Load assemblies for compile time and execution.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
// C. USING: Specify namespace to shorten function names (e.g. Microsoft.Analytics.Samples.Formats.Json.JsonExtractor)
USING Microsoft.Analytics.Samples.Formats.Json;
// 1. Initialise variables for Input (e.g. JSON) and Output (e.g. CSV).
DECLARE @InputFile string = @"/exercise03.json";
DECLARE @OutputFile string = @"/exercise03.csv";
// 2. Extract string content from JSON document (i.e. schema on read).
@json =
EXTRACT
title string,
rating string,
genre string,
year int
FROM @InputFile
USING new JsonExtractor("$.movies[*]");
// 3a. Convert the genre string into a key/value pair using JsonTuple()
// 3b. Extract the genre value (i.e. an array of genres)
@movies =
SELECT
title,
rating,
JsonFunctions.JsonTuple(genre).Values AS genre_array,
year
FROM @json;
// 4a. CROSS APPLY EXPLODE unpacks the rows (i.e. genres)
// 4b. The genre object is converted into a key/value pair using JsonTuple()
// 4c. By passing in the key "genre_text" we request only the genre's friendly name.
@movies_explode =
SELECT
title,
rating,
JsonFunctions.JsonTuple(genre)["genre_text"] AS genre_text,
year
FROM @movies
CROSS APPLY
EXPLODE(genre_array) AS g(genre);
// NOTE: If we were to stop at this point, our CSV would appear to
// have duplicates as there would be a row generated per genre.
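// For example, based on exercise03.json, @movies_explode at this point would
// contain one row per movie/genre combination, such as (illustrative):
//   "The Godfather","R","Crime",1972
//   "The Godfather","R","Drama",1972
//   "The Dark Knight","M","Crime",2008
//   "The Dark Knight","M","Drama",2008
//   "The Dark Knight","M","Action",2008
//   ...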
// 5. Aggregate and concatenate the genres into a single row.
@movies_agg =
SELECT
title,
rating,
string.Join(", ", ARRAY_AGG(genre_text)) AS genre_text,
year
FROM @movies_explode
GROUP BY title, rating, year;
// 6. Write values to CSV
OUTPUT @movies_agg
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);
Output: exercise03.csv
"title","rating","genre_text","year"
"Fight Club","R","Drama",1999
"Pulp Fiction","R","Crime, Drama",1994
"The Dark Knight","M","Crime, Drama, Action",2008
"The Godfather","R","Crime, Drama",1972
Next Steps
Check out Azure Data Lake Series: Working with JSON - Part 3 to see a real-world example where we parse large, heavily nested JSON documents containing data on over 20,000 restaurants across the City of London, split across 80+ files. I will show how to handle multi-level value extraction and potential inconsistencies in the data (e.g. nulls), and finally how to massage/cleanse the data before writing the output.