DP-200: Implementing an Azure Data Solution - Exam Prep
A collection of resources, study notes, and learning material that helped me, and can hopefully help others, prepare for and pass exam DP-200: Implementing an Azure Data Solution. Note: Passing DP-200 is one of two steps required to become a Microsoft Certified: Azure Data Engineer Associate; you must pass both DP-200 and DP-201.
Suggested Approach
Briefly skim through and familiarise yourself with the Resources and Skills Measured sections below (Tip: For the Skills Measured, refer to the latest skills outline available directly from the exam home page, as the content changes from time to time. The copy in this article is current as of 31 December 2019; key phrases have been pre-highlighted to aid learning).
Complete the DP-200 Microsoft Learn Collection. This is a custom learning path that will provide decent coverage across the various topics and technologies included in the exam.
Revisit the Skills Measured section below after completing the learning path and dive into the links for any areas that were not covered by the Microsoft Learn collection and/or require a deeper understanding.
Lastly, read over the Reference Tables for succinct summaries of key topics and comparisons.
Resources
Resource | Link |
---|---|
Exam | Exam DP-200: Implementing an Azure Data Solution |
Related Certification | Microsoft Certified: Azure Data Engineer Associate |
Microsoft Learn | Azure Data Engineer Learning Paths |
Hands-On Labs | Microsoft Learning - GitHub: DP-200 Hands-On Labs |
MS Learn Collection | Custom Microsoft Learn Collection for DP-200 |
Skills Measured
Note: The content of the exam was updated on the 4th December 2019 to include Azure Data Explorer and to rename references to “SQL Data Warehouse” as “Azure Synapse Analytics”. Tip: Always refer to the latest skills outline available on the exam home page in case of any recent content changes.
Approximate Split of Content by Section
1. Implement data storage solutions (40-45%)
Implement non-relational data stores
- implement a solution that uses Cosmos DB [1], Data Lake Storage Gen2 [2], or Blob storage [3] (a Blob storage sketch follows this list)
- implement data distribution and partitions [1] [2]
- implement a consistency model in Cosmos DB [1] [2]
- provision a non-relational data store [1] [2]
- provide access to data to meet security requirements [1]
- implement for high availability, disaster recovery, and global distribution [1] [2] [3]
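As referenced in the first bullet above, here is a minimal Blob storage sketch using the azure-storage-blob Python package. The connection string, container, and blob names are placeholders rather than values from any particular environment:

```python
# Minimal sketch: land a local file in Blob storage with azure-storage-blob.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw-data")
container.create_container()  # one-time setup; raises ResourceExistsError if it already exists

# Upload a local file as a block blob under a date-style virtual folder
with open("sales.csv", "rb") as data:
    container.upload_blob(name="2019/12/sales.csv", data=data, overwrite=True)
```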
Implement relational data stores
- configure elastic pools [1]
- configure geo-replication [1]
- provide access to data to meet security requirements [1] [2] [3]
- implement for high availability, disaster recovery, and global distribution [1] [2] [3]
- implement data distribution and partitions for Azure Synapse Analytics [1] [2]
- implement PolyBase [1] [2] (see the PolyBase sketch after this list)
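As flagged in the PolyBase bullet above, the following sketches the documented external-table-plus-CTAS load pattern for a dedicated SQL pool, driven from Python via pyodbc. All server, schema, and storage names are placeholders, and the database-scoped credential required for non-public storage is omitted for brevity:

```python
# Sketch of the PolyBase load pattern for Azure Synapse Analytics, run via pyodbc.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=SalesDW;"
    "UID=loader;PWD=<password>", autocommit=True)
cur = conn.cursor()

# 1. Point PolyBase at files in Data Lake Storage Gen2
cur.execute("""
CREATE EXTERNAL DATA SOURCE LakeSource
WITH (TYPE = HADOOP, LOCATION = 'abfss://raw@mydatalake.dfs.core.windows.net')
""")
cur.execute("""
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2))
""")
cur.execute("""
CREATE EXTERNAL TABLE ext.Sales (SaleId INT, CustomerId INT, Amount MONEY)
WITH (LOCATION = '/sales/', DATA_SOURCE = LakeSource, FILE_FORMAT = CsvFormat)
""")

# 2. Load into a hash-distributed table with CTAS (runs in parallel across nodes)
cur.execute("""
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(CustomerId))
AS SELECT * FROM ext.Sales
""")
```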
Manage data security
2. Manage and develop data processing (25-30%)
Develop batch processing solutions
- develop batch processing solutions by using Data Factory and Azure Databricks [1]
- ingest data by using PolyBase [1] [2]
- implement the integration runtime for Data Factory [1] [2] [3]
- create linked services and datasets [1] [2]
- create pipelines and activities [1]
- create and schedule triggers [1]
- implement Azure Databricks clusters [1], notebooks [2], jobs [3], and autoscaling [4]
- ingest data into Azure Databricks [1] (see the PySpark sketch after this list)
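As noted in the last bullet, here is a minimal PySpark ingestion sketch, assuming it runs inside a Databricks notebook (where `spark` and `dbutils` are predefined) and that the storage account key is held in a secret scope; all account, container, and table names are placeholders:

```python
# Sketch: read CSV files from ADLS Gen2 into Databricks and persist as a Delta table.
spark.conf.set(
    "fs.azure.account.key.mydatalake.dfs.core.windows.net",
    dbutils.secrets.get(scope="lake", key="account-key"))

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/"))

df.write.format("delta").mode("overwrite").saveAsTable("bronze_sales")
```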
Develop streaming solutions
- configure input and output [1] [2]
- select the appropriate windowing functions [1] [2]
- implement event processing by using Stream Analytics [1] [2]
- ingest and query streaming data with Azure Data Explorer [1] [2] (see the query sketch after this list)
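As referenced in the Azure Data Explorer bullet, a small query sketch using the azure-kusto-data package; the cluster URI, database, and table names are placeholders:

```python
# Sketch: query streaming data landed in Azure Data Explorer with azure-kusto-data.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://mycluster.westeurope.kusto.windows.net")
client = KustoClient(kcsb)

# KQL: events per minute over the last hour
query = """
Telemetry
| where Timestamp > ago(1h)
| summarize Events = count() by bin(Timestamp, 1m)
"""
response = client.execute("TelemetryDb", query)
for row in response.primary_results[0]:
    print(row["Timestamp"], row["Events"])
```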
3. Monitor and optimize data solutions (30-35%)
Monitor data storage
- monitor relational and non-relational data sources
- implement Blob storage monitoring [1] [2] (see the metrics sketch after this list)
- implement Data Lake Store monitoring [1]
- implement SQL Database monitoring [1] [2]
- implement Azure Synapse Analytics monitoring [1]
- implement Cosmos DB monitoring [1] [2]
- implement Azure Data Explorer monitoring [1] [2]
- configure Azure Monitor alerts [1]
- implement auditing by using Azure Log Analytics [1]
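As flagged in the Blob storage monitoring bullet, a sketch that pulls platform metrics with the azure-monitor-query and azure-identity packages; the resource ID segments are placeholders:

```python
# Sketch: query Azure Monitor platform metrics for a storage account.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())
resource_id = ("/subscriptions/<sub>/resourceGroups/<rg>/providers/"
               "Microsoft.Storage/storageAccounts/mystorage")

result = client.query_resource(
    resource_id,
    metric_names=["Transactions", "Availability"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5))

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total, point.average)
```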
Monitor data processing
- design and implement Data Factory monitoring [1] [2]
- monitor Azure Databricks [1] [2]
- monitor Stream Analytics [1] [2]
- configure Azure Monitor alerts [1]
- implement auditing by using Azure Log Analytics [1] (see the log query sketch after this list)
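As referenced in the auditing bullet, a Log Analytics query sketch using azure-monitor-query; the workspace ID is a placeholder, and the ADFPipelineRun table assumes Data Factory diagnostic logs are routed to the workspace:

```python
# Sketch: run an audit-style KQL query against a Log Analytics workspace.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Count Data Factory pipeline runs by status over the last day
query = """
ADFPipelineRun
| summarize Runs = count() by Status
"""
response = client.query_workspace("<workspace-id>", query,
                                  timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)
```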
Optimize Azure data solutions
Reference Tables
Cosmos DB - Consistency Levels
Level | Description |
---|---|
Strong | Strongest Consistency; Highest Latency; |
Bounded Staleness | Reads lag behind writes by at most k versions of an item or a time interval t |
Session | Scoped to a client session; Guaranteed to read your own writes; |
Consistent Prefix | Reads never see out of order writes |
Eventual | Weakest Consistency; Lowest Latency; Eventual Convergence; |
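A minimal sketch of pinning consistency from the client side with the azure-cosmos Python SDK (a client can relax, but not strengthen, the account's default level); the endpoint, key, and database/container names are placeholders:

```python
# Sketch: Cosmos DB client with Session consistency and a partitioned container.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<account-key>",
                      consistency_level="Session")

db = client.create_database_if_not_exists("orders-db")
container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),  # distribution/partition key
    offer_throughput=400)

container.upsert_item({"id": "1", "customerId": "c-42", "total": 18.50})
```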
Cosmos DB - Entities by API
Entity | SQL | Cassandra | MongoDB | Gremlin | Table |
---|---|---|---|---|---|
Database | Database | Keyspace | Database | Database | N/A |
Container | Container | Table | Collection | Graph | Table |
Item | Document | Row | Document | Node or Edge | Item |
Azure Synapse Analytics - Distributed or Replicated Tables
Type | Great fit for... |
---|---|
Replicated | Small dimension tables (< 2 GB) |
Round Robin (default) | Temporary/staging tables |
Hash | Fact Tables; Large Dimension Tables |
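The same three layouts expressed as dedicated SQL pool DDL, run here through pyodbc; the connection details and table definitions are illustrative placeholders:

```python
# Sketch: one CREATE TABLE per distribution choice in the table above.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=SalesDW;"
    "UID=loader;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Small dimension: replicate a full copy to every compute node
cur.execute("""CREATE TABLE dbo.DimRegion (RegionId INT, Name NVARCHAR(50))
               WITH (DISTRIBUTION = REPLICATE)""")

# Staging: round-robin spreads rows evenly and loads fastest
cur.execute("""CREATE TABLE stg.Sales (SaleId INT, CustomerId INT, Amount MONEY)
               WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)""")

# Large fact: hash-distribute on a high-cardinality join key
cur.execute("""CREATE TABLE dbo.FactSales (SaleId INT, CustomerId INT, Amount MONEY)
               WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)""")
```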
Azure SQL Database - Dynamic Data Masking
Function | Description | Example |
---|---|---|
Default | Full masking according to the data types of the designated fields | Default value (0, xxxx, 01-01-1900) |
Credit Card | Masking method, which exposes the last four digits of the designated fields | XXXX-XXXX-XXXX-1234 |
Email | Masking method, which exposes the first letter and replaces the domain with XXX.com | aXX@XXXX.com |
Random Number | Masking method, which generates a random number | Random number range |
Custom Text | Masking method, which exposes the first and last characters and adds a custom padding string in the middle | Custom string (prefix [padding] suffix) |
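A sketch applying several of the masking functions above with T-SQL issued through pyodbc; the server, table, and column names are placeholders:

```python
# Sketch: add dynamic data masking rules to existing columns.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=CustomerDb;"
    "UID=admin_user;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Email mask: first letter kept, domain replaced with XXX.com
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN Email "
            "ADD MASKED WITH (FUNCTION = 'email()')")
# Default mask: full masking according to the column's data type
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN BirthDate "
            "ADD MASKED WITH (FUNCTION = 'default()')")
# Custom text: keep first and last character, pad the middle
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN Phone "
            "ADD MASKED WITH (FUNCTION = 'partial(1, \"XXXXX\", 1)')")
# Random number within a range (numeric columns only)
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN Discount "
            "ADD MASKED WITH (FUNCTION = 'random(1, 10)')")
```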
Azure SQL Database - Service Tiers
 | Basic | Standard | Premium |
---|---|---|---|
Target Workload | Development & Production | Development & Production | Development & Production |
Uptime SLA | 99.99% | 99.99% | 99.99% |
Maximum Backup Retention | 7 days | 35 days | 35 days |
CPU | Low | Low, Medium, High | Medium, High |
IO Throughput (approximate) | 1-5 IOPS per DTU | 1-5 IOPS per DTU | 25 IOPS per DTU |
IO Latency (approximate) | 5ms (read), 10ms (write) | 5ms (read), 10ms (write) | 2ms (read/write) |
Columnstore indexing | N/A | S3 and above | Supported |
In-memory OLTP | N/A | N/A | Supported |
Azure SQL Database - General Purpose vs Hyperscale
 | General Purpose | Hyperscale |
---|---|---|
Best for... | Budget-oriented; Balanced compute and storage; | Auto-scaling storage size; Vertical and horizontal scaling; |
Storage Type | Premium Remote Storage (per instance) | De-coupled storage with local SSD cache (per instance) |
Database Size | 5 GB - 4 TB | Up to 100 TB |
Log Write Throughput | 1.875 MB/s per vCore (max 30 MB/s) | 100 MB/s |
Azure SQL Database - Security Overview
Layer | Type | Description |
---|---|---|
Network | IP Firewall Rules | Grant access to databases based on the originating IP address of each request. |
Network | Virtual Network Firewall Rules | Only accept communications that are sent from selected subnets inside a virtual network. |
Access Management | SQL Authentication | Authentication of a user when using a username and password. |
Access Management | Azure AD Authentication | Leverage centrally managed identities in Azure Active Directory (Azure AD). |
Authorization | Row-level Security | Control access to rows in a table based on the characteristics of the user/query. |
Threat Protection | Auditing | Tracks database activities by recording events to an audit log in an Azure storage account. |
Threat Protection | Advanced Threat Protection | Analyzes logs to detect unusual and potentially harmful behavior.
Information Protection | Transport Layer Security (TLS) | Encryption-in-transit between client and server. |
Information Protection | Transparent Data Encryption (TDE) | Encryption-at-rest using AES (Azure SQL DB encrypted by default). |
Information Protection | Always Encrypted | Encryption-in-use (Column-level granularity; Decrypted only for processing by client). |
Information Protection | Dynamic Data Masking | Limits sensitive data exposure by masking it to non-privileged users. |
Security Management | Vulnerability Assessment | Discover, track, and help remediate potential database vulnerabilities. |
Security Management | Data Discovery & Classification | Discovering, classifying, labeling, and protecting the sensitive data in your databases. |
Security Management | Compliance | Azure SQL Database has been certified against a number of compliance standards.
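A short sketch of one row from the table, Row-level Security, applied with T-SQL through pyodbc; the predicate, schema, and table names are illustrative:

```python
# Sketch: filter rows so each sales rep sees only their own sales.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=SalesDb;"
    "UID=admin_user;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Predicate function: a row is visible only to its own sales rep
cur.execute("""
CREATE FUNCTION dbo.fn_salesrep_predicate(@SalesRep AS sysname)
RETURNS TABLE WITH SCHEMABINDING
AS RETURN SELECT 1 AS allowed WHERE @SalesRep = USER_NAME()
""")

# Security policy binds the predicate to the table as a filter
cur.execute("""
CREATE SECURITY POLICY dbo.SalesFilter
ADD FILTER PREDICATE dbo.fn_salesrep_predicate(SalesRep) ON dbo.Sales
WITH (STATE = ON)
""")
```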
Data Migration Tools and Services
Tool | Description |
---|---|
Data Migration Assistant (DMA) | Helps you upgrade to a modern data platform by detecting compatibility issues that can impact database functionality in your new version of SQL Server or Azure SQL Database. |
Azure Database Migration Service (DMS) | A fully managed service designed to enable seamless migrations from multiple database sources to Azure data platforms with minimal downtime (online migrations). |
SQL Data Sync | A service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple SQL databases and SQL Server instances. |
Database Experimentation Assistant (DEA) | Helps you evaluate a targeted version of SQL Server for a specific workload. Customers upgrading from earlier versions of SQL Server (starting with 2005) to more recent versions of SQL Server can use the analysis metrics that the tool provides. |
SQL Server Migration Assistant (SSMA) | A tool designed to automate database migration to SQL Server from Microsoft Access, DB2, MySQL, Oracle, and SAP ASE. |
Databricks - Cluster Configurations
 | Standard | High Concurrency |
---|---|---|
Recommended for... | Single User | Multiple Users |
Language Support | SQL, Python, R, and Scala | SQL, Python, and R (not Scala) |
Notebook Isolation | No | Yes |
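A sketch of creating an autoscaling Standard cluster with the Databricks Clusters REST API (`/api/2.0/clusters/create`) via `requests`; the workspace URL, access token, runtime version, and VM size are placeholders:

```python
# Sketch: create a Standard cluster that autoscales between 2 and 8 workers.
import requests

resp = requests.post(
    "https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "etl-standard",
        "spark_version": "6.4.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "autoscale": {"min_workers": 2, "max_workers": 8},
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```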
Azure Storage - Redundancy
Abbreviation | Full Name | Description |
---|---|---|
LRS | Locally Redundant Storage | Replicates your data three times within a single data center. |
ZRS | Zone-Redundant Storage | Replicates your data across three storage clusters in a single region. |
GRS | Geo-Redundant Storage | Replicates your data to a secondary region. Can withstand regional outage. |
RA-GRS | Read-Access Geo-Redundant Storage | Provides read-only access to the data in the secondary location, in addition to GRS. |
GZRS | Geo-Zone-Redundant Storage | Combines zone-redundancy in the primary region (ZRS) with asynchronous geo-replication to a secondary region. |
RA-GZRS | Read-Access Geo-Zone-Redundant Storage | Provides read-only access to the data in the secondary location, in addition to GZRS. |
Azure Storage - Redundancy Scenarios
Scenario | LRS | ZRS | GRS | GZRS |
---|---|---|---|---|
Node Unavailability (within a DC) | Yes | Yes | Yes | Yes |
DC Outage | No | Yes | Yes | Yes |
Region Outage | No | No | Yes | Yes |
Read-Access in the event of a Region Outage | No | No | Yes, with RA-GRS | Yes, with RA-GZRS |
Durability over a given year | 11 9's | 12 9's | 16 9's | 16 9's |
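With RA-GRS/RA-GZRS, the secondary region is readable at the `<account>-secondary` endpoint. A sketch with azure-storage-blob, where the account name and key are placeholders:

```python
# Sketch: read from the secondary endpoint of an RA-GRS account.
from azure.storage.blob import BlobServiceClient

primary = BlobServiceClient(
    account_url="https://mystorage.blob.core.windows.net",
    credential="<account-key>")
secondary = BlobServiceClient(
    account_url="https://mystorage-secondary.blob.core.windows.net",
    credential="<account-key>")

# Reads keep working from the secondary if the primary region is down
blob = secondary.get_blob_client(container="raw-data", blob="2019/12/sales.csv")
data = blob.download_blob().readall()
```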
Azure Data Factory - Triggers
Type | Description |
---|---|
Schedule | Runs on a wall-clock schedule (e.g. every X minutes, hours, days, weeks, or months). |
Tumbling Window | A series of fixed-sized, non-overlapping, and contiguous time intervals. |
Event-based | Runs pipelines in response to an event (e.g. Blob Created/Deleted). |
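A sketch of the Schedule trigger row using newer (track 2) versions of the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names are placeholders:

```python
# Sketch: create and start a Schedule trigger that fires every 15 minutes.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Minute", interval=15,  # wall-clock cadence
        start_time=datetime.utcnow() + timedelta(minutes=1), time_zone="UTC"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="CopySalesPipeline"))])

adf.triggers.create_or_update("my-rg", "my-factory", "Every15Min",
                              TriggerResource(properties=trigger))
adf.triggers.begin_start("my-rg", "my-factory", "Every15Min").result()
```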
Azure Data Factory - Integration Runtime Types
Type | Capability | Network Support |
---|---|---|
Azure | Data Flow, Data Movement, Activity Dispatch | Public |
Self-Hosted | Data Movement, Activity Dispatch | Public or Private |
Azure-SSIS | SSIS Package Execution | Public or Private |