DP-200: Implementing an Azure Data Solution - Exam Prep
A collection of resources, study notes, and learning material that helped me, and can hopefully help others, prepare for and pass exam DP-200: Implementing an Azure Data Solution. Note: Passing DP-200 is one of two steps required to become a Microsoft Certified: Azure Data Engineer Associate; you must pass both DP-200 and DP-201.
Suggested Approach
Briefly skim through and familiarise yourself with the Resources and Skills Measured sections below (Tip: For the Skills Measured, refer to the latest skills outline available directly from the exam home page, as the content changes from time to time. The copy in this article is current as of 31 December 2019; key phrases have been pre-highlighted to aid learning).
Complete the DP-200 Microsoft Learn Collection. This is a custom learning path that will provide decent coverage across the various topics and technologies included in the exam.
Revisit the Skills Measured section below after completing the learning path and dive into the links for any areas that were not covered by the Microsoft Learn collection and/or require a deeper understanding.
Lastly, read over the Reference Tables for succinct summaries of key topics and comparisons.
Resources
Resource | Link |
---|---|
Exam | Exam DP-200: Implementing an Azure Data Solution |
Related Certification | Microsoft Certified: Azure Data Engineer Associate |
Microsoft Learn | Azure Data Engineer Learning Paths |
Hands-On Labs | Microsoft Learning - GitHub: DP-200 Hands-On Labs |
MS Learn Collection | Custom Microsoft Learn Collection for DP-200 |
Skills Measured
Note: The content of the exam was updated on the 4th December 2019 to include Azure Data Explorer and to rename references to “SQL Data Warehouse” as “Azure Synapse Analytics”. Tip: Always refer to the latest skills outline available on the exam home page in case of any recent content changes.
Approximate Split of Content by Section
1. Implement data storage solutions (40-45%)
Implement non-relational data stores
- implement a solution that uses Cosmos DB [1], Data Lake Storage Gen2 [2], or Blob storage [3] (a Blob storage sketch follows this list)
- implement data distribution and partitions [1] [2]
- implement a consistency model in Cosmos DB [1] [2]
- provision a non-relational data store [1] [2]
- provide access to data to meet security requirements [1]
- implement for high availability, disaster recovery, and global distribution [1] [2] [3]
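As referenced in the first bullet above, here is a minimal Blob storage sketch using the azure-storage-blob Python package. The connection string, container, and blob names are placeholders rather than values from any particular environment:

```python
# Minimal sketch: land a local file in Blob storage with azure-storage-blob.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw-data")
container.create_container()  # one-time setup; raises ResourceExistsError if it already exists

# Upload a local file as a block blob under a date-style virtual folder
with open("sales.csv", "rb") as data:
    container.upload_blob(name="2019/12/sales.csv", data=data, overwrite=True)
```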
Implement relational data stores
- configure elastic pools [1]
- configure geo-replication [1]
- provide access to data to meet security requirements [1] [2] [3]
- implement for high availability, disaster recovery, and global distribution [1] [2] [3]
- implement data distribution and partitions for Azure Synapse Analytics [1] [2]
- implement PolyBase [1] [2] (see the PolyBase sketch after this list)
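As flagged in the PolyBase bullet above, the following sketches the documented external-table-plus-CTAS load pattern for a dedicated SQL pool, driven from Python via pyodbc. All server, schema, and storage names are placeholders, and the database-scoped credential required for non-public storage is omitted for brevity:

```python
# Sketch of the PolyBase load pattern for Azure Synapse Analytics, run via pyodbc.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=SalesDW;"
    "UID=loader;PWD=<password>", autocommit=True)
cur = conn.cursor()

# 1. Point PolyBase at files in Data Lake Storage Gen2
cur.execute("""
CREATE EXTERNAL DATA SOURCE LakeSource
WITH (TYPE = HADOOP, LOCATION = 'abfss://raw@mydatalake.dfs.core.windows.net')
""")
cur.execute("""
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2))
""")
cur.execute("""
CREATE EXTERNAL TABLE ext.Sales (SaleId INT, CustomerId INT, Amount MONEY)
WITH (LOCATION = '/sales/', DATA_SOURCE = LakeSource, FILE_FORMAT = CsvFormat)
""")

# 2. Load into a hash-distributed table with CTAS (runs in parallel across nodes)
cur.execute("""
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(CustomerId))
AS SELECT * FROM ext.Sales
""")
```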
Manage data security
2. Manage and develop data processing (25-30%)
Develop batch processing solutions
- develop batch processing solutions by using Data Factory and Azure Databricks [1]
- ingest data by using PolyBase [1] [2]
- implement the integration runtime for Data Factory [1] [2] [3]
- create linked services and datasets [1] [2]
- create pipelines and activities [1]
- create and schedule triggers [1]
- implement Azure Databricks clusters [1], notebooks [2], jobs [3], and autoscaling [4]
- ingest data into Azure Databricks [1] (see the PySpark sketch after this list)
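As noted in the last bullet, here is a minimal PySpark ingestion sketch, assuming it runs inside a Databricks notebook (where `spark` and `dbutils` are predefined) and that the storage account key is held in a secret scope; all account, container, and table names are placeholders:

```python
# Sketch: read CSV files from ADLS Gen2 into Databricks and persist as a Delta table.
spark.conf.set(
    "fs.azure.account.key.mydatalake.dfs.core.windows.net",
    dbutils.secrets.get(scope="lake", key="account-key"))

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/"))

df.write.format("delta").mode("overwrite").saveAsTable("bronze_sales")
```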
Develop streaming solutions
- configure input and output [1] [2]
- select the appropriate windowing functions [1] [2]
- implement event processing by using Stream Analytics [1] [2]
- ingest and query streaming data with Azure Data Explorer [1] [2] (see the query sketch after this list)
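As referenced in the Azure Data Explorer bullet, a small query sketch using the azure-kusto-data package; the cluster URI, database, and table names are placeholders:

```python
# Sketch: query streaming data landed in Azure Data Explorer with azure-kusto-data.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://mycluster.westeurope.kusto.windows.net")
client = KustoClient(kcsb)

# KQL: events per minute over the last hour
query = """
Telemetry
| where Timestamp > ago(1h)
| summarize Events = count() by bin(Timestamp, 1m)
"""
response = client.execute("TelemetryDb", query)
for row in response.primary_results[0]:
    print(row["Timestamp"], row["Events"])
```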
3. Monitor and optimize data solutions (30-35%)
Monitor data storage
- monitor relational and non-relational data sources
- implement Blob storage monitoring [1] [2] (see the metrics sketch after this list)
- implement Data Lake Store monitoring [1]
- implement SQL Database monitoring [1] [2]
- implement Azure Synapse Analytics monitoring [1]
- implement Cosmos DB monitoring [1] [2]
- implement Azure Data Explorer monitoring [1] [2]
- configure Azure Monitor alerts [1]
- implement auditing by using Azure Log Analytics [1]
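As flagged in the Blob storage monitoring bullet, a sketch that pulls platform metrics with the azure-monitor-query and azure-identity packages; the resource ID segments are placeholders:

```python
# Sketch: query Azure Monitor platform metrics for a storage account.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())
resource_id = ("/subscriptions/<sub>/resourceGroups/<rg>/providers/"
               "Microsoft.Storage/storageAccounts/mystorage")

result = client.query_resource(
    resource_id,
    metric_names=["Transactions", "Availability"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5))

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total, point.average)
```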
Monitor data processing
- design and implement Data Factory monitoring [1] [2]
- monitor Azure Databricks [1] [2]
- monitor Stream Analytics [1] [2]
- configure Azure Monitor alerts [1]
- implement auditing by using Azure Log Analytics [1] (see the log query sketch after this list)
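As referenced in the auditing bullet, a Log Analytics query sketch using azure-monitor-query; the workspace ID is a placeholder, and the ADFPipelineRun table assumes Data Factory diagnostic logs are routed to the workspace:

```python
# Sketch: run an audit-style KQL query against a Log Analytics workspace.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Count Data Factory pipeline runs by status over the last day
query = """
ADFPipelineRun
| summarize Runs = count() by Status
"""
response = client.query_workspace("<workspace-id>", query,
                                  timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)
```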
Optimize Azure data solutions
Reference Tables
Cosmos DB - Consistency Levels
Level | Description |
---|---|
Strong | Strongest Consistency; Highest Latency; |
Bounded Staleness | Reads lag behind writes by at most k versions of an item or a time interval t |
Session | Scoped to a client session; Guaranteed to read your own writes; |
Consistent Prefix | Reads never see out of order writes |
Eventual | Weakest Consistency; Lowest Latency; Eventual Convergence; |
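A minimal sketch of pinning consistency from the client side with the azure-cosmos Python SDK (a client can relax, but not strengthen, the account's default level); the endpoint, key, and database/container names are placeholders:

```python
# Sketch: Cosmos DB client with Session consistency and a partitioned container.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<account-key>",
                      consistency_level="Session")

db = client.create_database_if_not_exists("orders-db")
container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),  # distribution/partition key
    offer_throughput=400)

container.upsert_item({"id": "1", "customerId": "c-42", "total": 18.50})
```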
Cosmos DB - Entities by API
Entity | SQL | Cassandra | MongoDB | Gremlin | Table |
---|---|---|---|---|---|
Database | Database | Keyspace | Database | Database | N/A |
Container | Container | Table | Collection | Graph | Table |
Item | Document | Row | Document | Node or Edge | Item |
Azure Synapse Analytics - Distributed or Replicated Tables
Type | Great fit for... |
---|---|
Replicated | Small dimension tables (< 2 GB) |
Round Robin (default) | Temporary/staging tables |
Hash | Fact Tables; Large Dimension Tables |
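The same three layouts expressed as dedicated SQL pool DDL, run here through pyodbc; the connection details and table definitions are illustrative placeholders:

```python
# Sketch: one CREATE TABLE per distribution choice in the table above.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=SalesDW;"
    "UID=loader;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Small dimension: replicate a full copy to every compute node
cur.execute("""CREATE TABLE dbo.DimRegion (RegionId INT, Name NVARCHAR(50))
               WITH (DISTRIBUTION = REPLICATE)""")

# Staging: round-robin spreads rows evenly and loads fastest
cur.execute("""CREATE TABLE stg.Sales (SaleId INT, CustomerId INT, Amount MONEY)
               WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)""")

# Large fact: hash-distribute on a high-cardinality join key
cur.execute("""CREATE TABLE dbo.FactSales (SaleId INT, CustomerId INT, Amount MONEY)
               WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)""")
```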
Azure SQL Database - Dynamic Data Masking
Function | Description | Example |
---|---|---|
Default | Full masking according to the data types of the designated fields | Default value (0, xxxx, 01-01-1900) |
Credit Card | Masking method, which exposes the last four digits of the designated fields | XXXX-XXXX-XXXX-1234 |
Email | Masking method, which exposes the first letter and replaces the domain with XXX.com | aXX@XXXX.com |
Random Number | Masking method, which generates a random number | Random number range |
Custom Text | Masking method, which exposes the first and last characters and adds a custom padding string in the middle | Custom string (prefix [padding] suffix) |
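A sketch applying several of the masking functions above with T-SQL issued through pyodbc; the server, table, and column names are placeholders:

```python
# Sketch: add dynamic data masking rules to existing columns.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=CustomerDb;"
    "UID=admin_user;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Email mask: first letter kept, domain replaced with XXX.com
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN Email "
            "ADD MASKED WITH (FUNCTION = 'email()')")
# Default mask: full masking according to the column's data type
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN BirthDate "
            "ADD MASKED WITH (FUNCTION = 'default()')")
# Custom text: keep first and last character, pad the middle
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN Phone "
            "ADD MASKED WITH (FUNCTION = 'partial(1, \"XXXXX\", 1)')")
# Random number within a range (numeric columns only)
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN Discount "
            "ADD MASKED WITH (FUNCTION = 'random(1, 10)')")
```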
Azure SQL Database - Service Tiers
 | Basic | Standard | Premium |
---|---|---|---|
Target Workload | Development & Production | Development & Production | Development & Production |
Uptime SLA | 99.99% | 99.99% | 99.99% |
Maximum Backup Retention | 7 days | 35 days | 35 days |
CPU | Low | Low, Medium, High | Medium, High |
IO Throughput (approximate) | 1-5 IOPS per DTU | 1-5 IOPS per DTU | 25 IOPS per DTU |
IO Latency (approximate) | 5ms (read), 10ms (write) | 5ms (read), 10ms (write) | 2ms (read/write) |
Columnstore indexing | N/A | S3 and above | Supported |
In-memory OLTP | N/A | N/A | Supported |
Azure SQL Database - General Purpose vs Hyperscale
 | General Purpose | Hyperscale |
---|---|---|
Best for... | Budget-oriented; Balanced compute and storage; | Auto-scaling storage size; Vertical and horizontal scaling; |
Storage Type | Premium Remote Storage (per instance) | De-coupled storage with local SSD cache (per instance) |
Database Size | 5 GB - 4 TB | Up to 100 TB |
Log Write Throughput | 1.875 MB/s per vCore (max 30 MB/s) | 100 MB/s |
Azure SQL Database - Security Overview
Layer | Type | Description |
---|---|---|
Network | IP Firewall Rules | Grant access to databases based on the originating IP address of each request. |
Network | Virtual Network Firewall Rules | Only accept communications that are sent from selected subnets inside a virtual network. |
Access Management | SQL Authentication | Authentication of a user when using a username and password. |
Access Management | Azure AD Authentication | Leverage centrally managed identities in Azure Active Directory (Azure AD). |
Authorization | Row-level Security | Control access to rows in a table based on the characteristics of the user/query. |
Threat Protection | Auditing | Tracks database activities by recording events to an audit log in an Azure storage account. |
Threat Protection | Advanced Threat Protection | Analyzes logs to detect unusual and potentially harmful behavior.
Information Protection | Transport Layer Security (TLS) | Encryption-in-transit between client and server. |
Information Protection | Transparent Data Encryption (TDE) | Encryption-at-rest using AES (Azure SQL DB encrypted by default). |
Information Protection | Always Encrypted | Encryption-in-use (Column-level granularity; Decrypted only for processing by client). |
Information Protection | Dynamic Data Masking | Limits sensitive data exposure by masking it to non-privileged users. |
Security Management | Vulnerability Assessment | Discover, track, and help remediate potential database vulnerabilities. |
Security Management | Data Discovery & Classification | Discovering, classifying, labeling, and protecting the sensitive data in your databases. |
Security Management | Compliance | Azure SQL Database has been certified against a number of compliance standards.
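A short sketch of one row from the table, Row-level Security, applied with T-SQL through pyodbc; the predicate, schema, and table names are illustrative:

```python
# Sketch: filter rows so each sales rep sees only their own sales.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=SalesDb;"
    "UID=admin_user;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Predicate function: a row is visible only to its own sales rep
cur.execute("""
CREATE FUNCTION dbo.fn_salesrep_predicate(@SalesRep AS sysname)
RETURNS TABLE WITH SCHEMABINDING
AS RETURN SELECT 1 AS allowed WHERE @SalesRep = USER_NAME()
""")

# Security policy binds the predicate to the table as a filter
cur.execute("""
CREATE SECURITY POLICY dbo.SalesFilter
ADD FILTER PREDICATE dbo.fn_salesrep_predicate(SalesRep) ON dbo.Sales
WITH (STATE = ON)
""")
```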
Data Migration Tools and Services
Tool | Description |
---|---|
Data Migration Assistant (DMA) | Helps you upgrade to a modern data platform by detecting compatibility issues that can impact database functionality in your new version of SQL Server or Azure SQL Database. |
Azure Database Migration Service (DMS) | A fully managed service designed to enable seamless migrations from multiple database sources to Azure data platforms with minimal downtime (online migrations). |
SQL Data Sync | A service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple SQL databases and SQL Server instances. |
Database Experimentation Assistant (DEA) | Helps you evaluate a targeted version of SQL Server for a specific workload. Customers upgrading from earlier versions of SQL Server (starting with 2005) to more recent versions of SQL Server can use the analysis metrics that the tool provides. |
SQL Server Migration Assistant (SSMA) | A tool designed to automate database migration to SQL Server from Microsoft Access, DB2, MySQL, Oracle, and SAP ASE. |
Databricks - Cluster Configurations
 | Standard | High Concurrency |
---|---|---|
Recommended for... | Single User | Multiple Users |
Language Support | SQL, Python, R, and Scala | SQL, Python, and R (not Scala) |
Notebook Isolation | No | Yes |
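A sketch of creating an autoscaling Standard cluster with the Databricks Clusters REST API (`/api/2.0/clusters/create`) via `requests`; the workspace URL, access token, runtime version, and VM size are placeholders:

```python
# Sketch: create a Standard cluster that autoscales between 2 and 8 workers.
import requests

resp = requests.post(
    "https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "etl-standard",
        "spark_version": "6.4.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "autoscale": {"min_workers": 2, "max_workers": 8},
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```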
Azure Storage - Redundancy
Abbreviation | Full Name | Description |
---|---|---|
LRS | Locally Redundant Storage | Replicates your data three times within a single data center. |
ZRS | Zone-Redundant Storage | Replicates your data across three storage clusters in a single region. |
GRS | Geo-Redundant Storage | Replicates your data to a secondary region. Can withstand regional outage. |
RA-GRS | Read-Access Geo-Redundant Storage | Provides read-only access to the data in the secondary location, in addition to GRS. |
GZRS | Geo-Zone-Redundant Storage | Combines zone-redundancy in the primary region (ZRS) with asynchronous geo-replication to a secondary region. |
RA-GZRS | Read-Access Geo-Zone-Redundant Storage | Provides read-only access to the data in the secondary location, in addition to GZRS. |
Azure Storage - Redundancy Scenarios
Scenario | LRS | ZRS | GRS | GZRS |
---|---|---|---|---|
Node Unavailability (within a DC) | Yes | Yes | Yes | Yes |
DC Outage | No | Yes | Yes | Yes |
Region Outage | No | No | Yes | Yes |
Read-Access in the event of a Region Outage | No | No | Yes, with RA-GRS | Yes, with RA-GZRS |
Durability over a given year | 11 9's | 12 9's | 16 9's | 16 9's |
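With RA-GRS/RA-GZRS, the secondary region is readable at the `<account>-secondary` endpoint. A sketch with azure-storage-blob, where the account name and key are placeholders:

```python
# Sketch: read from the secondary endpoint of an RA-GRS account.
from azure.storage.blob import BlobServiceClient

primary = BlobServiceClient(
    account_url="https://mystorage.blob.core.windows.net",
    credential="<account-key>")
secondary = BlobServiceClient(
    account_url="https://mystorage-secondary.blob.core.windows.net",
    credential="<account-key>")

# Reads keep working from the secondary if the primary region is down
blob = secondary.get_blob_client(container="raw-data", blob="2019/12/sales.csv")
data = blob.download_blob().readall()
```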
Azure Data Factory - Triggers
Type | Description |
---|---|
Schedule | Runs on a wall-clock schedule (e.g. every X minutes, hours, days, weeks, or months). |
Tumbling Window | A series of fixed-sized, non-overlapping, and contiguous time intervals. |
Event-based | Runs pipelines in response to an event (e.g. Blob Created/Deleted). |
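A sketch of the Schedule trigger row using newer (track 2) versions of the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names are placeholders:

```python
# Sketch: create and start a Schedule trigger that fires every 15 minutes.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Minute", interval=15,  # wall-clock cadence
        start_time=datetime.utcnow() + timedelta(minutes=1), time_zone="UTC"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="CopySalesPipeline"))])

adf.triggers.create_or_update("my-rg", "my-factory", "Every15Min",
                              TriggerResource(properties=trigger))
adf.triggers.begin_start("my-rg", "my-factory", "Every15Min").result()
```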
Azure Data Factory - Integration Runtime Types
Type | Capability | Network Support |
---|---|---|
Azure | Data Flow, Data Movement, Activity Dispatch | Public |
Self-Hosted | Data Movement, Activity Dispatch | Public or Private |
Azure-SSIS | SSIS Package Execution | Public or Private |